
Versus is a diploma thesis at the University of Applied Sciences Rapperswil, Switzerland (www.hsr.ch).
The Versus project compares two web page sampling methods.
If one could choose web pages at random (“random sampling”), one could compute interesting statistics about the Internet: for example, how web pages are distributed across the various top-level domains, or what percentage of the web is indexed by a particular search engine.
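As an illustration of the first kind of statistic, the following minimal Python sketch estimates the share of each top-level domain in a sample of URLs. The function name `tld_shares` and the list `SAMPLE` are hypothetical and not part of Versus; the URLs are simply those mentioned on this page, not an actual random sample.

```python
from collections import Counter
from urllib.parse import urlparse


def tld_shares(urls):
    """Return the fraction of sampled URLs that fall in each top-level domain."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname
        # Skip URLs without a dotted host name (e.g. malformed or local URLs).
        if host and "." in host:
            counts[host.rsplit(".", 1)[1]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tld: count / total for tld, count in counts.items()}


# Hypothetical stand-in for a random sample of web pages.
SAMPLE = [
    "http://www.hsr.ch/",
    "http://versus.integis.ch/",
    "http://www.robotstxt.org/wc/exclusion.html",
    "http://www.yoursite.org/robots.txt",
]

if __name__ == "__main__":
    for tld, share in sorted(tld_shares(SAMPLE).items()):
        print(f".{tld}: {share:.0%}")
```

Such an estimate is only as good as the sample: if the sampling method is biased, the estimated shares are biased as well.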
Methods for sampling web pages have limitations (biases). There are always sources of error that can reduce the accuracy of the results.
Two methods for “random sampling” based on “random walks” on the web graph have been published.
The two papers evaluated their methods in different ways, so their test results are not comparable. The main goal of the Versus project is therefore to compare these two web page sampling methods.
Versus follows the Robots Exclusion Protocol (http://www.robotstxt.org/wc/exclusion.html). Before attempting to download any document from a site (say www.yoursite.org), Versus will attempt to download the document at the URL http://www.yoursite.org/robots.txt. The robots.txt file is created by the webmaster. It contains a set of rules indicating which parts of the site are off-limits to web crawlers.
The name of the user agent is:
versus 0.2 (+http://versus.integis.ch)
Here is a robots.txt file that would prevent Versus from visiting any pages on your site:
User-Agent: versus
Disallow: /
To prevent all web crawlers from accessing your site, use the following robots.txt file instead:
User-Agent: *
Disallow: /
Remember that if Versus can't access your robots.txt file, the crawler has no way of knowing that it should stay out. So, once you have created the file, make sure that it is visible by pointing your browser at the URL http://www.yoursite.org/robots.txt (replace www.yoursite.org with the name of your site).
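The two robots.txt files shown above can also be checked offline before you publish them. The following is a minimal sketch using Python's standard `urllib.robotparser` module; it is not how Versus itself evaluates the rules, and the helper `is_allowed`, the constants, and the crawler name `otherbot` are hypothetical names chosen for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines; no network access.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The two example files from this page.
BLOCK_VERSUS = "User-Agent: versus\nDisallow: /\n"
BLOCK_ALL = "User-Agent: *\nDisallow: /\n"

# Placeholder page on the example site.
PAGE = "http://www.yoursite.org/index.html"

if __name__ == "__main__":
    print(is_allowed(BLOCK_VERSUS, "versus", PAGE))    # Versus is kept out
    print(is_allowed(BLOCK_VERSUS, "otherbot", PAGE))  # other crawlers are not
    print(is_allowed(BLOCK_ALL, "otherbot", PAGE))     # every crawler is kept out
```

This only checks the rules themselves. Whether the file is actually reachable at http://www.yoursite.org/robots.txt still has to be verified against the live site.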
