Interactive generation of uniformly random samples of World Wide Web pages

Walker, Andrew Mitchell (2003) Interactive generation of uniformly random samples of World Wide Web pages. (MSc(R) thesis), Kingston University.

Abstract

The size and complexity of the World Wide Web mean that, for all practical purposes, it is impossible to have information about the content of every web page in existence. Hence, to learn about the structure of the web and the characteristics of the documents accessible through it, it is necessary to devise a means of collecting a random sample of documents to study. Presented here are the results I obtained from the study of a small sample of pages, together with details of the design of the random web crawler, named Alienbot, that I used to collect the sample. Alienbot employs a unique crawling strategy, which uses the blind interaction of a user to randomly select the pages to be crawled. The user interacts with the crawler through an interface based on the classic game Space Invaders, in which the action of shooting aliens determines the pages to be crawled. The sample of pages collected on the random web crawl performed by Alienbot is then used to estimate average properties of pages on the World Wide Web, examining elements such as links and images. The study of link usage is also extended to a previously unstudied area of the web, a type of page known as a weblog [30]. The link usage on weblogs is compared and contrasted with that found on a typical page. I also use the data set to examine the extent of the use of valid HTML/XHTML mark-up, to see how quickly web page authors are adopting new web standards. The data set collected by Alienbot shows some interesting results: on most pages the majority of links are to different pages within the same website, and when people do link to a different website they are more likely to link to its homepage than to a page deeper within the site. In general, weblogs appear to exhibit significantly different link characteristics to other pages in the sample; in particular, weblog homepages appear to be much more richly linked than the homepages of other sites.
It is also found that, within the sample collected by Alienbot, the use of valid mark-up is not common, with most pages that could be validated exhibiting many errors.
