Distributed framework for web crawling and information search

Mehrjardi, Mohammad Aghinanejad (2011) Distributed framework for web crawling and information search. (MSc(R) thesis), Kingston University.


Web search engines play a vital role in both the virtual and the real world. Over the past few decades, search engines have improved in many respects to give users more satisfying results. They are very large-scale software applications, normally distributed across several computers to enhance overall system performance. To gather web documents, search engines use software applications called Crawlers, which explore the web and find web documents. This thesis focuses on designing a fully distributed framework for Crawlers that maximizes performance while minimizing system downtime by decoupling its different parts. In the proposed framework, the parts of the system work independently, so that the failure of one part does not interrupt the operation of the whole system. The framework has been designed using a Client-Server architecture and implemented using the .NET Framework, with C# as its programming language. Tasks are distributed to computers called Worker nodes, and the results of processing are sent back to the main server. Data transformation and communication between Worker nodes and the server have been implemented with .NET Remoting, where a port is assigned to a piece of software and any request sent to that port is forwarded to that software for processing. To demonstrate the benefits of the framework, a number of tests have been carried out by processing a large number of URLs using different numbers of nodes.
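
The task distribution and fault isolation described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: it is written in Python rather than C#, plain in-process method calls stand in for .NET Remoting, and the names `Worker`, `Server`, `fail_on` and the example URLs are hypothetical. It only shows the general idea of a server handing URLs to Worker nodes and continuing when one node fails.

```python
from collections import deque


class Worker:
    """Hypothetical Worker node; fail_on simulates a node crashing on a URL."""

    def __init__(self, name, fail_on=()):
        self.name = name
        self.fail_on = set(fail_on)

    def process(self, url):
        # A real node would download and parse the page; here we only
        # return a string so the flow of tasks and results is visible.
        if url in self.fail_on:
            raise RuntimeError(f"{self.name} failed on {url}")
        return f"{self.name} fetched {url}"


class Server:
    """Hypothetical main server: hands out URLs and collects results."""

    def __init__(self, workers):
        self.workers = list(workers)
        self.results = {}

    def run(self, urls):
        pending = deque(urls)
        turn = 0
        while pending and self.workers:
            url = pending.popleft()
            # Round-robin assignment over the nodes that are still alive.
            worker = self.workers[turn % len(self.workers)]
            turn += 1
            try:
                self.results[url] = worker.process(url)
            except RuntimeError:
                # Assumption: a failed node is dropped and its task is
                # re-queued, so the rest of the system keeps working.
                self.workers.remove(worker)
                pending.append(url)
        return self.results


if __name__ == "__main__":
    urls = [f"http://example.com/{i}" for i in (1, 2, 3)]
    nodes = [Worker("node-a"), Worker("node-b", fail_on={urls[1]})]
    for url, result in Server(nodes).run(urls).items():
        print(url, "->", result)
```

In this sketch the second node fails on one URL, is removed from the pool, and the remaining node picks up the re-queued task, so every URL is still processed.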
