Journal: International Journal of Computer Science & Technology
Print ISSN: 2229-4333
Online ISSN: 0976-8491
Year: 2012
Volume: 3
Issue: 1 Ver. 2
Publisher: Ayushmaan Technologies
Abstract: The World Wide Web is growing rapidly, and data today is stored in a distributed manner, creating the need for a search-engine-based architectural model that lets people search the Web. Broad web search engines, as well as many specialized search tools, rely on web crawlers to acquire large collections of pages for indexing and analysis. The crawler is an important module of a web search engine, and its quality directly affects the search quality of the engine. Such a crawler may interact with millions of hosts over a period of weeks or months, so robustness, flexibility, and manageability are of major importance. Given a set of seed URLs, the crawler retrieves the corresponding web pages, parses the HTML files, adds newly discovered URLs to its queue, and returns to the first phase of this cycle. While parsing the HTML to extract new URLs, the crawler can also retrieve other information from the files. In this paper, we describe the design of a web crawler that uses the PageRank algorithm for distributed searches and can run on a network of workstations. The crawler first collects the stop words (such as a, an, the, and). When searching web pages for a keyword, it removes all collected stop words and, at the same time, extracts snippets from the web documents. All matching words and collected snippets are stored in a temporary cache at the crawlers' central server, where a PageRank-style algorithm based on the number of visits to each web page is applied to order the pages by rank and display the results. Because extensive crawling of the web increases the risk of virus attacks and may exhaust the system's processing capacity, we provide a backup for the system through web services. The web service is designed so that any valid update to a database server automatically propagates to the backup servers; therefore, even if a server fails, the crawling process can continue.
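The crawl cycle the abstract describes (fetch a page, parse its HTML, enqueue newly found URLs, strip stop words, and collect a snippet) can be summarized in code. Below is a minimal single-threaded Python sketch of that cycle; the stop-word list, snippet length, and all function names are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the crawl cycle from the abstract: fetch a page,
# parse its HTML for new URLs, strip stop words from its text, collect
# a snippet, and cache the result. Names and limits here are assumed
# for illustration; the paper's actual implementation is not shown.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in"}  # assumed list
SNIPPET_LEN = 160  # assumed snippet length in characters


class LinkAndTextParser(HTMLParser):
    """Collects href targets and visible text from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links, self.text_parts = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def crawl(seed_urls, max_pages=10):
    """Breadth-first crawl; returns a {url: (keywords, snippet)} cache."""
    queue, seen, cache = deque(seed_urls), set(seed_urls), {}
    while queue and len(cache) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip unreachable hosts; robustness matters at scale
        parser = LinkAndTextParser()
        parser.feed(html)
        text = " ".join(" ".join(parser.text_parts).split())
        # Stop-word removal before indexing, as the abstract describes.
        keywords = [w for w in text.lower().split() if w not in STOP_WORDS]
        cache[url] = (keywords, text[:SNIPPET_LEN])  # snippet for display
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)  # back to the first phase of the cycle
    return cache
```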
Keywords: Data Mining; Web Mining; Page Rank; Crawler; Web Services; Security
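The ranking step described in the abstract, where cached pages are ordered by their number of visits before results are displayed, might look like the following sketch. The cache layout and the visit-count store are assumed for illustration; the paper's distributed central-server implementation is not shown.

```python
# Illustrative sketch of the visit-count-based ranking step from the
# abstract: pages in the central cache that match the query are ordered
# by how often they have been visited, highest rank first. The visit
# counts and cache structure are assumptions, not from the paper.
from typing import Dict, List, Tuple


def rank_results(
    cache: Dict[str, Tuple[List[str], str]],
    visit_counts: Dict[str, int],
    query: str,
) -> List[Tuple[str, str]]:
    """Return (url, snippet) pairs matching `query`, highest rank first."""
    matches = [
        (url, snippet)
        for url, (keywords, snippet) in cache.items()
        if query.lower() in keywords
    ]
    # Rank = number of recorded visits; ties keep arbitrary dict order.
    matches.sort(key=lambda item: visit_counts.get(item[0], 0), reverse=True)
    return matches


if __name__ == "__main__":
    # Hypothetical cache and visit counts, only to exercise the function.
    cache = {
        "http://example.com/a": (["crawler", "design"], "A crawler design..."),
        "http://example.com/b": (["crawler", "rank"], "Ranking pages..."),
    }
    visits = {"http://example.com/a": 3, "http://example.com/b": 7}
    for url, snippet in rank_results(cache, visits, "crawler"):
        print(url, "->", snippet)
```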