
Basic Article Information

  • Title: Web Mining Based Distributed Crawling with Instant Backup Supports
  • Author: Yogesh Pawar
  • Journal: International Journal of Computer Science & Technology
  • Print ISSN: 2229-4333
  • Electronic ISSN: 0976-8491
  • Publication year: 2012
  • Volume: 3
  • Issue: 1
  • Pages: 311-315
  • Language: English
  • Publisher: Ayushmaan Technologies
  • Abstract: The World Wide Web is growing rapidly, and data in the present-day scenario is stored in a distributed manner, which creates the need for a search-engine-based architectural model that lets people search the Web. Broad web search engines, as well as many more specialized search tools, rely on web crawlers to acquire large collections of pages for indexing and analysis. The crawler is an important module of a web search engine, and its quality directly affects the search quality of the engine. Such a web crawler may interact with millions of hosts over a period of weeks or months, so issues of robustness, flexibility, and manageability are of major importance. Given some seed URLs, the crawler should retrieve the web pages at those URLs, parse the HTML files, add newly discovered URLs to its queue, and return to the first phase of this cycle. While parsing the HTML files for new URLs, the crawler can also extract other information from them. In this paper, we describe the design of a web crawler that uses the PageRank algorithm for distributed searches and can be run on a network of workstations. The crawler first collects the stop words (such as "a", "an", "the", and "and"); when searching web pages for a keyword, it removes all collected stop words and at the same time gathers snippets from the web documents. All matching words and collected snippets are stored in a temporary cache created at the crawlers' central server, where a page-rank algorithm based on the number of visits to each page arranges the pages by rank before the results are displayed. Because extensive crawling of the Web raises the chance of virus attacks, and the system's processing capacity may come to a halt, we provide a backup for the system by creating web services: each web service is designed so that any valid update to a database server automatically updates the backup servers. Therefore, even if a server fails, the crawling process can continue. (Illustrative code sketches of these steps follow below.)
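
The crawl cycle the abstract describes (retrieve a page, parse the HTML, enqueue newly found URLs, and return to the first phase) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `crawl` function, the seed list, and the `max_pages` limit are assumptions introduced here.

```python
# Minimal sketch of the fetch-parse-enqueue crawl cycle from the abstract.
# The seed URLs, depth limit, and helper names are illustrative assumptions.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href targets from anchor tags while a page is parsed."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100):
    queue = deque(seed_urls)           # URLs waiting to be fetched
    seen = set(seed_urls)              # avoid revisiting the same page
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                   # skip unreachable hosts
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:   # new URL: back to the first phase
                seen.add(absolute)
                queue.append(absolute)
        yield url, html                # hand the page off for indexing/analysis
```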
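The matching stage removes stop words from the query, gathers a snippet around each hit, and orders cached pages by their visit counts. A rough sketch under those assumptions follows; the `STOP_WORDS` set, the snippet window, and the visit-count dictionary are placeholders, not the paper's exact data structures.

```python
# Sketch of the matching stage: strip stop words, collect a snippet around
# the first keyword match, and rank pages by recorded visit counts.
# STOP_WORDS and the ranking inputs are simplifying assumptions.
import re

STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in"}  # assumed list

def remove_stop_words(query):
    """Drop stop words from the user's query before matching."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]

def snippet(text, keyword, window=60):
    """Return a short excerpt of the page text around the first match."""
    match = re.search(re.escape(keyword), text, re.IGNORECASE)
    if not match:
        return None
    start = max(match.start() - window, 0)
    return text[start:match.end() + window]

def rank_results(cache, visit_counts):
    """cache maps url -> snippet; order results by number of visits."""
    return sorted(cache.items(),
                  key=lambda item: visit_counts.get(item[0], 0),
                  reverse=True)
```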
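For the instant-backup scheme, every valid update applied to a database server is propagated to the backup servers so that crawling can continue after a failure. The sketch below assumes an HTTP transport, a `/replicate` endpoint, and placeholder backup hosts; none of these details come from the paper.

```python
# Sketch of the instant-backup idea: apply each valid update locally, then
# forward the same update to every backup server. The /replicate endpoint,
# JSON payload, and host names are illustrative assumptions.
import json
from urllib.request import Request, urlopen

BACKUP_SERVERS = ["http://backup1.example/replicate",
                  "http://backup2.example/replicate"]  # placeholder hosts

def apply_update(primary_db, record):
    """Write to the primary store, then mirror the update to all backups."""
    primary_db[record["url"]] = record          # local (primary) write
    payload = json.dumps(record).encode("utf-8")
    for server in BACKUP_SERVERS:
        try:
            req = Request(server, data=payload,
                          headers={"Content-Type": "application/json"})
            urlopen(req, timeout=5)             # best-effort replication
        except OSError:
            pass  # a failed backup must not block the crawl itself
```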