Journal: International Journal of Computer Science and Information Technologies
Electronic ISSN: 0975-9646
Publication year: 2016
Volume: 7
Issue: 4
Pages: 2068-2070
Publisher: TechScience Publications
Abstract: The Internet presents a huge collection of useful information, so techniques for extracting information from web documents have become an active research area. Data extraction is the process of retrieving data from data sources for further data processing or data migration. The proposed technique works on two or more web documents generated by the same server-side template and learns a regular expression that models the template and can later be used to extract data from similar documents. The technique builds on shared patterns, introduced by the template, that do not provide any relevant data. The proposed technique will be compared with others in the literature on a large collection of web documents.
Keywords: Web Data Extraction; Automatic wrapper generation; Web Crawler; Unsupervised learning