Abstract: We propose a novel web page segmentation algorithm based on finding the Gomory-Hu tree of a planar graph. The algorithm first extracts visual and structural information from a web page to construct a weighted undirected graph whose vertices are the leaf nodes of the DOM tree and whose edges represent the visible positional relationships between vertices. It then partitions the graph with a Gomory-Hu tree based clustering algorithm. Experimental results show that our algorithm achieves much higher precision and recall than both VIPS and Chakrabarti et al.’s graph-theoretic algorithm, and that its running time is far lower than that of Chakrabarti et al.’s graph-theoretic algorithm.
Keywords: Web page segmentation; DOM tree; Gomory-Hu tree; Planar graph
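
To make the partitioning step concrete, the following is a minimal sketch of clustering a small weighted undirected graph through its Gomory-Hu tree. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the tree is built with Gusfield's construction on top of an Edmonds-Karp maximum flow, and clusters are formed by deleting the k-1 lightest tree edges, which is one common rule and may differ from the clustering rule used here. The function names `min_cut`, `gomory_hu_tree`, and `clusters`, the parameter `k`, and the toy edge weights are all hypothetical.

```python
from collections import deque


def min_cut(n, edges, s, t):
    """Return (max-flow value, vertices on the s side of a minimum s-t cut)."""
    # Residual capacities; an undirected edge has capacity w in both directions.
    res = [dict() for _ in range(n)]
    for u, v, w in edges:
        res[u][v] = res[u].get(v, 0) + w
        res[v][u] = res[v].get(u, 0) + w
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path (Edmonds-Karp).
        prev = {s: None}
        queue = deque([s])
        while queue and t not in prev:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in prev:
                    prev[v] = u
                    queue.append(v)
        if t not in prev:
            # No augmenting path left: prev holds everything reachable from s.
            return flow, set(prev)
        # Walk back from t to s, find the bottleneck, and push flow along it.
        path, v = [], t
        while prev[v] is not None:
            path.append((prev[v], v))
            v = prev[v]
        push = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= push
            res[v][u] += push
        flow += push


def gomory_hu_tree(n, edges):
    """Gusfield's construction: n - 1 minimum cuts on the original graph."""
    parent = [0] * n
    tree = []
    for i in range(1, n):
        value, side = min_cut(n, edges, i, parent[i])
        # Later vertices on i's side that shared i's parent now hang under i.
        for j in range(i + 1, n):
            if j in side and parent[j] == parent[i]:
                parent[j] = i
        tree.append((i, parent[i], value))
    return tree


def clusters(n, tree, k):
    """Assumed rule: drop the k - 1 lightest tree edges, return the components."""
    keep = sorted(tree, key=lambda e: e[2], reverse=True)[: n - k]
    root = list(range(n))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    for u, v, _ in keep:
        root[find(u)] = find(v)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())


# Toy graph: vertices stand for four DOM leaf nodes, weights are made up.
edges = [(0, 1, 5), (1, 2, 1), (2, 3, 5)]
tree = gomory_hu_tree(4, edges)
print(clusters(4, tree, 2))  # [[0, 1], [2, 3]]
```

In this toy graph the weak edge between vertices 1 and 2 becomes the lightest tree edge, so removing it separates the two tightly connected pairs. In a real page the weights would come from the visual and structural cues described in the abstract.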