Topical Web Crawling for Domain-Specific Resource Discovery Enhanced by Selectively Using Link-Context

 

Lu Liu1, 2, Tao Peng1, 2, 3, and Wanli Zuo1, 3

1College of Computer Science and Technology, Jilin University, China

2Department of Computer Science, University of Illinois at Urbana-Champaign, USA

3Key Laboratory of Symbol Computation and Knowledge Engineering of the Ministry of Education, China

   

  Abstract: In topical Web crawling, link-context provides critical contextual information about anchor text for retrieving domain-specific resources. However, some link-contexts can misguide the crawler and lead it to retrieve the wrong Web pages, because relevant anchor texts may be judged irrelevant, or irrelevant anchor texts judged relevant, once relevance is calculated between the link-context and the feature terms of the specific topic. In view of this, this paper presents a heuristic-based approach that uses link-context selectively and uses the DOM tree to locate the anchor text. Unlike previous crawling algorithms, which focus only on link-context and ignore whether it is actually needed, our method considers both the link-context and an evaluation of its necessity, so that link-context is used correctly to guide topical crawling. Accordingly, our topical crawler can retrieve more relevant Web pages. Experimental results indicate that this approach outperforms the breadth-first, best-first, anchor-text-only, and link-context crawlers in both harvest rate and target recall.
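
The idea of using link-context selectively can be illustrated with a minimal Python sketch. The abstract does not state the actual heuristic, so everything here is an assumption made for illustration: the term-overlap `relevance` score, the `threshold` value, the choice of the anchor's parent DOM node as its link-context, and the rule "trust the anchor text when it is already relevant, otherwise consult the link-context". The names `TreeBuilder`, `relevance`, and `score_links` are hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """A minimal DOM node: a tag, its attributes, a parent, and ordered children."""

    def __init__(self, tag, parent=None, attrs=None):
        self.tag = tag
        self.parent = parent
        self.attrs = dict(attrs or [])
        self.children = []  # mix of Node objects and text strings

    def text(self):
        """Concatenate all text below this node, collapsing whitespace."""
        parts = [c if isinstance(c, str) else c.text() for c in self.children]
        return " ".join(" ".join(parts).split())


class TreeBuilder(HTMLParser):
    """Builds a simple DOM tree and records every <a href=...> node."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current, attrs)
        self.current.children.append(node)
        if tag == "a" and "href" in node.attrs:
            self.anchors.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest matching open tag; ignore stray closing tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data)


def relevance(text, topic_terms):
    """Assumed score: fraction of words in `text` that are topic feature terms."""
    words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(1 for w in words if w in topic_terms) / len(words)


def score_links(html, topic_terms, threshold=0.3):
    """Return (href, score, used_context) for each link in `html`.

    Assumed heuristic: if the anchor text alone reaches `threshold`, keep its
    score and skip the link-context; otherwise also score the text of the
    anchor's parent node and keep the larger of the two scores.
    """
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    scored = []
    for anchor in builder.anchors:
        score = relevance(anchor.text(), topic_terms)
        used_context = score < threshold
        if used_context:
            context = anchor.parent.text()  # link-context located via the DOM tree
            score = max(score, relevance(context, topic_terms))
        scored.append((anchor.attrs["href"], score, used_context))
    return scored
```

Under these assumptions, a link whose anchor text is a generic phrase such as "click here" is scored from the surrounding paragraph, while a link whose anchor text already contains topic terms is scored from the anchor text alone, so unrelated surrounding text cannot dilute it. A crawler frontier could then be ordered by the returned scores.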

 Keywords: Topical crawling, domain-specific resource retrieving, selectively using link context, DOM tree

    Received November 17, 2012; accepted March 14, 2014 

