Building an Efficient Indexing for Crawling the Web Site with an Efficient Spider


Ghaleb Al-Gaphari
Faculty of Computer Science, University of Sana’a, Yemen


Abstract: Constructing a high-performance web search engine requires both an efficient indexing mechanism and a high-performance web spider. In this work, we investigate the results of applying the right-truncated index-based web search engine together with a high-performance web spider, in order to determine their usefulness for storing and retrieving Arabic documents on one hand and their effectiveness in finding and analyzing the data to be indexed on the other. The right-truncated index-based web search engine is a program that reads any set of Arabic documents, accepts a query, processes both the documents and the query, and then selects (predicts) the documents most relevant to the submitted query. The program comprises a morphological component and a mathematical one. The morphological component allows the researcher to run either a stemming algorithm or a right-truncation algorithm. The chief advantage of the stemming algorithm is that it uses the least possible amount of storage for indexing by mapping inflected and derived terms onto a single indexed stem word. The right-truncation algorithm, by contrast, reduces the amount of storage to a lesser degree but increases the probability of retrieving relevant (user-favorable) documents compared with the stemming algorithm. One purpose of our investigation is to compare the efficiency of these two indexing mechanisms. The mathematical component computes the TF-IDF term-weighting scheme by multiplying the inverse-document-frequency array with the term-frequency array for each term in every document, and then computes the cosine similarity between the query vector and each document vector in the collection. The greater the cosine similarity between the query vector and a document vector, the more relevant that document is to the query; expressed differently, the greater the cosine similarity between the query terms and a document containing those terms, the higher the probability that the document matches the user's interest, thereby improving the query's retrieval power. This paper also describes building a simple search engine based on a crawler (spider). The crawler is an algorithm that crawls the file system starting from a specified folder and indexes different types of documents. A basic design and object model was developed to support results for both single-word and multiple-word searches. The crawler is capable of finding data to index by following (tracing) web links rather than searching directory listings in the file system: files are downloaded over HTTP, and HTML pages are parsed to obtain further links without entering a recursive loop. Finally, the paper discusses how to improve the efficiency of the indexing mechanism for Arabic document processing by using a right-truncated stemmer.
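For concreteness, the term-weighting and similarity measures named in the abstract can be written as follows. This is a minimal sketch of the standard TF-IDF and cosine-similarity definitions; the abstract does not fix a particular variant (for example, the logarithm base or the normalisation used), so those details are assumptions here.

\[ w_{t,d} = tf_{t,d} \cdot \log \frac{N}{df_t} \]

\[ \cos(q,d) = \frac{\sum_{t} w_{t,q}\, w_{t,d}}{\sqrt{\sum_{t} w_{t,q}^{2}} \; \sqrt{\sum_{t} w_{t,d}^{2}}} \]

where \( tf_{t,d} \) is the frequency of term \( t \) in document \( d \), \( df_t \) is the number of documents in the collection containing \( t \), \( N \) is the total number of documents, and \( q \) and \( d \) denote the query vector and a document vector. A larger cosine value indicates a document whose weighted terms point in nearly the same direction as the query's, which is the relevance criterion described above.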
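The spider behaviour described in the abstract (download pages over HTTP, parse the HTML for further links, and avoid recursive loops) corresponds to a standard breadth-first crawl with a visited set. The Python sketch below illustrates that idea only; it is not the paper's implementation, and all names in it (crawl, LinkExtractor, seed_url, max_pages) are assumptions introduced for illustration.

```python
# Illustrative sketch only: shows the crawl loop stated in the abstract --
# fetch a page over HTTP, parse its HTML for links, and use a visited set
# so that pages linking back to each other never cause a recursive loop.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=100):
    """Breadth-first crawl from seed_url; returns {url: page text} for indexing."""
    visited = set()              # URLs already fetched -- prevents recursive loops
    queue = deque([seed_url])    # frontier of URLs still to fetch
    pages = {}

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue             # skip pages that cannot be downloaded
        pages[url] = html        # hand the page text to the indexer

        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)   # resolve relative links
            if absolute not in visited:
                queue.append(absolute)
    return pages
```

Remembering every fetched URL in the visited set is what keeps the crawl from revisiting pages, regardless of how the links between them cycle; the queue simply widens the frontier one page at a time.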

Keywords: Web search engine, truncation, indexing efficiency, spider, crawler.

Received March 18, 2007; accepted December 31, 2007 
