IAJIT

Building an Efficient Indexing for Crawling the Web Site with an Efficient Spider

Update: 24/06/2010



Ghaleb Al-Gaphari
Faculty of Computer Science, University of Sana’a, Yemen

Abstract: Constructing a high-performance web search engine requires an efficient indexing mechanism as well as a high-performance web spider. In this work, we investigate the results of applying both the Right-Truncated Index-Based Web Search Engine and the high-performance web spider, in order to determine the usefulness of the former for storing and retrieving Arabic documents and the effectiveness of the latter in finding and analyzing data to be indexed. The Right-Truncated Index-Based Web Search Engine is a program that reads any set of Arabic documents, accepts a query, and then processes both the documents and the query. It then selects (predicts) the documents most relevant to that query. The program encompasses both a morphological component and a mathematical one. The morphological component allows the researcher to run either a stemming algorithm or a right-truncated algorithm. The chief advantage of the stemming algorithm is that it uses the least possible amount of storage for indexing by mapping the inflected and derived terms into a single, indexed stem-word. The right-truncated algorithm, by contrast, reduces the amount of storage to a lesser degree but increases the probability of retrieving relevant (user-favorable) documents compared to the stemming algorithm. One of the purposes of our investigation is to compare the efficiency of these two indexing mechanisms. The mathematical component computes the TF-IDF term weights by multiplying the inverse document frequency array with the term frequency array for each term contained in every document. It then computes the cosine similarity between the query vector and each individual document vector in the collection. The greater the cosine similarity between the query vector and a document vector, the more relevant that document is to the query.
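The TF-IDF weighting and cosine-similarity ranking described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`tf_idf_vectors`, `cosine`, `rank`), the whitespace tokenization, and the `log(N / df)` form of the inverse document frequency are assumptions made for illustration, and no stemming or right truncation is applied to the terms.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return (vocabulary, idf, one TF-IDF vector per document).

    Assumption: documents are split on whitespace; a real system would
    first normalize each term (for example by stemming or truncation).
    """
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed IDF variant: log(N / df); other variants are common too.
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[term] * idf[term] for term in vocab])
    return vocab, idf, vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return (score, document index) pairs, most similar document first."""
    vocab, idf, vectors = tf_idf_vectors(docs)
    q_tf = Counter(query.split())
    # Query terms outside the vocabulary are ignored.
    q_vec = [q_tf[term] * idf[term] for term in vocab]
    scores = [(cosine(q_vec, d_vec), i) for i, d_vec in enumerate(vectors)]
    return sorted(scores, reverse=True)
```

Calling `rank(query, docs)` returns the documents ordered by decreasing cosine similarity, so the first pair identifies the document predicted to be most relevant to the query.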
Expressed differently, the greater the cosine similarity between the terms of the query and the document which contains those terms, the higher the probability that the document will correspond to the user's interest, thereby improving the retrieval power of the query. This paper also describes building a simple search engine based on a crawler, or spider. The crawler is an algorithm that crawls the file system from a specified folder and indexes different types of documents. A basic design and object model was developed to support both single-word and multiple-word search results. The spider is capable of finding data to index by following (tracing) web links rather than searching directory listings in the file system. In this process, files are downloaded through HTTP and HTML pages are parsed in order to obtain more links without getting into a recursive loop. Finally, this paper discusses how to improve the efficiency of the indexing mechanism for processing Arabic documents using a right-truncated stemmer.
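The link-following step described above, downloading pages over HTTP, parsing the HTML for further links, and avoiding a recursive loop, might look roughly like the sketch below. It is an illustration under stated assumptions, not the spider built in the paper: the names `LinkExtractor`, `fetch_http`, and `crawl`, the breadth-first order, and the `max_pages` limit are all hypothetical. Loops are avoided by recording every visited URL in a set.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_http(url):
    """Download a page over HTTP and return its text (assumes UTF-8)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_http, max_pages=50):
    """Breadth-first crawl; returns a dict mapping each fetched URL to its HTML.

    `fetch` is injectable so the crawl can be exercised without a network.
    """
    visited = set()          # URLs already processed: prevents endless loops
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        # Drop the #fragment so the same page is not visited twice.
        url, _fragment = urldefrag(queue.popleft())
        if url in visited:
            continue
        visited.add(url)
        try:
            html = fetch(url)
        except (OSError, ValueError):
            continue         # unreachable or unsupported URL: skip it
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links against the page they appeared on.
            queue.append(urljoin(url, href))
    return pages
```

The returned pages could then be handed to an indexer. A production spider would also need to respect robots.txt, restrict itself to the target site, and rate-limit its requests, none of which this sketch attempts.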

Keywords: Web search engine, truncation, indexing efficiency, spider, crawler.

Received March 18, 2007; accepted December 31, 2007

