Improving the Accuracy of English-Arabic Statistical Sentence Alignment


Mohammad Salameh¹, Rached Zantout², and Nashat Mansour¹
¹Department of Computer Science and Mathematics, Lebanese American University, Lebanon
²College of Computer and Information Sciences, Prince Sultan University, Saudi Arabia
 
Abstract: Multilingual natural language processing systems increasingly rely on parallel corpora to improve their output. Parallel corpora are the basic building block for training a statistical natural language processing system and for creating translation and language models. Several systems have been devised that automatically align the words of a pair of sentences, each in a different language. Such systems have been used successfully with European languages. In this paper, one such system is used to align sentences in an English-Arabic corpus. The system performs poorly when given raw, unaligned English-Arabic sentence pairs. This prompted the development of a preprocessing step that is applied to the Arabic sentences. The same corpus was then preprocessed, and a significant improvement is reported when alignment is attempted on the preprocessed sentences.

Keywords: Word alignment, sentence alignment, parallel corpora, and statistical natural language processing.

Received January 15, 2009; accepted August 3, 2009
