Improving the Accuracy of English-Arabic
Statistical Sentence Alignment
Mohammad Salameh1, Rached Zantout2, and Nashat Mansour1
1Department of Computer Science and Mathematics, Lebanese American University, Lebanon
2College of Computer and Information Sciences, Prince Sultan University, Saudi Arabia
Abstract: Multilingual natural language processing systems increasingly rely on parallel corpora to improve their output. Parallel corpora constitute the basic building block for training a statistical natural language processing system and for creating translation and language models. Several systems have been devised that automatically align the words of a pair of sentences, each in a different language. Such systems have been used successfully with European languages. In this paper, one such system is used to align sentences in an English-Arabic corpus. The system performs poorly when given raw, unaligned English-Arabic sentence pairs. This prompted the development of a preprocessing step that is applied to the Arabic sentences. The same corpus was then preprocessed, and a significant improvement is reported when alignment is attempted on the preprocessed unaligned sentences.
Keywords: Word alignment, sentence alignment, parallel corpora, and statistical natural language processing.
Received January 15, 2009; accepted August 3, 2009