UCOM Offline Dataset-An Urdu Handwritten Dataset Generation

Saad Bin Ahmed1, Saeeda Naz2,3, Salahuddin Swati4, Muhammad Razzak1, Arif Umar2, and Akbar Khan4

1College of Public Health and Health Informatics, King Saud Bin Abdul Aziz University for Health Sciences, Saudi Arabia

2Department of Information Technology, Hazara University, Pakistan

3GGPGC No.1, Higher Education Department, Pakistan

4COMSATS Institute of Information Technology, Pakistan

Abstract: A benchmark database is essential for developing efficient and robust character recognition systems. Unfortunately, there is no comprehensive handwritten dataset for the Urdu language that can be used to compare state-of-the-art techniques in optical character recognition. In this paper, we present a new, publicly available dataset comprising 600 pages of handwritten Urdu text in the Nasta’liq style, together with detailed ground truth for evaluating handwritten Urdu character recognition. The text lines were written by a limited number of individuals on A4-size paper; the pages were then scanned and the text lines segmented. The UCOM database covers all Urdu characters and ligatures in their different variations, in addition to Urdu numerals. In this dataset, a ligature is considered to consist of up to five characters. The UCOM dataset can be used for handwritten character recognition as well as writer identification. We propose the use of Recurrent Neural Networks (RNN) and evaluate their strength on sample text lines from the UCOM offline database.

Keywords: RNN, optical character recognition, cursive, offline handwriting.

Received April 22, 2014; accepted October 26, 2014
