Urdu OCR Compound Character Recognition Using Feed Forward Neural Networks by Zaheer Ahmad Peshawar Date 124-05-09

Urdu Compound Character Recognition Using Feed Forward Neural Networks
Zaheer Ahmad, Inam Shamsher, Jehanzeb Khan Orakzai

Center of Information Technology, Institute of Management Sciences, Hayatabad, Peshawar, Pakistan
e-mail: ahmad.zaheer@yahoo.com, inamshamsher1@yahoo.com, janzeb@yahoo.com

Abstract: Urdu compound character recognition is a scarcely developed area that requires robust techniques, because Urdu, as a member of the Arabic script family, is cursive and written right to left, and its characters change shape and size depending on whether they are placed at the beginning, middle or end of a word. The developed system consists of two main modules: segmentation and classification. In the segmentation phase, pixel strength is measured to detect words in a sentence and the joints between characters in a compound (connected) word. In the next phase, the segmented characters are fed to a trained neural network for classification and recognition; the feed forward neural network is trained on 56 different classes of characters, each having 100 samples. The main purpose of the system is to test the algorithm developed for segmentation of compound characters. The prototype of the system, developed in Matlab, currently achieves 70% accuracy on average.

Keywords: Urdu Script, OCR, Neural Networks, Arabic.

I. INTRODUCTION
OCR is a field of research in pattern recognition, artificial intelligence and machine vision. An OCR system makes it possible to take a book or a magazine article, feed it directly into an electronic computer file, and then edit the file using a word processor. Here an Urdu OCR (UOCR) is designed and developed to recognize images of Urdu text and characters. The sole purpose of the OCR developed is to test the algorithm developed for feature extraction and segmentation; therefore no robust noise detection and removal techniques are applied. Similarly, the system is fixed to work on a specified font (Arial, size 36) and does not consider diacritic characters. The system takes a single line of Urdu text and splits the text into words and then into characters. A multilayer feed forward neural network is trained to recognize these segments as characters. Each character is fed to the trained neural net, which on successful recognition shows the correct character; otherwise a "Character not Recognized" message is generated. The result percentage of the system is 70%.

II. URDU: A CURSIVE SCRIPT
Urdu is the national language of Pakistan and one of the popular scripts of the Indian subcontinent. It evolved in the subcontinent from a mixture of the Arabic, Turkish, Farsi and Hindi languages, with a 58-character set defined by the National Language Authority of Pakistan [1-16], as shown in Fig-1. However, only 40 basic characters and one do-chashmi-hey are used to form all composite alphabets; therefore the working set consists of 41 alphabets.

Fig-1. Character Set (58 alphabets) of Urdu Script.

The Urdu alphabet is a modification of the Persian alphabet, which is itself a derivative of the Arabic alphabet. Urdu shares the Arabic script and many of its characteristics, with an additional set of alphabets from the Farsi and Hindi character sets. The graphical representation of each alphabet has more than one form, depending on its position and context in the word. In general each letter has four forms (beginning, middle, final and standalone), as shown in Table I.
TABLE I. DIFFERENT FORMS OF URDU CHARACTERS
(Columns: #, Character, Forms, Name. The Character and Forms glyphs are not reproduced.)
#: 0 1 1a
Name: hamzah, alif, alif madd
#: 2 2h 3 3h 4 4h 5 5h 6 7 7h 8 8h 9 10 11 11h 12 12h 13 14 14h 15 15h 16 17 18 19 20 21 22 23 24 25 26 27 28 28h
Name: b bh p ph t th t. t.h s jm jh = c h = ch bar. H x = kh dl dh d.l d.h zl r r r. r.h z = zh sn n = shn Sd, Sud d, ud T Z ain ain f qf kf kh
#: 29 29h 30 30h 31 31h 32 32h 32a 32ah 33 33h 34 34b 35 53 35b
Name: gf gh lm lm mm mm nn nn nn-e unnah nn-e unnah v v ht. h d-am h ht. y ht. y bar. y
III. PROBLEMS OF URDU SCRIPT
Some problems are presented here from the character recognition point of view [7-15].
1. Urdu is written from right to left in both printed and handwritten forms.
2. No upper or lower cases exist in Urdu, but sometimes the last character of a word is considered upper case because it always remains in its full form.
3. Urdu is always written cursively. Words are separated by spaces. However, there are 6 characters that can be connected only from the right.
4. Urdu characters are normally connected on an imaginary line called the baseline, and each alphabet in a character has a fixed size depending upon the pen (Qalam) used, which is called khat.
5. Some Urdu characters have dots associated with them; the dots can be above or below the character.
6. Some characters contain a closed loop (refer to Table I). The loop is an important feature for describing a character. One character ( ) contains two loops. The open portion of some characters ( ) is sometimes closed, when written by hand, to form a triangle. The loop of some characters ( ) sometimes becomes so small that the internal opening disappears.
7. Hamza ( ), with its zigzag shape, is not really a letter, but it can cause difficulty in the segmentation process because it resembles the character ein ( ).
8. Only three characters represent vowels. However, there are other, shorter vowels represented by diacritics in the form of overscores or underscores, although these are used less in Urdu than in Arabic.
9. Dots may appear as two separated dots, as touching dots, as a hat or as a stroke.
10. Another style of Urdu handwriting is artistic or decorative calligraphy, which is usually full of overlapping and makes recognition difficult even for human readers, let alone computers.
IV. FEED FORWARD NEURAL NETWORKS
Neural networks are composed of simple input and output nodes operating in parallel. These elements are inspired by biological nervous systems. As in nature, the network function is determined largely by the connections between elements. We can train a neural network to perform a particular function by adjusting the values of the connections (weights) between elements. Commonly, neural networks are adjusted, or trained, so that a particular input leads to a specific target output. The network is adjusted, based on a comparison of the output and the target, until the network output matches the target.

Feed forward neural networks often have one or more hidden layers of nodes followed by an output layer of neurons. Multiple layers of neurons with nonlinear transfer functions allow the network to learn both nonlinear and linear relationships between input and output vectors.

There are a number of algorithms for training neural networks. The backpropagation (BP) algorithm is the most popular of them and has been used to solve numerous real-life problems. Applied to multilayer feed forward neural networks, BP consists of an iterative minimization of a cost function, in which the connection weights are adjusted according to the error between the computed and the desired output values.

V. URDU CONNECTED CHARACTER RECOGNITION (UOCR)
Any OCR consists of two main modules: one performs feature extraction and segmentation, and the other recognizes the segments as characters. The UOCR works similarly; it is also composed of two main modules with submodules, as shown in Fig.2.

[Flowchart stages: Input Urdu Text Image, Preprocessing, Segmentation, Segmented Character, Binary Character (Resized), Character Code (Results)]
Fig.2. Character Segmentation and Recognition
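The weight-adjustment loop outlined in Section IV can be illustrated with a minimal sketch. It is written in Python rather than the Matlab used for the prototype, and it uses plain gradient-descent backpropagation with a squared-error cost, which is an assumption: the paper's own training uses trainscg. The function names, the network size (2 inputs, 3 hidden nodes, 1 output), the learning rate and the toy OR data set are all hypothetical and chosen only to keep the example small.

```python
import math
import random


def tansig(x):
    # Hyperbolic tangent sigmoid (same shape as Matlab's tansig).
    return math.tanh(x)


def logsig(x):
    # Logistic sigmoid (same shape as Matlab's logsig).
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, w_out):
    """One tanh hidden layer followed by a single logistic output.

    A constant 1.0 is appended to each layer's input to act as the bias.
    """
    xb = x + [1.0]
    hidden = [tansig(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    hb = hidden + [1.0]
    out = logsig(sum(w * v for w, v in zip(w_out, hb)))
    return xb, hb, out


def train(samples, n_hidden=3, rate=0.5, epochs=2000, seed=0):
    """Plain backpropagation; returns the weights and the cost per epoch."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, target in samples:
            xb, hb, out = forward(x, w_hidden, w_out)
            err = out - target
            total += 0.5 * err * err
            # Output delta for squared error with a logistic output.
            d_out = err * out * (1.0 - out)
            # Hidden deltas use the output weights before they are updated.
            d_hidden = [d_out * w_out[k] * (1.0 - hb[k] ** 2)
                        for k in range(n_hidden)]
            for k in range(n_hidden + 1):
                w_out[k] -= rate * d_out * hb[k]
            for k in range(n_hidden):
                for i in range(n_in + 1):
                    w_hidden[k][i] -= rate * d_hidden[k] * xb[i]
        history.append(total)
    return w_hidden, w_out, history


# Toy data set (logical OR), used only to show the cost going down.
samples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
           ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
w_hidden, w_out, history = train(samples)
predictions = [round(forward(x, w_hidden, w_out)[2]) for x, _ in samples]
print(history[0], history[-1], predictions)
```

The cost recorded in history should fall as the weights are adjusted, which is the comparison of output and target that Section IV describes.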
VI. FEATURE EXTRACTION AND SEGMENTATION
During this phase, pixel strength is measured to detect words in a sentence and the joints between characters in a word, so that sentences can be segmented into words and words into characters. The pixel strength, or energy, is the number of black pixels in a specific direction. A search is made for a path in different directions (for example bottom to top or right to left), during which black pixels are counted, and the path on which the minimum number of black pixels is encountered is selected.

The method for finding the strength (energy) path, or seam, is to first find the minimum value in the last row, which becomes the (i,j)th pixel, save the pixel location and change its status to 1, and then work backwards by finding the minimum of the 3 neighboring pixels of (i,j) in the (i-1)th row and saving that pixel to the seam path. After the strength of the seam is found, the pixels that make up the seam are set to 1 in the image to increase their energy level and discourage their contribution to the next search for seams.

For word segmentation, vertically straight seams are selected as a first priority. For character segmentation, vertical seams are also preferred, but if the size of a segment exceeds a threshold value, horizontal seams are then applied to the same segment to segment it further. In Table II, the colored cells of columns II, III and IV together make a seam, columns V and VI combined make a seam, and columns I and VIII each make a seam on their own. These seams are selected for segmenting the image into words or characters.
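The seam search described above can be sketched as follows. This is a Python illustration, not the authors' Matlab code, and it makes one assumption loudly: the "minimum value in the last row" is taken from a cumulative cost map, as in seam carving [7], because the paper does not say how the per-row values are accumulated. The function names and the small binary image are hypothetical.

```python
def find_vertical_seam(image):
    """Return (strength, path) of a minimum-strength top-to-bottom seam.

    image: list of rows, each a list of 0/1 ints (1 = black pixel).
    path:  one column index per row, top row first.
    """
    rows, cols = len(image), len(image[0])
    # cost[i][j] = fewest black pixels on any connected path from row 0 to (i, j).
    cost = [list(image[0])]
    for i in range(1, rows):
        prev = cost[-1]
        cost.append([
            image[i][j] + min(prev[max(j - 1, 0):min(j + 2, cols)])
            for j in range(cols)
        ])
    # Start from the minimum of the last row, then walk upwards through
    # the cheapest of the (up to) three neighbours in the row above.
    j = min(range(cols), key=lambda c: cost[-1][c])
    strength = cost[-1][j]
    path = [j]
    for i in range(rows - 1, 0, -1):
        candidates = range(max(j - 1, 0), min(j + 2, cols))
        j = min(candidates, key=lambda c: cost[i - 1][c])
        path.append(j)
    path.reverse()
    return strength, path


def mark_seam(image, path):
    """Set seam pixels to 1 so that later searches are discouraged from reusing them."""
    for i, j in enumerate(path):
        image[i][j] = 1


# Hypothetical 4x5 binary image with one gap between two black regions.
image = [
    [1, 1, 0, 1, 1],
    [1, 0, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
]
strength, path = find_vertical_seam(image)
print(strength, path)  # 0 [2, 1, 2, 2]

mark_seam(image, path)
next_strength, _ = find_vertical_seam(image)
print(next_strength)  # 4
```

A seam of strength 0 crosses no black pixels and is therefore a candidate cut between two segments. Once it has been marked, the next search in this example can only find a seam that crosses a black pixel in every row.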
TABLE II. PIXEL SELECTION
        i   ii  iii  iv   v   vi  vii
i       0   0   0    0    0   0   0
ii      0   0   1    1    1   0   0
iii     0   1   1    0    0   0   1
iv      0   0   1    1    1   1   1
v       1   0   1    0    0   0   0
vi      0   0   0    0    1   1   1
vii     1   1   1    1    1   1   1
viii    0   0   0    0    0   0   0

VII. GARBAGE CHARACTERS
During the whole process some garbage characters are produced. These are unnecessary, undesired segments of a character, which in many cases are merged with their parent character (the main part of the character). In some cases, however, the algorithm is unable to merge these small segments with their relevant segments, and they are treated as characters until they are declared garbage characters in the recognition phase. Figure-3 shows a line of Urdu text (upper line) and the characters segmented from it (lower line). The lower line shows both the correctly segmented characters and the garbage characters produced along the way. The 5th segmented character from the right (in the second line) and the 2nd-last segmented character from the right do not take their full or distinguishable forms, and even a human reader would not be able to recognize them correctly, as they look more like re ( ) than noon ( ) or noon-ghuna ( ).

TABLE III. CHARACTERS AND GARBAGE PRODUCED
Character: Noon; Chotee yee; Seen, Sheen, Swad, Dwad; be, pe, te and tay; Yee (unsegmented)
Garbage: (segment images not reproduced)

Fig.3. Line of Urdu text (above) Segmented character (below)

VIII. RECOGNITION USING NEURAL NETWORK
Recognition is performed by a feed forward neural network, which works as the second module of the software. It is further divided into training and simulation parts.

a. NEURAL NETWORK ARCHITECTURE AND TRAINING
The multilayer feed forward neural network (FFNN) used here for recognition of characters consists of 21x15 (315) input nodes, a single hidden layer with 2000 nodes and an output layer of 6 nodes. The Matlab functions tansig and logsig are used for the hidden and output layers respectively. The training function trainscg was used, with all of its defaults, because of its optimized memory usage. The hidden layer of 2000 nodes was selected after testing different layer sizes for optimum results, whereas the input layer of 315 nodes was selected in view of the average size of the characters produced with the Arial font at size 36. With these parameters the FFNN took 2000 epochs to train and meet the goal of 0.0005.

b. TRAINING SET
The 41 alphabets were classified into 56 categories to train the neural net. For example, the characters sheen ( ) and swad ( ) are each used as a single class in all of their forms, but tay ( ) is divided into two classes. The same is the case with tee ( ). Some of the training samples are shown in figure-3 below.

Fig.3. Training set of single and two classes
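The architecture in subsection a. (315 inputs, 2000 tansig hidden nodes, 6 logsig outputs) can be sketched in Python as a single forward pass. Several parts are assumptions rather than details from the paper. The weights are random placeholders instead of trained values. The six outputs are read as a 6-bit binary class code, since 2^6 = 64 is enough for 56 classes. Codes of 56 and above are treated as "Character not Recognized". All function and constant names are hypothetical.

```python
import math
import random

N_INPUT, N_HIDDEN, N_OUTPUT, N_CLASSES = 21 * 15, 2000, 6, 56


def tansig(x):
    return math.tanh(x)


def logsig(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_out, n_in, rng, scale=0.05):
    """Random placeholder weights; the last column of each row is the bias."""
    return [[rng.uniform(-scale, scale) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def layer(weights, inputs, transfer):
    xb = list(inputs) + [1.0]
    return [transfer(sum(w * v for w, v in zip(row, xb))) for row in weights]


def decode(outputs):
    """Assumed decoding: threshold each output at 0.5, read the bits as a number."""
    bits = [1 if o >= 0.5 else 0 for o in outputs]
    return int("".join(map(str, bits)), 2)


def classify(pixels, w_hidden, w_out):
    """Return a class index in [0, 56), or None for 'Character not Recognized'."""
    hidden = layer(w_hidden, pixels, tansig)
    outputs = layer(w_out, hidden, logsig)
    code = decode(outputs)
    return code if code < N_CLASSES else None


rng = random.Random(0)
w_hidden = init_layer(N_HIDDEN, N_INPUT, rng)
w_out = init_layer(N_OUTPUT, N_HIDDEN, rng)

# A random 21x15 binary character image, flattened to 315 values.
pixels = [rng.choice([0, 1]) for _ in range(N_INPUT)]
result = classify(pixels, w_hidden, w_out)
print(result)
```

With untrained weights the returned class is meaningless. The sketch only shows how a resized 21x15 binary segment flows through the two layers to a class code.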
c. SIMULATION RESULTS
Neural network outputs for different characters are shown in Figure-4. Recognition of the character family of bee ( ), pee ( ), tee ( ), tay ( ), cee ( ) and fee ( ) is around 80%. The same is the case for the character family of kaf ( ) and gaf ( ), as these are the simplest characters and, despite their similarity to each other, they are totally different from the other characters. The character lam ( ), when used in the middle of a word, behaves like an alif ( ), which decreases its recognition percentage, but alif is not mistaken for lam ( ) in most cases. The characters waw ( ) and choty yee ( ) are difficult for the neural network to differentiate, as the segment of choty yee ( ) that remains after it produces garbage is very similar to waw ( ). The characters fee ( ), mem ( ) and ein ( ), when used in their middle forms, can be mistaken for each other by the neural network during recognition, which leads to a low recognition percentage for them. The character noon ( ), when used at the beginning of a word, looks like ze ( ) and zal ( ) and thus produces low results.

In the segmentation part, garbage characters are produced during the segmentation of seen ( ), sheen ( ), swad ( ), dwad ( ), noon ( ) and noon ghuna ( ), and in most cases these pass the character test during segmentation. Bee ( ), pee ( ), tee ( ), tay ( ), cee ( ) and fee ( ) also produce garbage characters, but in most cases these are identified as garbage. Fortunately, these characters produce garbage only when they are located at the end of a word. The combination of lam ( ) and alif ( ), when used in words like ( ), forms what is effectively a new character in the segmentation phase, as shown in figure-5. This needs to be treated carefully.

Fig.5. Lam or Alif of Islam
The algorithm produces the same results every time it is run on the same line of text in the same environment. The same holds for the results of a saved neural network on the same line of text. Compared with the neural network training time of 5-7 hours, the simulation phase requires 0.131 seconds to segment and classify a character through a trained neural network, whereas the segmentation algorithm alone takes 0.078 seconds to segment a single character. It follows that the neural network execution time is about 0.052 seconds per character, roughly the difference between the two timings. The Matlab functions profile and profreport were used on a number of text images to find the average execution time.

Fig.5. Character-wise Recognition %ge

IX. CONCLUSION
This paper describes a system for character recognition of compound printed Urdu script. Most of the errors (garbage characters) are produced at the last character of a word, when the word ends in noon or in a character with a shape similar to noon. Because it is hard to determine which character is the last character, this problem cannot be overcome easily. A large percentage of the errors is produced by the characters seen ( ), sheen ( ), swad ( ), dwad ( ), noon ( ) and noon ghuna ( ), which in most cases pass the character test during segmentation, whereas bee ( ), pee ( ), tee ( ), tay ( ), cee ( ) and fee ( ) also produce garbage characters in some cases.

REFERENCES
[1] Zaheer Ahmad, Jehanzeb Khan, Urdu Nastaleeq OCR (Optical Character Recognition), Proceedings of World Academy of Science, Engineering and Technology, Volume 2, ISSN: 1307-6884, December 2007.
[2] A layman's Urdu Alphabet, Wikipedia.com, Feb. 13, 2009. Available: http://en.wikipedia.org/wiki/Urdu_alphabet. [Accessed: Mar. 3, 2009]
[3] A. Amin, Arabic Character Recognition, Handbook of Character Recognition and Document Image Analysis, World Scientific Publishing Company, 1997, pp. 398.
[4] Tim Klassen, Towards Neural Network Recognition of Handwritten Arabic Letters, thesis for Master of Computer Science (M.C.Sc.), 2001.
[5] A layman's Connectors and non-connectors. Available: http://www.columbia.edu/itc/mealac/pritchett/00urdu/urduscript/section00.html?urdu#00_02. [Accessed: Apr. 12, 2008]
[6] A layman's Devangari and Urdu Alphabets, Nov. 25, 2008. Available: http://freenethomepage.de/prilop/urdu-alphabet.html. [Accessed: Mar. 3, 2009]
[7] Shai Avidan and Ariel Shamir, Seam carving for content-aware image resizing, seamcarving.com. Available: www.seamcarving.com. [Accessed: Mar. 3, 2009]
[8] Ahmed M. Zeki and Mohamad S. Zakaria, Challenges in Recognizing Arabic Character, International Islamic University Malaysia (IIUM), Kuala Lumpur, Malaysia; National University of Malaysia (UKM), Bangi, Selangor, Malaysia.
[9] A. Amin, Off-line Arabic Character Recognition - the State of the Art, Pattern Recognition, Vol. 31, No. 5, pp. 517-530, 1998.
[10] F. Al-Fakhri, On-Line Computer Recognition of Handwritten Arabic Text, Master's Thesis, Science University of Malaysia, 1997.
[11] A. Zeki, Plausible Inference Approach to Character Recognition, Master's Thesis, National University of Malaysia, 1999.
[12] A. Amin, H. Al-Sadoun and S. Fischer, Hand-Printed Arabic Character Recognition System using An Artificial Network, Pattern Recognition, Vol. 29, No. 4, pp. 663-675, 1996.
[13] T. Kanungo, G. Marton and O. Bulbul, Performance Evaluation of Two Arabic Products, in Proceeding of AIPR Workshop on Advances in Computer Assisted Recognition, SPIE, Vol. 3584, Washington DC, 1998.
[14] T. Kanungo, G. Marton and O. Bulbul, OmniPage vs. Sakhr: Paired Model Evaluation of Two Arabic OCR Products, in Proceeding of SPIE Conference on Document Recognition and Retrieval (VI), Vol. 3651, San Jose, 1999.
[15] A. Amin, Off line Arabic Character Recognition - A Survey, in Proceeding of the 4th International Conference Document Analysis and Recognition (ICDAR '97), pp. 596-599, 1997.
[16] K. Jumari and M. Ali, A Survey and Comparative Evaluation of Selected off-line Arabic handwritten Character Recognition Systems, Jurnal Teknology, Malaysian University of Technology, 2001.
[17] Inam Shamsheer, Zaheer Ahmad, OCR For Printed Urdu Script Using Feed Forward Neural Network, MLPR 2007: International Conference on Machine Learning and Pattern Recognition, Germany, 2007.
[18] S. S. Hyder, "A System for Generating Urdu/Farsi/Arabic Script", Information Processing 71, North Holland Publishing Co., Amsterdam, pp. 1145-1149, 1972.
[19] S. S. Hyder, F. Richer, "The Theory and Design of a System for Printing and Communicating in Arabic-Urdu-Farsi", Journal of Bio-Sciences Communications, Vol. 3, pp. 181-206, 1977.
[20] Larry Chang and I. Scott MacKenzie, A Comparison of Two Handwriting Recognizers for Pen-based Computers, 1994. Available: http://www.yorku.ca/mack/CASCON94.html. [Accessed: Aug. 3, 2008]
[21] H. Bunke and P. S. P. Wang, Handbook of Character Recognition and Document Image Analysis, World Scientific Publishing, Singapore, 1997.
[22] S. Mori, H. Nishida, and H. Yamada, Optical Character Recognition, Wiley Interscience, New Jersey, 1999.
[23] Optical Character Recognition and the Years Ahead, The Business Press, Elmhurst, IL, 1969.
[24] (No author), Auerbach on Optical Character Recognition, Auerbach Publishers, Inc., Princeton, 1971.
[25] S. V. Rice, G. Nagy, and T. A. Nartker, Optical Character Recognition: An Illustrated Guide to the Frontier, Kluwer Academic Publishers, Boston, 1999.
[26] H. F. Schantz, The History of OCR, Recognition Technologies Users Association, Boston, 1982.
