Recognition of Bangla compound characters using structural decomposition
Introduction
Optical character recognition (OCR) is the automatic recognition, by computer, of optically scanned and digitized character images. Several OCR systems are commercially available [1], [2]. OCR is a necessary step in tasks such as converting books, documents, and office records into electronic form [3], which allows widely available text processing tools to be used for retrieval and dissemination of information [4]. Electronic text takes up less storage space than the image, and can be edited, searched [5], [6], [9], and formatted for better display and printing. It can also be machine translated [7] and converted to speech [8].
OCR systems are available for Roman or English script [10] and for a few Asian scripts, such as Chinese [11], [12], Japanese [13], Korean [14], [15], and Arabic [16], [17]. In the last two decades, several OCR works have been reported on different Indian scripts, such as Bangla [18], Devanagari [19], Tamil [20], Malayalam [21], Oriya [22], Telugu [23], Kannada [24], Gurmukhi [25], and Gujarati [26]. These works mainly deal with recognizing basic characters. However, the main challenge in designing an OCR system for Indian scripts is handling compound (also known as ‘conjunct’) characters, which are formed by combining two or more basic characters. The complex shapes of these characters make the problem more difficult.
In this paper, we address the problem of compound character recognition in Bangla, which is the second-most popular language in India and among the top ten languages in the world [27]. The Bangla script is used to write the Assamese and Bengali (also called ‘Bangla’) languages. Bangla has a large number of compound characters (about 250), but some of them are now obsolete. Hence, in our work, we consider the most familiar character set (about 165 characters [28]) used in Bangla literature. Many of these characters are very complex in shape compared to the Devanagari compound characters [29], [30]. Prior work on Bangla OCR includes [18], [31], [32], [33], [34] for printed basic characters and [35], [36], [37], [38], [39], [40] for handwritten basic characters. However, reports of work on recognizing Bangla compound characters in the literature [18], [36], [41], [42], [34] are few.
The research on Bangla compound character recognition can be categorized into two sets of methods, developed for printed and handwritten characters. Chaudhuri and Pal [18] have proposed a template matching approach for printed Bangla compound character recognition. In this method, text digitization, noise removal, skew detection, and correction are done as part of preprocessing. The text documents are segmented into lines, words, and characters using horizontal–vertical projection profile analysis and head line removal techniques. They have used eight stroke-based features for representing a character and a filled-circle feature for representing a dot.
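The projection-profile idea mentioned above can be illustrated with a minimal sketch. This is not the implementation of [18]; the function names, the list-of-lists binary image, and the zero threshold are assumptions made for the example. Rows containing ink form text lines, and the same run-finding step applied to a column profile separates words or characters within a line.

```python
def row_profile(image):
    """Horizontal projection profile: number of ink pixels (1s) in each row."""
    return [sum(row) for row in image]


def column_profile(image):
    """Vertical projection profile: number of ink pixels (1s) in each column."""
    return [sum(col) for col in zip(*image)]


def segment_runs(profile, threshold=0):
    """Return (start, end) pairs of maximal runs where profile > threshold.

    `end` is exclusive. The threshold is an assumed parameter; noisy scans
    would need a value above zero.
    """
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i                    # a run of ink begins here
        elif value <= threshold and start is not None:
            runs.append((start, i))      # the run ended just before index i
            start = None
    if start is not None:                # the run reaches the last index
        runs.append((start, len(profile)))
    return runs


# Toy binary page: two text lines separated by an empty row.
page = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
]
lines = segment_runs(row_profile(page))
top, bottom = lines[0]
blocks_in_first_line = segment_runs(column_profile(page[top:bottom]))
```

Head line removal, which Bangla text needs before characters can be separated by a vertical profile, is not covered by this sketch.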
Garain and Chaudhuri [41] have proposed a template matching technique for recognizing Bangla printed compound characters. Run number vectors are computed using horizontal and vertical scans organized with respect to the centroid of the pattern. The vector is normalized and abbreviated so as to make it invariant to scaling, insensitive to character style variation and more effective for complex shaped characters. Matching is performed within a group of compound characters.
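As a rough illustration of a run-based feature, the sketch below counts the runs of ink pixels along every horizontal and vertical scan line of a binary glyph. It shows only the counting step; the centroid-relative organization, normalization, and abbreviation described for [41] are not reproduced here, and the function names are assumptions for this example.

```python
def count_runs(scan):
    """Number of maximal runs of ink pixels (1s) along one scan line."""
    runs = 0
    previous = 0
    for pixel in scan:
        if pixel == 1 and previous == 0:
            runs += 1                    # a new run starts at this pixel
        previous = pixel
    return runs


def run_number_vectors(image):
    """Run counts for every row (horizontal scans) and column (vertical scans)."""
    horizontal = [count_runs(row) for row in image]
    vertical = [count_runs(col) for col in zip(*image)]
    return horizontal, vertical


# Toy glyph: a head line with three descending strokes of different lengths.
glyph = [
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
]
horizontal, vertical = run_number_vectors(glyph)
```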
Sural and Das [34] have used the concept of fuzzy sets for recognizing printed Bangla compound characters. The Hough transform is used to extract lines and circles. Attributes such as length, position and orientation are used to define a number of fuzzy sets. A three-stage multi-layer perceptron (MLP) classifier, trained with a number of linguistic set memberships derived from the intersections of the basic fuzzy sets, can recognize the characters by their similarities to the different fuzzy pattern classes.
Pal et al. [42] have proposed an off-line Bangla handwritten compound character recognition method using a modified quadratic discriminant function (MQDF) classifier. The features are mainly based on directional information obtained from the arc tangent of the gradient.
Das et al. [36] have recognized Bangla handwritten basic and compound characters using two different classifiers: multi-layer perceptron (MLP) and support vector machine (SVM). Features used are based on shadow, longest run, and quad-tree. The MLP classifier is used for recognizing different groups of characters. A confusion matrix is prepared for the recognition results of the MLP classifier. Classes having a high degree of mutual misclassification are further handled using an SVM classifier, which gives a better accuracy.
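The step of selecting classes with a high degree of mutual misclassification can be sketched as follows. This is only an illustration under stated assumptions: the function name, the `min_rate` threshold, and the requirement that the confusion be high in both directions are choices made for the example, not the criterion used in [36].

```python
def confusable_pairs(confusion, min_rate=0.1):
    """Find class pairs that a first-stage classifier confuses in both directions.

    confusion[i][j] is the number of class-i samples predicted as class j.
    A pair (i, j) is returned when the fraction of class i predicted as j and
    the fraction of class j predicted as i both reach `min_rate` (an assumed
    threshold).
    """
    pairs = []
    n = len(confusion)
    totals = [sum(row) for row in confusion]
    for i in range(n):
        for j in range(i + 1, n):
            if totals[i] == 0 or totals[j] == 0:
                continue                 # no samples, rate undefined
            rate_ij = confusion[i][j] / totals[i]
            rate_ji = confusion[j][i] / totals[j]
            if rate_ij >= min_rate and rate_ji >= min_rate:
                pairs.append((i, j))
    return pairs


# Toy confusion matrix for three classes; classes 0 and 1 are often mixed up.
confusion = [
    [88, 12, 0],
    [15, 85, 0],
    [1, 0, 99],
]
pairs = confusable_pairs(confusion)
```

Each returned pair would then be passed to a second-stage classifier trained only on those classes.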
Proper recognition of compound characters for Bangla script is a difficult problem because of the complex structural characteristics of these characters. We highlight some typical characteristics of Bangla compound characters which render the problem quite difficult and challenging.
1. Certain compound characters are very similar in shape and are referred to as confusing characters. Fig. 1 shows a representative set of pairs of confusing characters.
2. A few compound characters, such as ( + + ), ( + + ), ( + + ), have very complex shapes. When a compound character is formed from three basic characters, its shape becomes very complex.
3. The shapes of the handwritten versions of certain compound characters are quite different from their printed styles.
To address the aforementioned challenges we propose a novel character recognition method for Bangla compound characters, using topological features, extracted by analyzing the structural convexities of the script aksharas. We handle the complex shape of a compound character by decomposing it into convex shape primitives. The topological characteristics of the character are represented in the form of the layout of the shape primitives. We recognize the compound character by matching the extracted topological feature with the standard feature templates that we formulate for the compound characters. A unique aspect of this work is the formulation of character decomposition rules for getting simpler shapes within the character skeleton.
The rest of this paper is organized as follows. Section 2 describes the module for detecting compound characters in a dataset containing both basic and compound characters. The decomposition of compound characters into shape components is explained in Section 3. The skeletal segments are decomposed into strokes and represented as shape primitives using the method given in Section 4. In Section 5 we discuss the formulation of topological features and the similarity measure for feature templates for recognizing compound characters. Experimental results and related discussion are reported in Section 6. The concluding notes are given in Section 7.
Detection of compound characters
Compound characters can be detected and recognized by certain typical structural characteristics which distinguish them from the basic characters. The printing style and font information do not contribute to the character topology. Hence we prefer to work with the simplest representation of the character topology: its skeleton. For our purpose of detecting and recognizing compound characters it suffices to have a topological representation which is able to distinguish between even the
Skeletal decomposition of compound characters
In this section we explain our methodology to decompose the polygonized skeleton of a compound character. We look for simpler skeletal segments which tend to form cohesive or meaningful units. We present the decomposition rules for breaking a compound character into simpler structures. Character recognition using decomposition into simpler primitives has been used in the past. Hu et al. [50] have used singular points such as terminal, intersection, bend and directional points to decompose a
Identifying convex shapes in a skeletal segment
The skeletal segments obtained so far may have junction points and branches. This section describes the further steps applied to the skeletal segments to identify the convex shapes. Subsection 4.1 describes how to trace paths (strokes) in a skeletal segment. A stroke is a sequence of vertices that does not exhibit branching. Identifying convex segments in a stroke is discussed in Subsection 4.2. Larger convex segments are further broken down to obtain smaller convex segments.
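The two steps can be illustrated with a minimal sketch. It is not the procedure of Subsections 4.1 and 4.2, which is not reproduced in this excerpt. It assumes the polygonized skeleton is given as an undirected graph (`adjacency`, a hypothetical name), takes strokes to be paths between end points and junctions, and splits a stroke wherever the turning direction changes sign. A constant turning direction is only a necessary condition for convexity, and the further breakdown of larger convex segments is not attempted.

```python
def trace_strokes(adjacency):
    """Split a skeleton graph into strokes, i.e. paths without branching.

    adjacency maps a vertex id to the list of its neighbours. A stroke runs
    between vertices whose degree is not 2 (end points and junctions); its
    interior vertices all have degree 2. Isolated closed loops are ignored.
    """
    anchors = [v for v, nbrs in adjacency.items() if len(nbrs) != 2]
    visited = set()                      # directed edges already walked
    strokes = []
    for start in anchors:
        for nxt in adjacency[start]:
            if (start, nxt) in visited:
                continue
            path = [start, nxt]
            visited.update({(start, nxt), (nxt, start)})
            while len(adjacency[path[-1]]) == 2:
                a, b = adjacency[path[-1]]
                step = b if a == path[-2] else a   # keep moving forward
                visited.update({(path[-1], step), (step, path[-1])})
                path.append(step)
            strokes.append(path)
    return strokes


def turn(p, q, r):
    """Cross product of (q - p) and (r - q): > 0 left turn, < 0 right turn."""
    return (q[0] - p[0]) * (r[1] - q[1]) - (q[1] - p[1]) * (r[0] - q[0])


def split_convex(points):
    """Cut a polyline into pieces that each turn in a single direction.

    Adjacent pieces share the edge on which the turning direction flips.
    """
    segments = []
    start = 0
    sign = 0
    for i in range(1, len(points) - 1):
        t = turn(points[i - 1], points[i], points[i + 1])
        s = (t > 0) - (t < 0)
        if s == 0:
            continue                     # collinear vertex, no turn
        if sign != 0 and s != sign:
            segments.append(points[start:i + 1])
            start = i - 1                # the new piece reuses the flip edge
        sign = s
    segments.append(points[start:])
    return segments


# Toy skeleton: a path 0-1-2-3 with a branch 2-4 leaving junction 2.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3, 4], 3: [2], 4: [2]}
strokes = trace_strokes(adjacency)

# Toy S-shaped stroke given by vertex coordinates.
s_curve = [(0, 0), (1, 1), (2, 1), (3, 0), (4, 0), (5, 1)]
pieces = split_convex(s_curve)
```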
Recognition using topological features
Each convex segment of a character is labeled with (or mapped to) its best matching shape primitive. Our repertoire of shape primitives comprises nine shapes as shown in Fig. 10, which allow us to have a good representation of the convex shapes. The procedure to identify the matching shape primitive for each convex segment is discussed next (Fig. 11).
Consider a convex segment with k vertex points, namely . The end points and together form the opening of the convex
Experimental results and discussion
We have implemented the Bangla compound character detection and recognition system in the C programming language on Fedora 10, running on an Intel Core2 Duo 2.20 GHz machine with 1 GB RAM. We collected printed characters from several heterogeneous printed documents. The handwritten documents were collected from individuals of different ages and professions. Samples were collected on normal writing paper using standard ball-point pens, gel pens and ink pens. We avoided pens which produce thick strokes like the
Conclusion
In this paper we have proposed novel topological features for recognizing Bangla compound characters. We have formulated decomposition rules to break a compound character into simpler shape components. The decomposition improves the efficacy of the features used and yields a better recognition performance. Our recognition scheme does not require any training with real examples. This is an advantage because many Bangla compound characters are used rarely and finding a sufficient number of
Conflict of interest
None declared.
References (53)
Forty years of research in character and document recognition: an industrial perspective. Pattern Recognition (2008)
Character recognition: a review. Pattern Recognition (1990)
Word level multi-script identification. Pattern Recognition Letters (2008)
Handwritten recognition research: twenty years of achievement… and beyond. Pattern Recognition (2009)
Off-line recognition of realistic Chinese handwriting using segmentation-free strategy. Pattern Recognition (2009)
Recognition of off-line handwritten Korean characters. Pattern Recognition (1996)
Recognition of on-line cursive Korean characters combining statistical and structural methods. Pattern Recognition (1997)
A complete printed Bangla OCR system. Pattern Recognition (1998)
Indian script character recognition: a survey. Pattern Recognition (2004)
Bengali alpha-numeric character recognition using curvature features. Pattern Recognition (1993)
An MLP using Hough transform based fuzzy feature extraction for Bengali script recognition. Pattern Recognition Letters
A hierarchical approach to recognition of handwritten Bangla characters. Pattern Recognition
An improved contour-based thinning method for character images. Pattern Recognition Letters
HYBRED: an OCR document representation for classification tasks. International Journal of Computer Science Issues
Off-line handwritten Chinese character recognition as a compound Bayes decision problem. IEEE Transactions on Pattern Analysis and Machine Intelligence
Off-line Arabic character recognition: a review. Pattern Analysis and Applications
Offline recognition of Devanagari script: a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews
A comparative study of optical character recognition for Tamil script. European Journal of Scientific Research
Soumen Bag received the B.E. and M.Tech. degrees in Computer Science and Engineering from the National Institute of Technology (NIT) Durgapur, India, in 2003 and 2008, respectively. From January 2004 to June 2006, he worked as a lecturer in the Department of Computer Science and Engineering at BCET Durgapur, India. He received his Ph.D. from the Indian Institute of Technology (IIT) Kharagpur in 2013. Since August 2012, he has been working as an Assistant Professor at the International Institute of Information Technology (IIIT), Bhubaneswar, India. He is the recipient of the Institute Gold Medal for First Class for his Master's degree. His research interests are in the areas of OCR for Indian Scripts, Document Image Analysis, Image Processing, and Pattern Recognition.
Gaurav Harit received his Ph.D. from the Indian Institute of Technology Delhi in 2007. He worked as an Assistant Professor at IIT Kharagpur from 2008 to 2010. He has been an Assistant Professor at IIT Jodhpur since July 2010. His areas of interest include Document Image Analysis, Image Analysis, and Computer Vision.
Partha Bhowmick received his B.Tech. from IIT Kharagpur and his master's degree and Ph.D. from ISI Kolkata. He is presently an Associate Professor in the CSE Department, IIT Kharagpur. His primary research interests are in digital geometry, computer graphics, low-level image processing, approximate pattern matching, shape analysis, document image analysis, GIS, and biometrics.