Recognition of Bangla compound characters using structural decomposition

doi:10.1016/j.patcog.2013.08.026

Pattern Recognition

Volume 47, Issue 3, March 2014, Pages 1187-1201

https://doi.org/10.1016/j.patcog.2013.08.026

Highlights

• The proper recognition of compound characters is a difficult problem due to their complex shapes.
• In this paper, we propose a novel character recognition method for Bangla compound characters.
• Our strategy is to decompose the compound character into simpler shape components.
• Our technique is applicable to printed and handwritten characters.
• Experiment is done on printed and handwritten Bangla compound characters.

Abstract

In this paper we propose a novel character recognition method for Bangla compound characters. Accurate recognition of compound characters is a difficult problem due to their complex shapes. Our strategy is to decompose a compound character into skeletal segments. The compound character is then recognized by extracting the convex shape primitives and using a template matching scheme. The novelty of our approach lies in the formulation of appropriate rules of character decomposition for segmenting the character skeleton into stroke segments and then grouping them for extraction of meaningful shape components. Our technique is applicable to both printed and handwritten characters. The proposed method performs well for complex-shaped compound characters, which were confusing to the existing methods.

Introduction

Optical character recognition (OCR) is the automatic recognition, by computer, of optically scanned and digitized character images. Several OCR systems are commercially available [1], [2]. OCR is a necessary step for tasks such as converting books, documents, and office records into electronic form [3], which allows widely available text processing tools to be used for retrieval and dissemination of information [4]. Electronic text takes up less storage space than the image and can be edited, searched [5], [6], [9], and formatted for better display and printing. It can also be machine translated [7] and converted to speech [8].

OCR systems are available for Roman or English script [10] and for a few Asian scripts, such as Chinese [11], [12], Japanese [13], Korean [14], [15], and Arabic [16], [17]. In the last two decades, several OCR works have been reported on different Indian scripts, such as Bangla [18], Devanagari [19], Tamil [20], Malayalam [21], Oriya [22], Telugu [23], Kannada [24], Gurmukhi [25], and Gujarati [26]. These works mainly deal with recognizing basic characters. However, the main challenge in designing an OCR for Indian scripts is to handle compound (also known as ‘conjunct’) characters, which are formed by combining two or more basic characters. The complex shapes of these characters make the problem more difficult.

In this paper, we address the problem of compound character recognition in Bangla, which is the second-most popular language in India and among the top ten languages in the world [27]. Bangla script is used to write the Assamese and Bengali (also called ‘Bangla’) languages. Bangla has a large number of compound characters (about 250), although some of them are now obsolete. Hence, in our work, we consider the most familiar character set (about 165 characters [28]) used in Bangla literature. Many of them are very complex in shape compared to the Devanagari compound characters [29], [30]. Prior work on Bangla OCR includes [18], [31], [32], [33], [34] for printed basic characters and [35], [36], [37], [38], [39], [40] for handwritten basic characters. However, reported work on recognizing Bangla compound characters [18], [36], [41], [42], [34] is scarce.

The research on Bangla compound character recognition can be categorized into two sets of methods, developed for printed and handwritten characters. Chaudhuri and Pal [18] have proposed a template matching approach for printed Bangla compound character recognition. In this method, text digitization, noise removal, skew detection, and correction are done as part of preprocessing. The text documents are segmented into lines, words, and characters using horizontal–vertical projection profile analysis and head line removal techniques. They have used eight stroke-based features for representing a character and a filled-circle feature for representing a dot.
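
The projection profile segmentation mentioned above can be illustrated with a minimal Python sketch. It is not the implementation of [18]: the function names, the 0/1 list-of-lists image format, and the zero threshold are assumptions made for illustration, and head line removal (which Bangla needs before characters within a word can be separated) is not included.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed format: rows of 0 (background) / 1 (ink)


def split_runs(profile: List[int], threshold: int = 0) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, of maximal runs with profile > threshold."""
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i                      # a run of inked rows/columns begins
        elif value <= threshold and start is not None:
            runs.append((start, i))        # the run ended just before index i
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def horizontal_profile(image: Image) -> List[int]:
    """Number of ink pixels in each row."""
    return [sum(row) for row in image]


def vertical_profile(image: Image, top: int, bottom: int) -> List[int]:
    """Number of ink pixels in each column, restricted to rows top..bottom-1."""
    width = len(image[0])
    return [sum(image[r][c] for r in range(top, bottom)) for c in range(width)]


def segment(image: Image):
    """Split into text lines by rows, then split each line by columns."""
    result = []
    for top, bottom in split_runs(horizontal_profile(image)):
        columns = split_runs(vertical_profile(image, top, bottom))
        result.append(((top, bottom), columns))
    return result
```

Each returned entry pairs a row range (a text line) with the column ranges found inside it; in a real Bangla page those column ranges correspond to words until the head line is removed.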

Garain and Chaudhuri [41] have proposed a template matching technique for recognizing Bangla printed compound characters. Run number vectors are computed using horizontal and vertical scans organized with respect to the centroid of the pattern. The vector is normalized and abbreviated so as to make it invariant to scaling, insensitive to character style variation and more effective for complex shaped characters. Matching is performed within a group of compound characters.

Sural and Das [34] have used the concept of fuzzy sets for recognizing printed Bangla compound characters. Hough transform is used to extract lines and circles. Attributes such as length, position and orientation are used to define a number of fuzzy sets. A three stage multi-layer perceptron (MLP) classifier, trained with a number of linguistic set memberships derived from the intersections on the basic fuzzy sets, can recognize the characters by their similarities to the different fuzzy pattern classes.

Pal et al. [42] have proposed an off-line Bangla handwritten compound character recognition method using modified quadratic discriminative function (MQDF) classifier. The features used are mainly based on the directional information obtained from the arc tangent of the gradient.
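
As a rough illustration of directional features derived from the arc tangent of the gradient, the sketch below builds a magnitude-weighted histogram of gradient directions. The central-difference gradient, the eight bins, and the function name are assumptions chosen for brevity; the actual feature design in [42] may differ.

```python
import math
from typing import List


def gradient_direction_histogram(image: List[List[float]], bins: int = 8) -> List[float]:
    """Accumulate gradient magnitude into `bins` equal sectors of [0, 2*pi)."""
    height, width = len(image), len(image[0])
    hist = [0.0] * bins
    # Border pixels are skipped because central differences need both neighbours.
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue  # flat region, no direction to record
            # atan2 gives the direction in (-pi, pi]; fold it into [0, 2*pi).
            angle = math.atan2(gy, gx) % (2 * math.pi)
            index = int(angle / (2 * math.pi) * bins) % bins
            hist[index] += magnitude
    return hist
```

In practice such histograms are usually computed per image block and concatenated, so that the feature vector retains some spatial layout.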

Das et al. [36] have recognized Bangla handwritten basic and compound characters using two different classifiers: multi-layer perceptron (MLP) and support vector machine (SVM). Features used are based on shadow, longest run, and quad-tree. The MLP classifier is used for recognizing different groups of characters. A confusion matrix is prepared for the recognition results of the MLP classifier. Classes having a high degree of mutual misclassification are further handled using an SVM classifier, which gives a better accuracy.
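
The step of picking out mutually confused classes from a confusion matrix can be sketched as follows. The selection rule (both misclassification rates at or above a threshold) and the name `confusing_pairs` are assumptions for illustration; [36] may define "high degree of mutual misclassification" differently.

```python
from typing import List, Tuple


def confusing_pairs(confusion: List[List[int]], min_rate: float = 0.1) -> List[Tuple[int, int]]:
    """Find class pairs (i, j) that are frequently mistaken for each other.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    totals = [sum(row) for row in confusion]
    pairs = []
    for i in range(len(confusion)):
        for j in range(i + 1, len(confusion)):
            if totals[i] == 0 or totals[j] == 0:
                continue  # no samples for one of the classes, rate undefined
            rate_ij = confusion[i][j] / totals[i]  # share of class i predicted as j
            rate_ji = confusion[j][i] / totals[j]  # share of class j predicted as i
            if rate_ij >= min_rate and rate_ji >= min_rate:
                pairs.append((i, j))
    return pairs
```

The pairs returned here would be the candidates for a second-stage classifier trained only to separate those classes.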

Proper recognition of compound characters for Bangla script is a difficult problem because of the complex structural characteristics of these characters. We highlight some typical characteristics of Bangla compound characters which render the problem quite difficult and challenging.

1. Certain compound characters are very similar in shape and are referred to as confusing characters. Fig. 1 shows a representative set of pairs of confusing characters.
2. A few compound characters have very complex shapes [example glyphs not reproduced]. It is seen that when a compound character is formed by three basic characters, its shape becomes very complex.
3. The shapes for handwritten versions of certain compound characters are quite different from their printed styles.

To address the aforementioned challenges we propose a novel character recognition method for Bangla compound characters, using topological features, extracted by analyzing the structural convexities of the script aksharas. We handle the complex shape of a compound character by decomposing it into convex shape primitives. The topological characteristics of the character are represented in the form of the layout of the shape primitives. We recognize the compound character by matching the extracted topological feature with the standard feature templates that we formulate for the compound characters. A unique aspect of this work is the formulation of character decomposition rules for getting simpler shapes within the character skeleton.

The rest of this paper is organized as follows. Section 2 describes the module for detecting compound characters in a dataset containing both basic and compound characters. The decomposition of compound characters into shape components is explained in Section 3. The skeletal segments are decomposed into strokes and represented as shape primitives using the method given in Section 4. In Section 5 we discuss the formulation of topological features and the similarity measure for feature templates for recognizing compound characters. Experimental results and related discussion are reported in Section 6. The concluding notes are given in Section 7.

Detection of compound characters

Compound characters can be detected and recognized by certain typical structural characteristics which distinguish them from the basic characters. The printing style and font information do not contribute to the character topology. Hence we prefer to work with the most simplistic representation of the character topology – its skeleton. For our purpose of detecting and recognizing compound characters it suffices to have a topological representation which is able to distinguish between even the

Skeletal decomposition of compound characters

In this section we explain our methodology to decompose the polygonized skeleton of a compound character. We look for simpler skeletal segments which tend to form cohesive or meaningful units. We present the decomposition rules for breaking a compound character into simpler structures. Character recognition using decomposition into simpler primitives has been used in the past. Hu et al. [50] have used singular points such as terminal, intersection, bend and directional points to decompose a

Identifying convex shapes in a skeletal segment

The skeletal segments obtained so far may have junction points and branches. This section describes the further steps applied to the skeletal segments to identify the convex shapes. In Subsection 4.1 we describe how to trace paths (strokes) in a skeletal segment. A stroke is a sequence of vertices which does not exhibit branching. Identifying convex segments from a stroke is discussed in Subsection 4.2. Larger convex segments are further broken down to obtain smaller convex segments.
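
One plausible way to cut a stroke into convex pieces is to split wherever the turning direction flips. The Python sketch below follows that idea under stated assumptions: a stroke is a list of (x, y) vertices, a convex segment is taken to be a maximal run of vertices whose turns all have the same sign, and neighbouring pieces share the edge where the flip occurs. This is not necessarily the rule used in Subsection 4.2, and the further breakdown of larger convex segments is not covered.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def turn(o: Point, a: Point, b: Point) -> float:
    """Cross product of edges o->a and a->b: > 0 left turn, < 0 right turn, 0 straight."""
    return (a[0] - o[0]) * (b[1] - a[1]) - (a[1] - o[1]) * (b[0] - a[0])


def split_into_convex_segments(stroke: List[Point]) -> List[List[Point]]:
    """Split a polyline into maximal pieces with a consistent turning direction."""
    if len(stroke) < 3:
        return [list(stroke)]
    segments = []
    start = 0          # index of the first vertex of the current piece
    current_sign = 0   # 0 until the current piece has made its first real turn
    for i in range(1, len(stroke) - 1):
        t = turn(stroke[i - 1], stroke[i], stroke[i + 1])
        sign = (t > 0) - (t < 0)
        if sign == 0:
            continue   # collinear vertex, does not affect convexity
        if current_sign == 0:
            current_sign = sign
        elif sign != current_sign:
            # The turn at vertex i opposes the earlier ones: close the piece at i
            # and start the next piece at i - 1 so that it contains this turn.
            segments.append(stroke[start:i + 1])
            start = i - 1
            current_sign = sign
    segments.append(stroke[start:])
    return segments
```

Note that a consistent turning direction does not bound the total turning angle, so a spiral would still come out as a single piece; a stricter convexity test would need an additional check.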

Recognition using topological features

Each convex segment of a character is labeled with (or mapped to) its best matching shape primitive. Our repertoire of shape primitives comprises nine shapes as shown in Fig. 10, which allow us to have a good representation of the convex shapes. The procedure to identify the matching shape primitive for each convex segment is discussed next (Fig. 11).

Consider a convex segment with $k$ vertex points, namely $p_{1}, p_{2}, \dots, p_{k}$. The end points $p_{1}(x_{1}, y_{1})$ and $p_{k}(x_{k}, y_{k})$ together form the opening of the convex

Experimental results and discussion

We have implemented the Bangla compound character detection and recognition system in the C programming language on Fedora 10 running on an Intel Core2 Duo 2.20 GHz machine with 1 GB RAM. We collected printed characters from several heterogeneous printed documents. The handwritten documents were collected from individuals of different ages and professions. Samples were collected on normal writing paper using standard ball-point pens, gel-pens and ink-pens. We avoided pens which produce thick strokes like the

Conclusion

In this paper we have proposed novel topological features for recognizing Bangla compound characters. We have formulated decomposition rules to break a compound character into simpler shape components. The decomposition improves the efficacy of the features used and yields a better recognition performance. Our recognition scheme does not require any training with real examples. This is an advantage because many Bangla compound characters are used rarely and finding a sufficient number of

Conflict of interest

None declared.

References (53)

H. Fujisawa, Forty years of research in character and document recognition–an industrial perspective, Pattern Recognition (2008)

V.K. Govindan, Character recognition–a review, Pattern Recognition (1990)

P.B. Pati et al., Word level multi-script identification, Pattern Recognition Letters (2008)

M. Cheriet et al., Handwritten recognition research: twenty years of achievement… and beyond, Pattern Recognition (2009)

T.H. Su et al., Off-line recognition of realistic Chinese handwriting using segmentation-free strategy, Pattern Recognition (2009)

H.J. Kim et al., Recognition of off-line handwritten Korean characters, Pattern Recognition (1996)

J.O. Kwon et al., Recognition of on-line cursive Korean characters combining statistical and structural methods, Pattern Recognition (1997)

B.B. Chaudhuri et al., A complete printed Bangla OCR system, Pattern Recognition (1998)

U. Pal et al., Indian script character recognition: a survey, Pattern Recognition (2004)

A. Dutta et al., Bengali alpha-numeric character recognition using curvature features, Pattern Recognition (1993)

S. Sural et al., An MLP using Hough transform based fuzzy feature extraction for Bengali script recognition, Pattern Recognition Letters (1999)

S. Basu et al., A hierarchical approach to recognition of handwritten Bangla characters, Pattern Recognition (2009)

S. Bag et al., An improved contour-based thinning method for character images, Pattern Recognition Letters (2011)

P. Sarkar, Document image analysis for digital libraries, in: International Workshop on Research Issues in Digital...

T. Kameshiro, T. Hirano, Y. Okada, F. Yoda, A document image retrieval method tolerating recognition and segmentation...

A. Kumar, C.V. Jawahar, R. Manmatha, Efficient search in document image collections, in: Asian Conference on Computer...

D. Genzel, A. C. Popat, N. Spasojevic, M. Jahr, A. Senior, E. Ie, F.Y. Tang, Translation-inspired OCR, in:...

A. Bahrampour, W. Barkhoda, B.Z. Azami, Implementation of three text to speech systems for Kurdish language, in:...

S. Laroum et al., HYBRED: an OCR document representation for classification tasks, International Journal of Computer Science Issues (2011)

P.K. Wong et al., Off-line handwritten Chinese character recognition as a compound Bayes decision problem, IEEE Transactions on Pattern Analysis and Machine Intelligence (1998)

F. Kimura, OCR technologies for machine printed and hand printed Japanese text, in: Digital Document Processing: Major...

A. Amin, Off line Arabic character recognition: a survey, in: International Conference on Document Analysis and...

M.S. Khorsheed, Off-line Arabic character recognition: a review, Pattern Analysis and Applications (2002)

R. Jayadevan et al., Offline recognition of Devanagari script: a survey, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews (2011)

R.J. Kannan, A comparative study of optical character recognition for Tamil script, European Journal of Scientific Research (2009)

M.A. Rahiman, M.S. Rajasree, Printed Malayalam character recognition using back-propagation neural networks, in:...

Soumen Bag received the B.E. and M.Tech. degrees in Computer Science and Engineering from the National Institute of Technology (NIT) Durgapur, India, in 2003 and 2008, respectively. From January 2004 to June 2006, he worked as a lecturer in the Department of Computer Science and Engineering at BCET Durgapur, India. He received his Ph.D. from the Indian Institute of Technology (IIT) Kharagpur in 2013. Since August 2012, he has been working as an Assistant Professor at the International Institute of Information Technology (IIIT), Bhubaneswar, India. He is the recipient of the Institute Gold Medal for First Class for his Master's degree. His research interests are in the areas of OCR for Indian Scripts, Document Image Analysis, Image Processing, and Pattern Recognition.

Gaurav Harit received his Ph.D. from the Indian Institute of Technology Delhi in 2007. He worked as an Assistant Professor at IIT Kharagpur from 2008 to 2010 and has been an Assistant Professor at IIT Jodhpur since July 2010. His areas of interest include Document Image Analysis, Image Analysis, and Computer Vision.

Partha Bhowmick received his B.Tech. from IIT Kharagpur and his master's and Ph.D. from ISI Kolkata. He is presently an Associate Professor in the CSE Department, IIT Kharagpur. His primary research interests are in digital geometry, computer graphics, low-level image processing, approximate pattern matching, shape analysis, document image analysis, GIS, and biometrics.
