1 Introduction

Structural pattern recognition is based on sophisticated data structures for pattern representation such as strings, trees, or graphs. Graphs are, in contrast with feature vectors, flexible enough to adapt their size to the complexity of individual patterns. Furthermore, graphs are capable of representing structural relationships that might exist between subparts of the underlying pattern (by means of edges). These two benefits turn graphs into a powerful and flexible representation formalism, which is used in diverse fields [1, 2].

The computation of a dissimilarity between pairs of graphs, termed graph matching, is a basic requirement for pattern recognition. In the last four decades, quite an arsenal of algorithms has been proposed for the task of graph matching [1, 2]. Moreover, different benchmarking datasets for graph-based pattern recognition have been made available, such as ARG [3], IAM [4], or ILPIso [5]. These dataset repositories consist of synthetically generated graphs as well as graphs that represent real-world objects.

Recently, graphs have gained some attention in the field of handwritten document analysis [4], for instance in handwriting recognition [6], keyword spotting [7–9], or signature verification [10, 11]. However, we still observe a lack of publicly available graph datasets that are based on handwritten word images. The present paper tries to close this gap and presents a twofold contribution. First, we introduce six novel graph extraction algorithms applicable to handwritten word images. Second, we provide a benchmark database for word classification that is based on the George Washington letters [12, 13].

The remainder of this paper is organised as follows. In Sect. 2, the proposed graph representation formalisms are introduced. In Sect. 3, an experimental evaluation of the novel graph representation formalisms is given on the George Washington dataset. Section 4 concludes the paper and outlines possible further research activities.

2 Graph-Based Representation of Word Images

A graph g is formally defined as a four-tuple \(g=(V,E,\mu ,\nu )\) where V and E are finite sets of nodes and edges, and \(\mu :V \rightarrow L_V\) as well as \(\nu :E \rightarrow L_E\) are labelling functions for nodes and edges, respectively. Graphs can either be undirected or directed, depending on whether pairs of nodes are connected by undirected or directed edges. Additionally, graphs are often divided into unlabelled and labelled graphs. In the former case we assume empty label alphabets (i.e. \(L_V=L_E=\{\}\)), and in the latter case, nodes and/or edges can be labelled with an arbitrary numerical, vectorial, or symbolic label.
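As an illustration of this definition, the following minimal sketch stores an undirected graph with labelled nodes and unlabelled edges. The class name `Graph` and its methods are hypothetical and only serve the example; they are not part of the proposed framework.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Undirected graph g = (V, E, mu, nu) with labelled nodes and unlabelled edges."""
    node_labels: dict = field(default_factory=dict)  # mu: node id -> (x, y)
    edges: set = field(default_factory=set)          # E: set of frozenset({u, v})

    def add_node(self, node_id, xy):
        # The node set V is given implicitly by the keys of node_labels.
        self.node_labels[node_id] = (float(xy[0]), float(xy[1]))

    def add_edge(self, u, v):
        if u == v or u not in self.node_labels or v not in self.node_labels:
            raise ValueError("an edge must join two distinct existing nodes")
        # A frozenset ignores orientation, so (u, v) and (v, u) coincide.
        self.edges.add(frozenset((u, v)))


g = Graph()
g.add_node(0, (3, 4))
g.add_node(1, (7, 4))
g.add_edge(0, 1)
g.add_edge(1, 0)  # same undirected edge, stored only once
```

Since edges carry no label here, the edge labelling function \(\nu\) needs no explicit representation.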

Different processing steps are necessary for the extraction of graphs from word images. In the framework presented in this paper, the document images are first preprocessed by means of Difference of Gaussians (DoG) filtering and binarisation to reduce the influence of noise [14]. On the basis of these preprocessed document images, single word images are automatically segmented from the document and labelled with a ground truth. Next, word images are skeletonised by a \(3 \times 3\) thinning operator [15]. We denote segmented word images that are binarised and filtered by B. If the image is additionally skeletonised we use the term S.

Graph-based word representations aim at extracting the inherent characteristics of these preprocessed word images. Figure 1 presents an overview of the six different graph representations, which are thoroughly described in the next four subsections. All of the proposed extraction methods result in graphs where the nodes are labelled with two-dimensional attributes, i.e. \(L_V=\mathbb {R}^2\), while edges remain unlabelled, i.e. \(L_E=\{\}\).

In all six cases, graphs are normalised in order to reduce the variation in the node labels \((x,y) \in \mathbb {R}^2\) that is due to different word image sizes. Formally, we apply the following transformation to the coordinate pairs (x, y) of all nodes of the current graph.

$$\begin{aligned} \hat{x} = \frac{x - \mu _x}{\sigma _x} \text { and } \hat{y} = \frac{y - \mu _y}{\sigma _y} \end{aligned}$$

where \(\mu _x\), \(\mu _y\) and \(\sigma _x\), \(\sigma _y\) denote the mean values and the standard deviations of all node labels in the current graph (in x- and y-direction, respectively).
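The normalisation can be sketched in a few lines. The function name `normalise` is hypothetical. Two details are assumptions, as they are not fixed above: the population standard deviation is used, and a zero standard deviation is replaced by 1 so that the coordinates are only centred in this degenerate case.

```python
from statistics import mean, pstdev


def normalise(labels):
    """z-score the node labels of one graph; labels is a list of (x, y) pairs."""
    xs = [x for x, _ in labels]
    ys = [y for _, y in labels]
    mu_x, mu_y = mean(xs), mean(ys)
    # Assumption: population standard deviation; fall back to 1.0 if it is zero.
    sigma_x = pstdev(xs) or 1.0
    sigma_y = pstdev(ys) or 1.0
    return [((x - mu_x) / sigma_x, (y - mu_y) / sigma_y) for x, y in labels]


normalised = normalise([(0, 0), (2, 4)])
```

After this step, graphs extracted from small and large renderings of the same word carry comparable node labels.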

Fig. 1. Different graph representations of the word “Letters”

2.1 Graph Extraction Based on Keypoints

The first graph extraction algorithm is based on the detection of specific keypoints in the word images. Keypoints are characteristic points in a word image, such as end points and intersection points of strokes. The proposed approach is inspired by [16] and is used for keyword spotting in [17]. In the following, Algorithm 1 and its description are taken from [17].

Graphs are created on the basis of filtered, binarised, and skeletonised word images S (see Algorithm 1, denoted by Keypoint from now on). First, end points and junction points are identified for each Connected Component (CC) of the skeleton image (see line 2 of Algorithm 1). For circular structures, such as the letter ‘O’, the upper left point is selected as junction point. Note that the skeletons based on [15] may contain several neighbouring end or junction points. We apply a local search procedure to select only one point at each ending and junction (this step is not explicitly formalised in Algorithm 1). Both end points and junction points are added to the graph as nodes, labelled with their image coordinates (x, y) (see line 3).

Next, junction points are removed from the skeleton, dividing it into Connected Subcomponents (\(CC_{sub}\)) (see line 4). Afterwards, for each connected subcomponent, intermediate points \((x,y) \in CC_{sub}\) are converted to nodes and added to the graph in equidistant intervals of size D (see lines 5 and 6).

Finally, an undirected edge (u, v) between \(u \in V\) and \(v \in V\) is inserted into the graph for each pair of nodes that is directly connected by a chain of foreground pixels in the skeleton image S (see lines 7 and 8).

Algorithm 1 (Keypoint)
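The detection of end and junction points can be illustrated as follows. The rule used here is an assumption, since the detection criterion is not spelled out above: a skeleton pixel with exactly one foreground pixel in its 8-neighbourhood is taken as an end point, and one with three or more as a junction candidate. The function name `keypoints` is hypothetical.

```python
def keypoints(skel):
    """skel: list of rows of 0/1 values. Returns (end_points, junction_points) as (x, y) lists."""
    h, w = len(skel), len(skel[0])
    ends, junctions = [], []
    for y in range(h):
        for x in range(w):
            if not skel[y][x]:
                continue
            # Count foreground pixels in the 8-neighbourhood, respecting the image border.
            n = sum(
                skel[y + dy][x + dx]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dy or dx) and 0 <= y + dy < h and 0 <= x + dx < w
            )
            if n == 1:
                ends.append((x, y))       # assumed rule: one neighbour -> end point
            elif n >= 3:
                junctions.append((x, y))  # assumed rule: three or more -> junction candidate
    return ends, junctions


# A small T-shaped skeleton.
skeleton = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
ends, junctions = keypoints(skeleton)
```

On this T-shaped example the rule returns three end points but four adjacent junction candidates around the crossing, which illustrates why a local search that keeps only one point per junction is needed.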

2.2 Graph Extraction Based on a Segmentation Grid

The second graph extraction algorithm is based on a grid-wise segmentation of word images. Grids have been used to describe features of word images like Local Gradient Histogram (LGH) [18] or Histogram of Oriented Gradients (HOG) [19]. However, to the best of our knowledge, grids have not been used to represent word images by graphs.

Graphs are created on the basis of binarised and filtered, yet not skeletonised, word images B (see Algorithm 2). First, the dimension of the segmentation grid, defined by the number of columns C and rows R, is derived (see lines 2 and 3 of Algorithm 2). Formally, we compute

$$\begin{aligned} C = \frac{\text {Width of } B}{w} \text { and } R = \frac{\text {Height of } B}{h}, \end{aligned}$$

where w and h denote the user defined width and height of the resulting segments.

Next, a word image B is divided into \(C \times R\) segments of equal size. For each segment \(s_{ij}\) \((i=1,\ldots ,C;j=1,\ldots ,R)\) a node is inserted into the resulting graph and labelled by the (x, y)-coordinates of the centre of mass \((x_m,y_m)\) (see line 4). Formally, we compute

$$\begin{aligned} x_m = \frac{1}{n} \sum \limits _{w=1}^n x_w \text { and } y_m = \frac{1}{n} \sum \limits _{w=1}^n y_w, \end{aligned} \tag{1}$$

where n denotes the number of foreground pixels in segment \(s_{ij}\), while \(x_w\) and \(y_w\) denote the x- and y-coordinates of the foreground pixels in \(s_{ij}\). If a segment does not contain any foreground pixel, no centre of mass can be determined and thus no node is created for this segment.
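The node creation on a segmentation grid can be sketched as follows. The function name `grid_nodes` is hypothetical. Two choices are assumptions for illustration: the number of columns and rows is rounded up when the image size is not a multiple of the segment size, and segments are indexed from 0 rather than from 1.

```python
def grid_nodes(img, w, h):
    """img: list of rows of 0/1 values; w, h: segment width and height.

    Returns {(i, j): (x_m, y_m)} for every segment with at least one foreground pixel.
    """
    height, width = len(img), len(img[0])
    cols = -(-width // w)   # assumption: round C up
    rows = -(-height // h)  # assumption: round R up
    nodes = {}
    for j in range(rows):
        for i in range(cols):
            pixels = [
                (x, y)
                for y in range(j * h, min((j + 1) * h, height))
                for x in range(i * w, min((i + 1) * w, width))
                if img[y][x]
            ]
            if pixels:  # empty segments yield no node
                n = len(pixels)
                nodes[(i, j)] = (
                    sum(x for x, _ in pixels) / n,
                    sum(y for _, y in pixels) / n,
                )
    return nodes


image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
nodes = grid_nodes(image, 2, 2)
```

In this toy image only two of the four segments contain foreground pixels, so the resulting graph has two nodes.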

Finally, undirected edges (u, v) are inserted into the graph according to one of three edge insertion algorithms, viz. Node Neighbourhood Analysis (NNA), Minimal Spanning Tree (MST), or Delaunay Triangulation (DEL). The first algorithm analyses the four neighbouring segments on top, left, right, and bottom of a node \(u \in V\). In case a neighbouring segment of u is also represented by a node \(v \in V\), an undirected edge (u, v) between u and v is inserted into the graph. The second algorithm reduces the edges inserted by the Node Neighbourhood Analysis by means of a Minimal Spanning Tree algorithm. Hence, in this case the graphs are actually transformed into trees. Finally, the third algorithm is based on a Delaunay Triangulation of all nodes \(u \in V\). We denote these algorithmic procedures by Grid-NNA, Grid-MST, and Grid-DEL (depending on which edge insertion algorithm is employed).

Algorithm 2 (Grid-NNA, Grid-MST, Grid-DEL)
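The first two edge insertion strategies can be sketched as follows, with nodes given as a mapping from segment index (i, j) to the node label. The function names are hypothetical. For the spanning tree, Kruskal's algorithm with the Euclidean distance between node labels as edge weight is an assumption, since neither the MST algorithm nor the weighting is specified above. The Delaunay variant is omitted, as it would require a computational geometry library.

```python
from math import dist


def nna_edges(nodes):
    """Connect nodes whose segments are direct top/left/right/bottom neighbours."""
    edges = set()
    for (i, j) in nodes:
        # Checking only right and bottom visits every neighbouring pair exactly once.
        for neighbour in ((i + 1, j), (i, j + 1)):
            if neighbour in nodes:
                edges.add(((i, j), neighbour))
    return edges


def mst_edges(nodes, edges):
    """Reduce the given edges to a spanning tree (forest if disconnected) via Kruskal."""
    parent = {u: u for u in nodes}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    tree = set()
    # Assumption: edge weight is the Euclidean distance between the node labels.
    for u, v in sorted(edges, key=lambda e: (dist(nodes[e[0]], nodes[e[1]]), e)):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree.add((u, v))
    return tree


full_grid = {(0, 0): (0.5, 0.5), (1, 0): (2.5, 0.5), (0, 1): (0.5, 2.5), (1, 1): (2.5, 2.5)}
nna = nna_edges(full_grid)
mst = mst_edges(full_grid, nna)
```

For the fully occupied \(2 \times 2\) grid, the neighbourhood analysis yields a cycle of four edges, of which the spanning tree keeps three.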

2.3 Graph Extraction Based on Projection Profiles

The third graph extraction algorithm is based on an adaptive rather than a fixed segmentation of word images. That is, the individual word segment sizes are adapted with respect to projection profiles. Projection profiles have been used for skew correction [20] and feature vectors of word images [21], to name just two examples. However, to the best of our knowledge, projection profiles have not been used to represent word images by graphs.

Graphs are created on the basis of binarised and filtered word images B (see Algorithm 3, denoted by Projection from now on). First, a histogram of the vertical projection profile \(P_v=\{p_{1},\ldots ,p_{max}\}\) is computed, where \(p_i\) represents the frequency of foreground pixels in column i of B and max is the width of B (see line 2 of Algorithm 3). Next, we split B vertically by searching for so-called white spaces, i.e. subsequences \(\{p_i,\ldots ,p_{i+k}\}\) with \(p_i=\ldots =p_{i+k}=0\). To this end, we split B in the middle of each white space, i.e. at column position \(p=\lfloor (i+(i+k))/2\rfloor \), into n segments \(\{s_1,\ldots ,s_n\}\) (see line 3). In the best possible case a segment encloses word parts that semantically belong together (e.g. characters). Next, further segments are created in equidistant intervals \(D_v\) when the width of a segment \(s \in B\) is greater than \(D_v\) (see lines 4 and 5).

The same procedure as described above is then applied to each (vertical) segment \(s \in B\) (rather than the whole word image B) based on the projection profile of rows (rather than columns) (see lines 6 to 10). Thus, each segment s is individually divided into horizontal segments \(\{s_{1},\ldots ,s_{n}\}\). Note that a user defined parameter \(D_h\) controls the number of additional segmentation points (similar to \(D_v\)). Subsequently, for each segment \(s \in B\) a node is inserted into the resulting graph and labelled by the (x, y)-coordinates of the centre of mass \((x_m,y_m)\) (see (1) as well as lines 11 and 12). If a segment consists of background pixels only, no centre of mass can be determined and thus no node is created for this segment.

Finally, an undirected edge (u, v) between \(u \in V\) and \(v \in V\) is inserted into the graph for each pair of nodes, if the corresponding pair of segments is directly connected by a chain of foreground pixels in the skeletonised word image S (see lines 13 and 14).

Algorithm 3 (Projection)
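The column-wise white space search of Projection can be sketched as follows. The function names are hypothetical, columns are indexed from 0, and the cut is placed at the midpoint of the column indices that delimit a white space. Ignoring white space at the left and right image margins is an assumption made for this example.

```python
def vertical_profile(img):
    """p_i = number of foreground pixels in column i of the 0/1 image."""
    return [sum(column) for column in zip(*img)]


def split_positions(profile):
    """Return one cut position in the middle of every inner run of zeros."""
    cuts, i, n = [], 0, len(profile)
    while i < n:
        if profile[i] == 0:
            j = i
            while j + 1 < n and profile[j + 1] == 0:
                j += 1  # extend the white space to its last column j
            # Assumption: leading and trailing margins are not split.
            if i > 0 and j < n - 1:
                cuts.append((i + j) // 2)
            i = j + 1
        else:
            i += 1
    return cuts


word = [
    [1, 1, 0, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 0, 1, 0, 1],
]
profile = vertical_profile(word)
cuts = split_positions(profile)
```

The same two functions can be reused for the row-wise pass by applying them to the transposed segment.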

2.4 Graph Extraction Based on Splittings

The fourth graph extraction algorithm is based on an adaptive and iterative segmentation of word images by means of horizontal and vertical splittings. Similar to Projection, the segmentation is based on projection profiles of word images. Yet, the two algorithmic procedures clearly differ from each other. To the best of our knowledge, such a split-based segmentation has not been used to represent word images by graphs.

Graphs are created on the basis of binarised and filtered word images B (see Algorithm 4, denoted by Split from now on). Each segment \(s \in B\) (initially B is regarded as one segment) is iteratively split into smaller subsegments until the width and height of each segment \(s \in B\) are below the thresholds \(D_w\) and \(D_h\), respectively (see lines 2 to 12). Formally, each segment \(s \in B\) (with width greater than threshold \(D_w\)) is vertically subdivided into subsegments \(\{s_{1},\ldots ,s_{n}\}\) by means of the projection profile \(P_v\) of s (for further details we refer to Sect. 2.3). If the histogram \(P_v\) contains no white spaces, i.e. \(p_i \ne 0\) for all \(p_i \in P_v\), the segment s is split at its vertical centre into \(\{s_1,s_2\}\) (see lines 3 to 7). Next, the same procedure as described above is applied to each segment \(s \in B\) (with height greater than threshold \(D_h\)) in the horizontal, rather than vertical, direction (see lines 8 to 12).

Once no segment \(s \in B\) can be split further, the centre of mass \((x_m,y_m)\) (see (1)) is computed for each segment \(s \in B\) and a node is inserted into the graph labelled by the (x, y)-coordinates of the closest point on the skeletonised word image S to \((x_m,y_m)\) (see lines 13 and 14). If a segment consists of background pixels only, no centre of mass can be determined and thus no node is created for this segment.

Finally, an undirected edge (u, v) between \(u \in V\) and \(v \in V\) is inserted into the graph for each pair of nodes, if the corresponding pair of segments is directly connected by a chain of foreground pixels in the skeletonised word image S (see lines 15 and 16).

Algorithm 4 (Split)
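The iterative segmentation of Split can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact procedure of Algorithm 4: segments are half-open boxes (x0, x1, y0, y1), each box is split vertically before it is split horizontally, white space at the border of a box is ignored, and a box without inner white space is split at its centre. All function names are hypothetical.

```python
def box_profile(img, box, vertical):
    """Projection profile of a box: per column if vertical, else per row."""
    x0, x1, y0, y1 = box
    if vertical:
        return [sum(img[y][x] for y in range(y0, y1)) for x in range(x0, x1)]
    return [sum(img[y][x] for x in range(x0, x1)) for y in range(y0, y1)]


def cut_points(profile):
    """Midpoints of inner runs of zeros, or the centre if there is no such run."""
    cuts, i, n = [], 0, len(profile)
    while i < n:
        if profile[i] == 0:
            j = i
            while j + 1 < n and profile[j + 1] == 0:
                j += 1
            if i > 0 and j < n - 1:  # assumption: border white space is ignored
                cuts.append((i + j) // 2)
            i = j + 1
        else:
            i += 1
    return cuts or [n // 2]


def split(img, d_w, d_h):
    """Split the image until every box has width <= d_w and height <= d_h."""
    if d_w < 1 or d_h < 1:
        raise ValueError("thresholds must be at least 1")
    todo, done = [(0, len(img[0]), 0, len(img))], []
    while todo:
        x0, x1, y0, y1 = box = todo.pop()
        if x1 - x0 > d_w:
            marks = [0] + cut_points(box_profile(img, box, True)) + [x1 - x0]
            todo += [(x0 + a, x0 + b, y0, y1) for a, b in zip(marks, marks[1:])]
        elif y1 - y0 > d_h:
            marks = [0] + cut_points(box_profile(img, box, False)) + [y1 - y0]
            todo += [(x0, x1, y0 + a, y0 + b) for a, b in zip(marks, marks[1:])]
        else:
            done.append(box)
    return sorted(done)


word = [
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
]
boxes = split(word, 2, 2)
```

In this example the middle box contains background pixels only, so it would not contribute a node to the resulting graph.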

3 Experimental Evaluation

The proposed graph extraction algorithms are evaluated on preprocessed word images of the George Washington (GW) dataset, which consists of twenty different multi-writer letters with only minor variations in the writing style. The same documents have been used in [12, 13].

For our benchmark dataset, a number of perfectly segmented word images is divided into three independent subsets, viz. a training set (90 words), a validation set (60 words), and a test set (143 words). Each set contains instances of thirty different words. The validation and training set contain two and three instances per word, respectively, while the test set contains at most five and at least three instances per word. For each word image, one graph is created by means of each of the six different graph extraction algorithms (using different parameterisations).

In Table 1 an overview of the validated meta-parameters for each graph extraction method is given. Roughly speaking, small meta-parameter values result in graphs with a higher number of nodes and edges, while large meta-parameter values result in graphs with a smaller number of nodes and edges.

Table 1. Validated meta-parameters of each graph extraction algorithm

The quality of the different graph representation formalisms is evaluated by means of the accuracy of a kNN-classifier that operates on approximated Graph Edit Distances (GED) [22]. The meta-parameters are optimised with respect to the accuracy of the kNN on the validation set. Then, the accuracy of the kNN-classifier is measured on the test set using the optimal meta-parameters for each graph extraction method.
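The evaluation protocol can be illustrated by a kNN-classifier that operates on precomputed dissimilarities. In this sketch the approximate GED computation itself is not shown; the distance values, the word labels, and the function names are hypothetical. Breaking voting ties in favour of the nearest neighbour is likewise an assumption.

```python
from collections import Counter


def knn_predict(distances, train_labels, k=1):
    """distances[i] is the dissimilarity between the test graph and training graph i."""
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    votes = Counter(train_labels[i] for i in nearest)
    best = max(votes.values())
    # Assumption: ties are resolved in favour of the closest training graph.
    for i in nearest:
        if votes[train_labels[i]] == best:
            return train_labels[i]


def accuracy(dist_matrix, train_labels, test_labels, k=1):
    """Fraction of test graphs whose predicted word equals the ground truth."""
    hits = sum(
        knn_predict(row, train_labels, k) == truth
        for row, truth in zip(dist_matrix, test_labels)
    )
    return hits / len(test_labels)


# Hypothetical dissimilarities: one row per test graph, one column per training graph.
train_labels = ["the", "the", "and"]
test_labels = ["the", "the"]
dist_matrix = [
    [0.2, 0.9, 0.5],
    [0.8, 0.7, 0.1],
]
acc_1nn = accuracy(dist_matrix, train_labels, test_labels, k=1)
acc_3nn = accuracy(dist_matrix, train_labels, test_labels, k=3)
```

Optimising a meta-parameter then amounts to repeating this computation on the validation set for each candidate value and keeping the value with the highest accuracy.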

In Table 2, the optimal meta-parameters, the median number of nodes \(\bar{|V|}\) and edges \(\bar{|E|}\) (defined over training, validation and test set) as well as the accuracy of the kNN on the test set are shown for each extraction method. We observe that the Projection and Split extraction methods perform best among all algorithms with a classification accuracy of about 82 % and 80 % on the test set, respectively. Keypoint achieves the third best classification result with about 77 %. However, the average number of nodes and edges of both Projection and Split is substantially lower than that of Keypoint. Grid-MST achieves an accuracy which is virtually the same as with Keypoint, yet with substantially fewer nodes and edges. The worst classification accuracies are obtained with Grid-NNA and Grid-DEL (about 65 % and 63 %, respectively). Note especially the large number of edges produced by the Delaunay triangulation.

Table 2. Classification accuracy of each graph representation formalism

4 Conclusion and Outlook

The novel graph database presented in this paper is based on six graph extraction methods. These methods aim at extracting the inherent characteristics of handwritten word images and represent these characteristics by means of graphs. The Keypoint extraction method is based on the representation of nodes by characteristic points on the handwritten stroke, while edges represent strokes between these keypoints. Three of our extraction methods, viz. Grid-NNA, Grid-MST and Grid-DEL, are based on a grid-wise segmentation of a word image. Each segment of this grid is represented by a node which is then labelled by the centre of mass of the segment. Finally, the Projection and Split extraction methods are based on vertical and horizontal segmentations by means of projection profiles. An empirical evaluation of the six extraction algorithms is carried out on the George Washington letters. The achieved accuracy can be seen as a first benchmark to be used for future experiments. Moreover, the experimental results indicate that both Projection and Split are well suited for extracting meaningful graphs from handwritten words.

In future work we plan to extend our graph database to further documents using the presented extraction methods and make them all publicly available. Thus, we will be able to provide a more comparative and thorough study against other state-of-the-art representation formalisms at a later stage.