1 Introduction

Brand logos are graphic entities that represent organizations, goods, etc. Logos are designed mainly for decorative and identification purposes. A specific logo can have several different representations, and some logos can be very similar in some aspects. Logo classification in natural scenes is a challenging problem because logos often appear at various angles and sizes, which makes keypoint extraction harder, especially given significant variations in texture, poor illumination, and high intra-class variation (Fig. 1). Automatic classification of logos gives the marketing industry a powerful tool for evaluating the impact of brands. Marketing campaigns and media can benefit from this tool, for example by detecting unauthorized distribution of copyrighted material.

Several techniques and approaches have been proposed in recent decades for object classification, such as Bag of Visual Words (BoVW), Deep Convolutional Neural Networks (DCNN), and feature matching with RANSAC. The most successful approaches in logo classification have been based on the BoVW model and on DCNNs. The most recent approaches in logo recognition use a Region-Based Convolutional Neural Network (RCNN) [15], which introduces a selective search to find candidate regions. This approach has shown strong results compared to those proposed in previous research. Despite the good results achieved by RCNN, it is hard to train and test because selective search generates many potential bounding boxes, each of which is then categorized by a classifier. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and re-score the boxes.

In this paper, we propose an approach for logo detection based on a deep learning model. Our results outperform those of the state-of-the-art approaches, achieving a higher accuracy on the FlickrLogos-32 dataset. Our proposal uses transfer learning to improve the logo image representations; it is not only more accurate but also faster at logo detection, processing 19 images per second on an NVIDIA Titan X card.

Fig. 1. This figure exemplifies the challenges of classifying logos in natural scenes, such as high intra-class variation, warping, occlusion, rotation, translation and scale changes.

2 Related Work

Different approaches for logo recognition have been proposed over the last years. Until a few years ago, only shallow classifiers had been proposed to solve this challenge [3, 18, 19]. With the increasing popularity of deep learning frameworks and their success in image recognition, approaches based on deep learning started to appear [2, 8].

The first successful approach in logo recognition was based on contours and shapes in images with a uniform background. Francesconi et al. [6] proposed an adaptive model using a recursive neural network; the authors used the area and the perimeter of the logo as features.

After 2007, with the popularization of SIFT, countless applications started to use it because of its robustness to rotation and scale transformations and its partial invariance to occlusions. Many approaches based their proposals on SIFT descriptors [3, 17, 18, 19, 20].

RANSAC became a popular technique for object recognition after its use in Lowe’s research work [13]. Lowe used it to compare matched descriptors and find outliers, eliminating false positive matches and thus locating the object. In logo recognition, some researchers explored this method and achieved significant results, e.g. [3, 20].

Deep Convolutional Neural Networks are a current trend in computer vision, especially in object recognition. Recent approaches in logo recognition have applied this technique and achieved impressive results [5, 7, 8, 15].

3 Deep Convolutional Neural Networks

An Artificial Neural Network is a machine learning model for classification whose layers are composed of interconnected neurons with weights that are learned by a training process. A Convolutional Neural Network (CNN) is a type of feedforward artificial neural network and a variation of the multilayer perceptron. A neural network with three or more hidden layers is called a deep network.

Transfer Learning. In a CNN, each layer learns to “understand” specific features of the image. The first layers usually learn generic features such as edges and gradients; the deeper the layer, the more specific the features it detects. To “understand” these features, the network must be trained, adjusting its weights according to a predefined loss function. If the weights are initialized with random values, training requires many more images and iterations than starting from pretrained weights. Starting from weights trained on another dataset is called “fine-tuning”, and it has been shown to be extremely advantageous compared to training a network from scratch [4]. This technique is useful when the number of training images per class is small (e.g. 40 images for this problem), which makes it hard for the CNN to learn. Furthermore, transfer learning also speeds up training convergence [16].

Data Augmentation. Training a DCNN requires a lot of data, especially for very large or deep networks. When the dataset does not provide enough training images, more images can be added through data augmentation. This process consists of creating new synthetic images that simulate different viewing angles, distortions, occlusions, lighting changes, etc. This technique usually increases the robustness of the network, leading to better results.
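As an illustration of the idea only, the sketch below produces one synthetic sample from a grayscale image by combining a lighting change with a simulated occlusion. The function names, the brightness range and the patch size are assumptions made for this example; they are not the augmentation settings used in our experiments.

```python
import random


def adjust_brightness(image, factor):
    """Scale pixel intensities and clip them to the 0..255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


def occlude(image, top, left, height, width, fill=0):
    """Paint a filled rectangle over a copy of the image (simulated occlusion)."""
    out = [row[:] for row in image]
    for y in range(top, min(top + height, len(out))):
        for x in range(left, min(left + width, len(out[0]))):
            out[y][x] = fill
    return out


def augment(image, rng):
    """Return one synthetic sample: random lighting change plus random occlusion."""
    h, w = len(image), len(image[0])
    factor = rng.uniform(0.7, 1.3)          # assumed range, illustrative only
    out = adjust_brightness(image, factor)
    ph, pw = max(1, h // 4), max(1, w // 4)  # assumed patch size: a quarter per side
    top = rng.randrange(0, h - ph + 1)
    left = rng.randrange(0, w - pw + 1)
    return occlude(out, top, left, ph, pw)


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [[100] * 8 for _ in range(8)]
    for row in augment(sample, rng):
        print(row)
```

Because the occlusion and lighting changes do not move the logo, the bounding-box annotations can be reused unchanged; geometric transformations would also require transforming the boxes.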

3.1 Single Shot MultiBox Detector

Single Shot MultiBox Detector (SSD [11, 12]) makes predictions based on feature maps taken at different stages of the network, and divides each one into a pre-established set of bounding boxes with different aspect ratios and scales. For each bounding box, the network generates a score for the presence of each object category and uses regression to produce adjustments to the box so that it better matches the object shape. SSD increases its robustness to scale variations by combining the predictions from feature maps of different resolutions. Finally, non-maximum suppression is applied to reduce redundant detections (Fig. 2).
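The final non-maximum suppression step can be sketched as the usual greedy procedure below. This is a minimal illustration rather than the Caffe implementation; the function names and the default overlap threshold of 0.45 are assumptions for this example. Boxes are (xmin, ymin, xmax, ymax) tuples.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it too much.

    Returns the indices of the kept boxes, ordered by decreasing score.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))  # the second box overlaps the first and is dropped
```

In a multi-class detector this procedure is typically applied per class, so that overlapping detections of different logos do not suppress each other.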

Fig. 2. (a) The final detection produced by the SSD. (b) A feature map with \(8 \times 8\) grid. (c) A feature map with \(4 \times 4\) grid and the output of each box, the location and scores for each class. Image extracted from [11].
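To make the grids of Fig. 2 concrete, the sketch below enumerates the pre-established boxes of one square feature map, following the parameterization described in [11]: each cell contributes one box per aspect ratio, centered on the cell. The function name, the scale values and the set of aspect ratios are assumptions chosen for illustration. Coordinates are normalized to the range 0..1.

```python
import math


def default_boxes(grid_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return (cx, cy, w, h) boxes for a grid_size x grid_size feature map.

    Each box is centered on a grid cell; its width and height are derived
    from the scale and the aspect ratio so that the area stays scale**2.
    """
    boxes = []
    for row in range(grid_size):
        for col in range(grid_size):
            cx = (col + 0.5) / grid_size
            cy = (row + 0.5) / grid_size
            for ar in aspect_ratios:
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return boxes


if __name__ == "__main__":
    fine = default_boxes(8, scale=0.2)    # small boxes on the 8 x 8 grid
    coarse = default_boxes(4, scale=0.4)  # larger boxes on the 4 x 4 grid
    print(len(fine), len(coarse))
```

The finer grid covers the image with many small boxes and the coarser grid with fewer large ones, which is how a single forward pass handles logos of different sizes.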

SSD Variants. The SSD approach uses a base network to extract features from images, which are then used by the detection layers. The extra layers in the SSD are responsible for detecting the object. There are some differences between SSD 300/500 and SSD 512. SSD 512 is an upgrade of SSD 500, with the following improvements:

  1. The pooling layer (pool6) between fully connected layers (fc6 and fc7) was removed;

  2. The authors added convolutional layers as extra layers;

  3. A new color distortion data augmentation, used for improving the quality of the image, is also added;

  4. The network populates the dataset by getting smaller training examples from expanded images;

  5. Better proposed bounding boxes by extrapolating the image’s boundary.

4 Our Approaches and Contributions

Logo detection can be considered a subproblem of object detection, since logos are usually objects with a planar surface. Our approaches are based on the SSD framework because it performs very well in object detection. We explore the performance of the SSD model in the logo image domain. We analyze the impact of using pretrained weights (transfer learning) rather than training from scratch. We compare different implementations of SSD, and we also explore the impact of the image warping needed to meet the shape requirements of the SSD input layer.

4.1 Transfer Learning Methodology

To use transfer learning, it was necessary to redesign the DCNN. This redesign remaps the last layer, adapting the class labels between the two datasets. All convolution and pooling layers are kept the same, and the last fully-connected layers (responsible for classification) are reorganized for the new dataset. For logo detection, fine-tuning started from a pretrained network, trained for 160,000 iterations on the PASCAL VOC2007 + VOC2012 + COCO datasets [12].
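The remapping can be pictured with the following framework-independent sketch. It assumes one simple rule: a layer of the new network reuses the pretrained weights when a layer with the same name and shape exists, and is randomly initialized otherwise. The function name, the dictionary layout, the layer names and the toy shapes are hypothetical and do not reproduce the Caffe model definitions.

```python
import random


def init_from_pretrained(pretrained, target_shapes, rng):
    """Build initial weights for a new network from a pretrained one.

    pretrained:    {layer_name: {"shape": tuple, "values": list}}
    target_shapes: {layer_name: tuple} describing the new network
    Returns the new weights and the names of the layers that were reused.
    """
    weights, reused = {}, []
    for name, shape in target_shapes.items():
        source = pretrained.get(name)
        if source is not None and source["shape"] == shape:
            # Feature-extraction layers: copy the pretrained weights.
            weights[name] = {"shape": shape, "values": list(source["values"])}
            reused.append(name)
        else:
            # Class-dependent layers: start from small random values.
            size = 1
            for dim in shape:
                size *= dim
            values = [rng.gauss(0.0, 0.01) for _ in range(size)]
            weights[name] = {"shape": shape, "values": values}
    return weights, reused


if __name__ == "__main__":
    pretrained = {
        "conv1": {"shape": (2, 2), "values": [0.1, 0.2, 0.3, 0.4]},
        "cls": {"shape": (21, 1), "values": [0.0] * 21},
    }
    # Toy target network: same feature layer, classifier resized for new labels.
    target = {"conv1": (2, 2), "cls": (33, 1)}
    weights, reused = init_from_pretrained(pretrained, target, random.Random(0))
    print(reused, len(weights["cls"]["values"]))
```

Only the layers whose output size depends on the number of classes start from scratch, which is why fine-tuning needs far fewer images than training the whole network.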

4.2 Our Proposed Approaches

We explored five different approaches; Table 1 shows the setups. All networks were trained for 100,000 iterations using the Nesterov optimizer [14] with a fixed learning rate of 0.001. SSD 300 and SSD 500 were explored only with pretrained weights because they were easily surpassed by SSD 512. The SSD 500 AR approach was an attempt to reduce the warping of the input image, since in both the training and testing phases SSD needs to fit the input image into a square resolution.
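One common way to fit an image into a square input without warping is to scale it uniformly and pad the remaining area. The sketch below illustrates this option under that assumption; it is not necessarily the exact procedure behind SSD 500 AR, and the function names are hypothetical.

```python
def letterbox_params(width, height, target=500):
    """Compute a uniform scale and padding that fit an image into a square.

    Returns (scale, new_width, new_height, pad_x, pad_y), where the padding
    is the offset of the resized image inside the target x target canvas.
    """
    scale = target / max(width, height)
    new_w = round(width * scale)
    new_h = round(height * scale)
    pad_x = (target - new_w) // 2
    pad_y = (target - new_h) // 2
    return scale, new_w, new_h, pad_x, pad_y


def map_box(box, scale, pad_x, pad_y):
    """Map an (xmin, ymin, xmax, ymax) box from the original image to the canvas."""
    xmin, ymin, xmax, ymax = box
    return (xmin * scale + pad_x, ymin * scale + pad_y,
            xmax * scale + pad_x, ymax * scale + pad_y)


if __name__ == "__main__":
    scale, new_w, new_h, pad_x, pad_y = letterbox_params(1000, 500)
    print(scale, new_w, new_h, pad_x, pad_y)
    print(map_box((100, 100, 300, 200), scale, pad_x, pad_y))
```

With plain warping, a 1000 x 500 image would be squeezed horizontally by a factor of two relative to its height; with the scheme above both axes use the same scale, at the cost of spending part of the input on padding.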

Table 1. Our five proposed approaches

5 Experiments

We evaluate and analyze our approaches on the FlickrLogos-32 dataset [19]. Our experiments ran on the Caffe deep learning framework [9] using \(2\times \) Nvidia Tesla K80 GPUs. We first describe the dataset, then compare the performance of our approaches, and finally compare our results to state-of-the-art methods in logo recognition.

5.1 Dataset

FlickrLogos-32 (FL32) is a challenging dataset on which the most promising approaches in logo recognition have been evaluated [3, 5, 8, 18]. The dataset was proposed by Romberg [19], who also defined an experimental protocol that splits it into training, validation and testing sets. In all approaches, we strictly follow this protocol. Table 2 shows the distribution among the training, validation and test sets. We used P1 + P2 (except no-logos) for training and P3 for testing.

Table 2. Evaluation protocol table
Fig. 3. The left image shows the F-score of our 5 different approaches. The right image shows the F-score, Precision and Recall of the best approach (AR - Preserving aspect ratio, FS - From scratch and PT - Fine-tuning of a pretrained model).

5.2 Comparison of Our Approaches

All five approaches are represented in the left chart of Fig. 3, while the right chart shows the metrics for our best approach, the pretrained SSD 512. The figure shows that SSD 300, SSD 500 and SSD 500 AR achieved poor results compared to SSD 512, and that in all cases using pretrained weights resulted in better performance. Considering only the best result, SSD 512 PT, we achieve our best F-score with a threshold of 90.
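The dependence of precision, recall and F-score on the detection threshold can be sketched as follows. The function names and the toy detections are assumptions for illustration; scores are expressed here in the range 0..1, and each detection is assumed to be already marked as correct or incorrect against the ground truth.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions; each metric is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def sweep_thresholds(detections, num_ground_truth, thresholds):
    """Evaluate the metrics for each confidence threshold.

    detections: list of (score, is_correct) pairs
    Returns a list of (threshold, precision, recall, f1) tuples.
    """
    results = []
    for t in thresholds:
        kept = [ok for score, ok in detections if score >= t]
        tp = sum(kept)                 # correct detections that survive
        fp = len(kept) - tp            # wrong detections that survive
        fn = num_ground_truth - tp     # ground-truth logos left undetected
        results.append((t, *precision_recall_f1(tp, fp, fn)))
    return results


if __name__ == "__main__":
    toy = [(0.95, True), (0.90, True), (0.60, False), (0.40, True)]
    for row in sweep_thresholds(toy, num_ground_truth=4, thresholds=[0.3, 0.5, 0.9]):
        print(row)
```

Raising the threshold removes low-confidence detections, which tends to increase precision and decrease recall; the best F-score is found where the two are best balanced.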

5.3 Comparison Against Other Research

The comparison between other research and our best result (SSD 512 with pretrained weights) is shown in Table 3. Our method outperforms the state of the art by 2.5% in F-score and by 7.4% in recall. The high recall is due to the fact that SSD uses some of its extra layers to estimate the object location and that it generalizes the object well. The approach proposed by Li et al. [10] achieved its high precision due to a feature matching process that eliminates false positive matches.

Table 3. Comparison of our best approach with state-of-the-art methods (HCF - Hand Crafted Feature, DL - Deep Learning)

6 Conclusion

In this work, we investigated the use of DCNN, transfer learning and data augmentation in a logo recognition system. Their combination has shown that DCNN is very suitable for this task: even with a relatively small training set, it provides greater recall and F-score. A relevant contribution of this paper is the use of data augmentation combined with transfer learning to overcome the lack of data and allow the use of deeper networks. These techniques improve the performance of DCNN in this scenario. The results of our approach reinforce the robustness of the DCNN approach, which surpasses the F1-scores reported in the literature.