1 Introduction

The problem of texture classification has been studied extensively in computer vision and pattern recognition. Most works focus on the definition of visual descriptors that are invariant, or at least robust, with respect to variations in the acquisition conditions, such as rotations and scalings of the image, and changes in brightness, contrast, and light color temperature [5, 19].

Visual descriptors are often divided into two categories: traditional hand-crafted and learned features. Traditional hand-crafted descriptors are features extracted using a manually predefined algorithm based on expert knowledge. These features can be either global or local. Global features describe an image as a whole in terms of color, texture and shape distributions [33]. Some notable examples of global hand-crafted features are color histograms [36], Gabor filters [32], Local Binary Patterns (LBP) [38], Histogram of Oriented Gradients (HOG) [28], the Dual Tree Complex Wavelet Transform (DT-CWT) [1, 5] and GIST [39]. Readers who wish to explore the subject further can refer to [34, 42, 50]. Local descriptors such as the Scale Invariant Feature Transform (SIFT) [30] describe salient patches around suitably chosen key points within the image. The dimension of the resulting feature vector depends on the number of key points chosen in the image. The most common approach to obtain a fixed-size feature vector is the Bag-of-Visual-Words (BoVW) model [46]. This approach has shown excellent performance in object recognition [26], image classification [17] and annotation [48]. The underlying idea is to quantize local descriptors into visual words by clustering. Words are defined as the centroids of the learned clusters and are representative of several similar local regions. Given an image, the local descriptor of each key point is mapped to the most similar visual word. The final feature vector of the image is the histogram of the visual words.
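
The BoVW pipeline described above can be sketched, for instance, as follows (a minimal sketch: the function names are ours, and k-means clustering via scikit-learn is just one possible choice for learning the codebook):

import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptor_sets, n_words=1024):
    # Stack the local descriptors extracted from all training images and
    # cluster them; the cluster centroids are the visual words.
    all_descriptors = np.vstack(descriptor_sets)
    return KMeans(n_clusters=n_words, n_init=4).fit(all_descriptors)

def bovw_histogram(descriptors, codebook):
    # Map each local descriptor of an image to its closest visual word and
    # describe the image with the normalized histogram of word occurrences.
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-12)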

Learned descriptors are features extracted using Convolutional Neural Networks (CNNs). CNNs are a class of learnable architectures used in many domains such as image recognition, image annotation and image retrieval [43]. CNNs are usually composed of several layers of processing, each involving linear as well as non-linear operators, that are learned jointly, in an end-to-end manner, to solve a particular task. A typical CNN architecture for image classification consists of one or more convolutional layers followed by one or more fully connected layers. The output of the last fully connected layer is the CNN output, and the number of output nodes is equal to the number of image classes [29].
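
A minimal sketch of such an architecture is given below (a toy configuration written in PyTorch for illustration only; it is not one of the networks compared in this paper):

import torch.nn as nn

class SimpleCNN(nn.Module):
    # Convolutional layers followed by fully connected layers; the last
    # fully connected layer has one output node per image class.
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 56 * 56, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):  # x: a batch of 3 x 224 x 224 images
        return self.classifier(self.features(x))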

A CNN that has been trained to solve a given task can also be adapted to solve a different one. In practice, very few people train an entire CNN from scratch, because it is relatively rare to have a dataset of sufficient size. Instead, it is common to take a CNN that is pre-trained on a very large dataset (e.g. ImageNet, which contains 1.2 million images from 1000 categories [25]), and then use it either as an initialization or as a fixed feature extractor for the task of interest [41, 49]. In the latter case, given an input image, the pre-trained CNN performs all the multilayered operations and the corresponding feature vector is the output of one of the fully connected layers [49]. This use of CNNs has proven to be very effective in many pattern recognition applications [41].
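
For example, using a pre-trained network as a fixed feature extractor can be sketched as follows (torchvision and ResNet-50 are used here purely as an illustration of the general scheme, not as the exact pipeline of the cited works):

import torch
from torchvision import models, transforms

# Load a network pre-trained on ImageNet and replace its 1000-way classifier
# with the identity, so that the output of the last hidden layer is returned.
net = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
net.fc = torch.nn.Identity()
net.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_descriptor(pil_image):
    # Returns a 2048-dimensional feature vector for the input image.
    with torch.no_grad():
        x = preprocess(pil_image).unsqueeze(0)  # add the batch dimension
        return net(x).squeeze(0)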

In this paper we present a comparison between hand-crafted and learned descriptors for color texture classification. The comparison is performed on five color texture databases that include images acquired under varying imaging conditions: scales, camera orientations, light orientations, light temperatures, etc. Each database allows us to study the robustness of visual descriptors with respect to a given imaging condition. Results demonstrate that learned descriptors, on average, significantly outperform hand-crafted descriptors. However, results obtained on the individual databases show that in the case of Outex 14, which includes training and test images taken under varying illuminant conditions, hand-crafted descriptors perform better than learned descriptors.

2 Visual Descriptors

For the comparison we select a number of descriptors from hand-crafted and CNN-based approaches [5, 9, 33]. In some cases we consider both color and gray-scale images. The gray-scale image L is defined as follows: \(L = 0.299R \,+\, 0.587G \,+\, 0.114B\). All feature vectors are \(L^2\) normalized.
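
For instance, the gray-scale conversion and the normalization can be implemented as follows (a minimal sketch with NumPy; the function names are ours):

import numpy as np

def to_grayscale(rgb):
    # L = 0.299R + 0.587G + 0.114B, computed per pixel on an H x W x 3 array.
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

def l2_normalize(feature_vector, eps=1e-12):
    # Scale the descriptor to unit Euclidean (L2) norm.
    return feature_vector / (np.linalg.norm(feature_vector) + eps)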

Hand-Crafted Descriptors. As global descriptors we consider some variants of well-known approaches such as histograms, Local Binary Patterns, wavelets and Gabor filters [22, 34], in particular:

  • 256-dimensional gray-scale histogram (Hist L) [36];

  • 768-dimensional RGB histograms (Hist RGB) [40];

  • 8-dimensional Dual Tree Complex Wavelet Transform features obtained considering four scales, mean and standard deviation, and three color channels (DT-CWT and DT-CWT L) [1, 5];

  • 512-dimensional Gist features obtained considering eight orientations and four scales for each channel (Gist RGB) [39];

  • 32-dimensional Gabor features composed of mean and standard deviation of six orientations extracted at four frequencies for each color channel (Gabor L and Gabor RGB) [4, 5];

  • 243-dimensional Local Binary Patterns (LBP) feature vector for each channel. We consider LBP applied to gray-scale images and to color images represented in RGB [31]. We select the LBP with a circular neighbourhood of radius 2 and 16 elements, and 18 uniform, non-rotation-invariant patterns (LBP L and LBP RGB);

  • 256-dimensional Local Color Contrast (LCC) obtained by using a quantized measure of color contrast [20];

  • 499-dimensional LBP L combined with the LCC descriptor, as described in [3, 18, 20, 23];

  • 729-dimensional LBP RGB combined with the LCC descriptor, as described in [20, 21].

As a local descriptor we consider the 1024-dimensional Bag of Visual Words (BoVW) representation of 128-dimensional Scale Invariant Feature Transform (SIFT) descriptors computed on the gray-scale image. The codebook of 1024 visual words is obtained from images taken from external sources.

Learned Descriptors. These descriptors are obtained as the intermediate representations of deep Convolutional Neural Networks originally trained for scene and object recognition. The networks are used to generate a visual descriptor by removing the final softmax nonlinearity and the last fully-connected layer. We select the most representative CNN architectures in the state of the art [49], covering different accuracy/speed trade-offs. All the CNNs are trained on the ILSVRC-2012 dataset using the same protocol as in [29]. In particular we consider the following visual descriptors [41]:

  • BVLC AlexNet (BVLC AlexNet): this is the AlexNet trained on ILSVRC 2012 [29].

  • BVLC Reference CaffeNet (BVLC Ref): an AlexNet trained on ILSVRC 2012, with a minor variation [49] from the version described in [29].

  • Fast CNN (Vgg F): similar to the network presented in [29], with a reduced number of convolutional layers and dense connectivity between them. The last fully-connected layer is 4096-dimensional [9].

  • Medium CNN (Vgg M): similar to the network presented in [51], with a reduced number of filters in the fourth convolutional layer. The last fully-connected layer is 4096-dimensional [9].

  • Medium CNN (Vgg M-2048-1024-128): three variants of the Vgg M network with a lower-dimensional last fully-connected layer, producing feature vectors of size 2048, 1024 and 128 respectively [9].

  • Slow CNN (Vgg S): similar to the network presented in [44], with a reduced number of convolutional layers, fewer filters in the fifth layer, and Local Response Normalization. The last fully-connected layer is 4096-dimensional [9].

  • Vgg Very Deep 16 and 19 layers (Vgg VeryDeep 16 and 19): the configuration of these networks has been obtained by increasing the depth to 16 and 19 layers, resulting in networks substantially deeper than the previous ones [45].

  • GoogleNet [47]: a 22-layer deep network architecture designed to improve the utilization of the computing resources inside the network.

  • ResNet 50: a Residual Network with 50 layers. The residual learning framework is designed to ease the training of networks that are substantially deeper than those used previously [27].

  • ResNet 101: a Residual Network made of 101 layers [27].

  • ResNet 152: a Residual Network made of 152 layers [27].

3 Texture Databases

For the evaluation of visual descriptors we consider five texture databases. Each database allows us to study the robustness of visual descriptors with respect to a given imaging condition. For instance, Outex 13 [37] contains images with no variations in the imaging conditions between training and test images. The KTH-TIPS2 [8] database contains images at different scales. Variations of the camera direction are included in the ALOT [7] database. Variations of the light temperature are included in the RawFooT [24] and ALOT databases. Variations of the light direction are included in the RawFooT, KTH-TIPS2 and ALOT databases. Outex 14 [37] includes images taken under lights with different color temperatures and positions. Table 1 summarizes the imaging condition variations included in each texture database.

Table 1. Variations of the imaging conditions included in the texture databases used in this paper.

3.1 Outex 13 and 14 Databases

Outex 13 and Outex 14 are part of the Outex collection [37] (the data set is available at http://www.outex.oulu.fi). The Outex collection contains images that depict textures of 68 different classes acquired under three different light sources, each positioned differently: the 2856 K incandescent CIE A (denoted as ‘inca’), the 2300 K horizon sunlight (denoted as ‘horizon’) and the 4000 K fluorescent TL84 (denoted as ‘TL84’). An overview of the Outex collection is reported in Figs. 1 and 2. The photographs have been acquired with a Sony DXC-755P camera calibrated using the ‘inca’ illuminant. Images are encoded in the linear camera RGB space [31].

Fig. 1. The 68 classes of the Outex collection under the light ‘inca’.

Fig. 2. Outex collection. Each row contains images from a different class. The first three columns contain images taken under the light ‘inca’; the second three columns under the light ‘horizon’; the last three columns under the light ‘TL84’.

The Outex 13 database is composed of three different sets, each including the images corresponding to one illuminant condition. The evaluation is performed on each set independently and results are reported as the average over the three results. Each set is made of 1360 images: 680 for training and 680 for testing. Within each set, training and test images are acquired with no variation in the illuminant conditions.

To study the robustness of the considered features to lighting variations we use the Outex 14 test suite, which is obtained by considering all the possible combinations of illuminants for the training and test sets. For instance, taking the training set from ‘inca’ and the test set from ‘horizon’ and ‘TL84’ we obtain the subset ‘inca’ vs ‘horizon/TL84’. In this way, considering each light source in turn for training, we obtain three different sets of images: ‘inca’ vs ‘horizon/TL84’, ‘horizon’ vs ‘inca/TL84’ and ‘TL84’ vs ‘inca/horizon’. The evaluation is performed on each set independently and results are reported as the average over the three results. Each set is made of 2040 images: 680 for training and 1360 for testing. For both Outex 13 and 14, texture images are obtained by subdividing the texture photographs into 20 sub-images of size \(128 \times 128\) pixels.

3.2 RawFooT Database

The Raw Food Texture database (RawFooT) has been specifically designed to investigate the robustness of descriptors and classification methods with respect to variations in the lighting conditions [24]. Classes correspond to 68 samples of raw food, including various kinds of meat, fish, cereals, fruit, etc. Samples taken under D65 at light direction \(\theta =24^{\circ }\) are shown in Fig. 3. The database includes images of the 68 texture samples acquired under 46 lighting conditions, which may differ in:

  1. the light direction: 24\(^\mathrm{\circ }\), 30\(^\mathrm{\circ }\), 36\(^\mathrm{\circ }\), 42\(^\mathrm{\circ }\), 48\(^\mathrm{\circ }\), 54\(^\mathrm{\circ }\), 60\(^\mathrm{\circ }\), 66\(^\mathrm{\circ }\), and 90\(^\mathrm{\circ }\);

  2. the illuminant color: 9 outdoor illuminants (D40, D45, ..., D95) and 6 indoor illuminants (2700 K, 3000 K, 4000 K, 5000 K, 5700 K and 6500 K), which we refer to as L27, L30, ..., L65;

  3. the intensity: 100%, 75%, 50% and 25% of the maximum achievable level;

  4. combinations of these factors.

Fig. 3. Overview of the 68 classes included in the Raw Food Texture database. For each class, the image taken under D65 at direction \(\theta = 24^{\circ }\) is shown.

For each of the 68 classes we consider 16 patches, obtained by dividing the original texture image, which is of size 800 \(\times \) 800 pixels, into 16 non-overlapping squares of size 200 \(\times \) 200 pixels. We select the images taken under half of the imaging conditions for training (indicated as set1, a total of 10336 images) and the remaining ones for testing (set2, a total of 10336 images). For each class we select eight patches for training and eight for testing by following a chessboard pattern (white positions are indicated as W, black positions as B). As a result, we obtain four sets:

  1. training: set1 at position W; test: set2 at position B;

  2. training: set1 at position B; test: set2 at position W;

  3. training: set2 at position W; test: set1 at position B;

  4. training: set2 at position B; test: set1 at position W.

The evaluation is performed on each set independently and results are reported as the average over the four results.
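
As an illustration, the patch extraction and the chessboard split described above can be sketched as follows (assuming the 16 patches are indexed row by row on a 4 x 4 grid; the helper names are ours):

import numpy as np

def tile_image(image, patch_size=200):
    # Divide the image into non-overlapping square patches, row by row
    # (an 800 x 800 image yields a 4 x 4 grid of 200 x 200 patches).
    rows, cols = image.shape[0] // patch_size, image.shape[1] // patch_size
    return [image[r * patch_size:(r + 1) * patch_size,
                  c * patch_size:(c + 1) * patch_size]
            for r in range(rows) for c in range(cols)]

def chessboard_split(n_rows=4, n_cols=4):
    # Assign each patch index to the W or B group, alternating like the
    # squares of a chessboard; each group contains half of the patches.
    white, black = [], []
    for r in range(n_rows):
        for c in range(n_cols):
            (white if (r + c) % 2 == 0 else black).append(r * n_cols + c)
    return white, black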

Fig. 4. The 11 classes of the KTH-TIPS2b database.

3.3 KTH-TIPS2b Database

The KTH-TIPS (Textures under varying Illumination, Pose and Scale) image database was created to extend the CUReT database in two directions: by providing variations in scale as well as pose and illumination, and by imaging other samples of a subset of its materials in different settings [8].

The KTH-TIPS2 database took this a step further by imaging 4 different samples of 11 materials (samples 1, 2, 3 and 4), each under varying pose, illumination and scale. Examples from the 11 classes are displayed in Fig. 4. Each sample has 108 patches acquired under different imaging conditions. We build 4 sets of images composed as follows:

  1. training: sample 1; test: samples 2, 3, 4;

  2. training: sample 2; test: samples 1, 3, 4;

  3. training: sample 3; test: samples 1, 2, 4;

  4. training: sample 4; test: samples 1, 2, 3.

The evaluation is performed on each set independently and results are reported as the average over the four results.

Fig. 5. The 250 classes of the ALOT database.

3.4 ALOT Database

The Amsterdam Library of Textures (ALOT) is a color image collection of 250 rough textures. In order to capture the sensory variation in object recordings, the authors systematically varied the viewing angle, illumination angle, and illumination color for each material. This collection is similar in spirit to the CUReT collection [7]. Examples from the 250 classes are displayed in Fig. 5.

The textures were placed on a turntable, and recordings were made at aspects of 0\(^\mathrm{\circ }\), 60\(^\mathrm{\circ }\), 120\(^\mathrm{\circ }\), and 180\(^\mathrm{\circ }\). Four cameras were used: three perpendicular to the light bow at 0\(^\mathrm{\circ }\) azimuth and 80\(^\mathrm{\circ }\), 60\(^\mathrm{\circ }\), and 40\(^\mathrm{\circ }\) altitude, and one mounted at 60\(^\mathrm{\circ }\) azimuth and 60\(^\mathrm{\circ }\) altitude. Combined with five illumination directions and one semi-hemispherical illumination, a sparse sampling of the BTF is obtained.

Each object was recorded with only one out of five lights turned on, yielding five different illumination angles. Furthermore, turning on all lights yields a sort of hemispherical illumination, although restricted to a narrower illumination sector than a true hemisphere. Each texture was recorded at a 3075 K illumination color temperature, at which the cameras were white balanced. One additional image per camera was recorded with all lights turned on, at a reddish spectrum of 2175 K color temperature.

For each of the 250 classes we consider 6 patches, obtained by dividing the original texture image into 6 non-overlapping squares of size 200 \(\times \) 200 pixels. For each class we have 100 textures acquired under different imaging conditions. For each texture we select three patches for training and three for testing by following a chessboard pattern (white positions are indicated as W, black positions as B). As a result, we obtain a training set made of 75000 images (W positions) and a test set made of 75000 images (B positions).

4 Experiments

In all the experiments we use the nearest neighbor classification strategy: given a patch in the test set, its distance with respect to all the training patches is computed. The prediction of the classifier is the class of the closest element in the training set. For this purpose, after some preliminary tests with several descriptors in which we evaluated the most common distance measures, we decided to use the L1 distance: \(d(\mathbf {x},\mathbf {y})=\sum _{i=1}^{\text {N}} | x_i - y_i |\), where \(\mathbf {x}\) and \(\mathbf {y}\) are two feature vectors. All the experiments are conducted under the maximum ignorance assumption, that is, no information about the imaging conditions of the test patches is available to the classification method or to the descriptors. Performance is reported as the classification rate (i.e., the ratio between the number of correctly classified images and the number of test images).
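
A minimal sketch of this classification scheme, assuming the descriptors are stored as rows of NumPy arrays, is the following:

import numpy as np

def predict(test_features, train_features, train_labels):
    # Nearest neighbor classification with the L1 (city-block) distance:
    # each test patch is assigned the class of the closest training patch.
    predictions = []
    for x in test_features:
        distances = np.sum(np.abs(train_features - x), axis=1)
        predictions.append(train_labels[np.argmin(distances)])
    return np.array(predictions)

def classification_rate(predictions, true_labels):
    # Ratio between correctly classified images and the number of test images.
    return float(np.mean(predictions == true_labels))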

Table 2. Accuracy of the selected color descriptors. For each column the best result is reported in bold. The visual descriptors are ordered by the last column, which reports the average accuracy over the databases. A black bullet denotes grayscale visual descriptors.

Table 2 reports the performance obtained by all the visual descriptors evaluated on each texture database. The list of visual descriptors is ordered by their average performance over the databases.

The results make it clear that, on average, learned features are more powerful than hand-crafted ones. In particular, residual networks are the most powerful. Residual networks are expected to be more effective because they contain many more layers than the other networks and are designed with the aim of being more accurate on image/object recognition tasks. The least powerful networks are the shallowest ones.

Amongst hand-crafted features, the most powerful are SIFT and some variants of LBP. On average, local descriptors perform better than global ones [10]. Moreover, grayscale hand-crafted descriptors perform better than color ones (in Table 2 grayscale descriptors are denoted with a black bullet).

Looking at the results achieved on each database, we observe that, in the case of no variations between training and test textures, the RGB histogram works better than learned descriptors. In the case of simultaneous variation of light temperature and direction, the best performing visual descriptor is LBP + LCC. This approach combines the power of grayscale LBP with a measure of Local Color Contrast that is invariant with respect to rotations and translations of the image plane, and with respect to several transformations in the color space [20]. In the case of variations of image scale, light temperature, light direction and camera direction, the learned descriptors outperform hand-crafted ones.

5 Conclusions

In this paper we focused on texture classification under variable imaging conditions. To this purpose, we evaluated several hand-crafted and learned descriptors on five state-of-the-art texture databases. Results show that learned descriptors, on average, perform better than hand-crafted ones. However, the results also show the limits of learned descriptors when the direction and the temperature of the light change simultaneously: in this case, SIFT and some variants of LBP perform better than learned descriptors. As future work, we plan to evaluate recent hybrid visual descriptors [2, 6, 11] that combine local and learned descriptors [12,13,14,15,16, 35].