Abstract
The paper presents a comparison between hand-crafted and learned descriptors for color texture classification. The comparison is performed on five color texture databases that include images acquired under varying imaging conditions: scales, camera orientations, light orientations, light color temperatures, etc. Results demonstrate that learned descriptors, on average, significantly outperform hand-crafted descriptors. However, results obtained on the individual databases show that in the case of Outex 14, which includes training and test images taken under varying illuminant conditions, hand-crafted descriptors perform better than learned descriptors.
1 Introduction
The problem of texture classification has been largely studied in computer vision and pattern recognition. Most of the works are focused on the definition of visual descriptors that are invariant, or at least robust, with respect to some variations in the acquisition conditions, such as rotations and scalings of the image, changes in brightness, contrast, and light color temperature [5, 19].
Visual descriptors are often divided into two categories: traditional hand-crafted and learned features. Traditional hand-crafted descriptors are features extracted using a manually predefined algorithm based on expert knowledge. These features can be global or local. Global features describe an image as a whole in terms of color, texture and shape distributions [33]. Some notable examples of global hand-crafted features are color histograms [36], Gabor filters [32], Local Binary Patterns (LBP) [38], Histogram of Oriented Gradients (HOG) [28], Dual Tree Complex Wavelet Transform (DT-CWT) [1, 5] and GIST [39]. Readers who wish to explore the subject further may refer to the following papers [34, 42, 50]. Local descriptors like Scale Invariant Feature Transform (SIFT) [30] provide a way to describe salient patches around properly chosen key points within the images. The dimension of the feature vector depends on the number of chosen key points in the image. The most common approach to reduce the size of feature vectors is the Bag-of-Visual-Words (BoVW) [46]. This approach has shown excellent performance in object recognition [26], image classification [17] and annotation [48]. The underlying idea is to quantize local descriptors by clustering them into visual words. Words are then defined as the centroids of the learned clusters and are representative of several similar local regions. Given an image, for each key point the corresponding local descriptor is mapped to the most similar visual word. The final feature vector of the image is represented by the histogram of the visual words.
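The BoVW pipeline just described can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the random vectors stand in for 128-dimensional SIFT descriptors, and a tiny k-means replaces the clustering used to build the real 1024-word codebook.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20, seed=0):
    """Tiny k-means: the learned centroids play the role of visual words."""
    r = np.random.default_rng(seed)
    centroids = X[r.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each descriptor to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned descriptors.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def bovw_histogram(descriptors, centroids):
    """Map each local descriptor to its nearest visual word and return
    the normalized histogram of word occurrences."""
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centroids)).astype(float)
    return hist / hist.sum()

# Stand-ins for SIFT descriptors: 500 training vectors, 40 from one image.
training_descriptors = rng.random((500, 128))
codebook = kmeans(training_descriptors, k=16)   # 1024 words in the paper
feature = bovw_histogram(rng.random((40, 128)), codebook)
```

The resulting `feature` is a fixed-length histogram regardless of how many key points the image contains, which is precisely why BoVW solves the variable-dimension problem of local descriptors.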
Learned descriptors are features extracted using Convolutional Neural Networks (CNNs). CNNs are a class of learnable architectures used in many domains such as image recognition, image annotation, image retrieval, etc. [43]. CNNs are usually composed of several layers of processing, each involving linear as well as non-linear operators, that are learned jointly, in an end-to-end manner, to solve a particular task. A typical CNN architecture for image classification consists of one or more convolutional layers followed by one or more fully connected layers. The result of the last fully connected layer is the CNN output. The number of output nodes is equal to the number of image classes [29].
A CNN that has been trained for solving a given task can also be adapted to solve a different task. In practice, very few people train an entire CNN from scratch, because it is relatively rare to have a dataset of sufficient size. Instead, it is common to take a CNN that is pre-trained on a very large dataset (e.g. ImageNet, which contains 1.2 million images with 1000 categories [25]), and then use it either as an initialization or as a fixed feature extractor for the task of interest [41, 49]. In the latter case, given an input image, the pre-trained CNN performs all the multilayered operations and the corresponding feature vector is the output of one of the fully connected layers [49]. This use of CNNs has proven to be very effective in many pattern recognition applications [41].
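The fixed-feature-extractor idea can be illustrated with a toy stand-in for the top of a network (not an actual pre-trained CNN: the random weight matrices below merely play the role of learned parameters, and real networks also have convolutional layers before these). The descriptor is the activation of the penultimate fully connected layer, obtained by dropping the last layer and the softmax.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Random stand-ins for the learned weights of the last two layers.
W_penultimate = rng.standard_normal((4096, 100)) * 0.01
W_last = rng.standard_normal((1000, 4096)) * 0.01  # 1000 ImageNet classes

def classify(x):
    h = relu(W_penultimate @ x)   # penultimate fully connected layer
    return softmax(W_last @ h)    # last layer + softmax -> probabilities

def extract_descriptor(x):
    """Stop at the penultimate fully connected layer: its activation
    is used as the learned visual descriptor."""
    return relu(W_penultimate @ x)

x = rng.random(100)               # stand-in for a preprocessed input image
descriptor = extract_descriptor(x)  # 4096-dim, as for Vgg F/M/S below
probs = classify(x)
```

The same input can thus either be classified (full forward pass) or described (truncated forward pass), which is exactly how the learned descriptors in this paper are produced.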
In this paper we present a comparison between hand-crafted and learned descriptors for color texture classification. The comparison is performed on five color texture databases that include images under varying imaging conditions: scales, camera orientations, light orientations, light temperatures, etc. Each database allows us to study the robustness of visual descriptors with respect to a given imaging condition. Results demonstrate that learned descriptors, on average, significantly outperform hand-crafted descriptors. However, results obtained on the individual databases show that in the case of Outex 14, which includes training and test images taken under varying illuminant conditions, hand-crafted descriptors perform better than learned descriptors.
2 Visual Descriptors
For the comparison we select a number of descriptors from hand-crafted and CNN-based approaches [5, 9, 33]. In some cases we consider both color and gray-scale images. The gray-scale image L is defined as follows: \(L = 0.299R \,+\, 0.587G \,+\, 0.114B\). All feature vectors are \(L^2\)-normalized (see Footnote 1).
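Both preprocessing steps are straightforward; a quick numpy sketch (the sample values are only for illustration):

```python
import numpy as np

def to_gray(rgb):
    """Luma conversion used in the paper: L = 0.299 R + 0.587 G + 0.114 B,
    applied pixel-wise to an H x W x 3 array."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def l2_normalize(v):
    """Divide a feature vector by its L2 norm (Footnote 1)."""
    return v / np.linalg.norm(v)

gray = to_gray(np.array([[[255.0, 0.0, 0.0], [0.0, 255.0, 0.0]]]))
print(gray)  # [[ 76.245 149.685]]

feat = l2_normalize(np.array([3.0, 4.0]))
print(feat)  # [0.6 0.8]
```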
Hand-Crafted Descriptors. As global descriptors we consider some variants of well known approaches such as Histogram, Local Binary Patterns, Wavelet and Gabor [22, 34], in particular:
- 256-dimensional gray-scale histogram (Hist L) [36];
- 768-dimensional RGB histograms (Hist RGB) [40];
- 8-dimensional Dual Tree Complex Wavelet Transform features obtained considering four scales, mean and standard deviation, and three color channels (DT-CWT and DT-CWT L) [1, 5];
- 512-dimensional Gist features obtained considering eight orientations and four scales for each channel (Gist RGB) [39];
- 32-dimensional Gabor features composed of the mean and standard deviation of six orientations extracted at four frequencies for each color channel (Gabor L and Gabor RGB) [4, 5];
- 243-dimensional Local Binary Patterns (LBP) feature vector for each channel. We consider LBP applied to gray-scale images and to color images represented in RGB [31]. We select LBP with a circular neighbourhood of radius 2 and 16 elements, and 18 uniform, non-rotation-invariant patterns (LBP L and LBP RGB);
- 256-dimensional Local Color Contrast (LCC) obtained by using a quantized measure of color contrast [20];
- 499-dimensional LBP L combined with the LCC descriptor, as described in [3, 18, 20, 23];
- 729-dimensional LBP RGB combined with the LCC descriptor, as described in [20, 21].
As a local descriptor we consider the 1024-dimensional Bag of Visual Words (BoVW) representation of 128-dimensional Scale Invariant Feature Transform (SIFT) descriptors computed on the gray-scale image. The codebook of 1024 visual words is obtained by exploiting images from external sources.
Learned Descriptors. These descriptors are obtained as the intermediate representations of deep Convolutional Neural Networks originally trained for scene and object recognition. The networks are used to generate a visual descriptor by removing the final softmax nonlinearity and the last fully-connected layer. We select the most representative CNN architectures in the state of the art [49], covering different accuracy/speed trade-offs. All the CNNs are trained on the ILSVRC-2012 dataset using the same protocol as in [29]. In particular we consider the following visual descriptors [41]:
- BVLC AlexNet: the AlexNet trained on ILSVRC 2012 [29].
- BVLC Reference CaffeNet (BVLC Ref): an AlexNet trained on ILSVRC 2012, with a minor variation [49] from the version described in [29].
- Fast CNN (Vgg F): similar to the network presented in [29], with a reduced number of convolutional layers and dense connectivity between them. The last fully-connected layer is 4096-dimensional [9].
- Medium CNN (Vgg M): similar to the network presented in [51], with a reduced number of filters in convolutional layer four. The last fully-connected layer is 4096-dimensional [9].
- Medium CNN (Vgg M-2048-1024-128): three modifications of the Vgg M network with a lower-dimensional last fully-connected layer. In particular we use feature vectors of size 2048, 1024 and 128 [9].
- Slow CNN (Vgg S): similar to the network presented in [44], with a reduced number of convolutional layers, fewer filters in layer five, and Local Response Normalization. The last fully-connected layer is 4096-dimensional [9].
- Vgg Very Deep 16 and 19 layers (Vgg VeryDeep 16 and 19): obtained by increasing the depth to 16 and 19 layers, resulting in networks substantially deeper than the previous ones [45].
- GoogleNet: a 22-layer network architecture designed to improve the utilization of the computing resources inside the network [47].
- ResNet 50: a Residual Network with 50 layers. The residual learning framework is designed to ease the training of networks substantially deeper than those used previously [27].
- ResNet 101: a Residual Network made of 101 layers [27].
- ResNet 152: a Residual Network made of 152 layers [27].
3 Texture Databases
For the evaluation of visual descriptors we consider five texture databases. Each database allows us to study the robustness of visual descriptors with respect to a given imaging condition. For instance, Outex 13 [37] and RawFooT [24] contain images with no variations in the imaging conditions between training and test images. The KTH-TIPS2 [8] database contains images at different scales. Variations of the camera directions are included in the ALOT [7] database. Variations of the light temperature are included in the RawFooT and ALOT databases. Variations of the light direction are included in the RawFooT, KTH-TIPS2 and ALOT databases. Outex 14 [37] includes images taken under lights with different temperatures and positioned differently. Table 1 summarizes all the imaging condition variations included in each texture database.
3.1 Outex 13 and 14 Databases
Outex 13 and Outex 14 are parts of the Outex collection [37] (the data set is available at the address http://www.outex.oulu.fi). The Outex collection contains images that depict textures of 68 different classes acquired under three different light sources, each positioned differently: the 2856 K incandescent CIE A (denoted as ‘inca’), the 2300 K horizon sunlight (denoted as ‘horizon’) and the 4000 K fluorescent TL84 (denoted as ‘TL84’). An overview of the Outex collection is reported in Figs. 1 and 2. The photographs have been acquired with a Sony DXC-755P camera calibrated using the ‘inca’ illuminant. Images are encoded in linear camera RGB space [31].
The Outex 13 database is composed of three different sets, one for each illuminant condition. The evaluation is performed on each single set independently and results are reported as the average over the three results. Each set is made of 1360 images: 680 for training and 680 for test, with no variations in the illuminant conditions between training and test images.
To study the robustness of the proposed features to lighting variations we consider the Outex 14 test suite, which is obtained by considering all the possible combinations of illuminants for the training and the test sets. For instance, considering the training set from ‘inca’ and test sets from ‘horizon’ and ‘TL84’ we obtain the subset ‘inca’ vs ‘horizon/TL84’. In this way, considering each light source as training, we obtain three different sets of images: ‘inca’ vs ‘horizon/TL84’, ‘horizon’ vs ‘inca/TL84’ and ‘TL84’ vs ‘inca/horizon’. The evaluation is performed on each single set independently and results are reported as the average over the three results. Each set is made of 2040 images: 680 for training and 1360 for test. For both Outex 13 and 14, texture images are obtained by subdividing texture photographs in 20 sub-images of size \(128 \times 128\) pixels.
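The three train/test combinations described above can be enumerated in a few lines of Python (a trivial sketch; the strings are simply the illuminant labels used in this section):

```python
# Enumerate the Outex 14 train/test illuminant combinations:
# each light source serves once for training, the other two for test.
illuminants = ["inca", "horizon", "TL84"]
splits = [(train, [t for t in illuminants if t != train])
          for train in illuminants]
for train, test in splits:
    print(f"{train} vs {'/'.join(test)}")  # e.g. "inca vs horizon/TL84"
```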
3.2 RawFooT Database
The Raw Food Texture database (RawFooT) has been specially designed to investigate the robustness of descriptors and classification methods with respect to variations in the lighting conditions [24]. Classes correspond to 68 samples of raw food, including various kinds of meat, fish, cereals, fruit, etc. Samples taken under D65 at light direction \(\theta =24^{\circ }\) are shown in Fig. 3. The database includes images of the 68 texture samples, acquired under 46 lighting conditions which may differ in:
1. the light direction: 24\(^\mathrm{\circ }\), 30\(^\mathrm{\circ }\), 36\(^\mathrm{\circ }\), 42\(^\mathrm{\circ }\), 48\(^\mathrm{\circ }\), 54\(^\mathrm{\circ }\), 60\(^\mathrm{\circ }\), 66\(^\mathrm{\circ }\), and 90\(^\mathrm{\circ }\);
2. the illuminant color: 9 outdoor illuminants (D40, D45, ..., D95) and 6 indoor illuminants (2700 K, 3000 K, 4000 K, 5000 K, 5700 K and 6500 K; we will refer to these as L27, L30, ..., L65);
3. the intensity: 100%, 75%, 50% and 25% of the maximum achievable level;
4. combinations of these factors.
For each of the 68 classes we consider 16 patches obtained by dividing the original texture image, which is of size 800 \(\times \) 800 pixels, into 16 non-overlapping squares of size 200 \(\times \) 200 pixels. We select images taken under half of the imaging conditions for training (indicated as set1, a total of 10336 images) and the remaining for testing (set2, a total of 10336 images). For each class we select eight patches for training and eight for testing by following a chessboard pattern (white position is indicated as W, black position as B). As a result, we obtain four sets:
1. training: set1 at position W; test: set2 at position B;
2. training: set1 at position B; test: set2 at position W;
3. training: set2 at position W; test: set1 at position B;
4. training: set2 at position B; test: set1 at position W.
The evaluation is performed on each single set independently and results are reported as average over the four results.
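The chessboard subdivision can be sketched as follows; this is a minimal illustration, where `img` is a zero-filled stand-in for an 800 \(\times \) 800 RawFooT photograph:

```python
import numpy as np

def chessboard_split(image, patch=200):
    """Divide a square image into non-overlapping patches and assign
    them to the W and B sets following a chessboard pattern."""
    n = image.shape[0] // patch          # 4x4 grid for an 800x800 image
    white, black = [], []
    for r in range(n):
        for c in range(n):
            p = image[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
            (white if (r + c) % 2 == 0 else black).append(p)
    return white, black

img = np.zeros((800, 800, 3))  # stand-in for one RawFooT photograph
W, B = chessboard_split(img)
print(len(W), len(B))  # 8 8
```

Alternating patches between training and test in this way guarantees that the two sets never share a patch while both cover the whole texture surface.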
3.3 KTH-TIPS2b Database
The KTH-TIPS (Textures under varying Illumination, Pose and Scale) image database was created to extend the CUReT database in two directions, by providing variations in scale as well as pose and illumination, and by imaging other samples of a subset of its materials in different settings [8].
The KTH-TIPS2 database took this a step further by imaging 4 different samples of 11 materials (samples 1, 2, 3 and 4), each under varying pose, illumination and scale. Examples from the 11 classes are displayed in Fig. 4. Each sample has 108 patches acquired under different imaging conditions. We collect 4 sets of images composed as follows:
1. training: sample 1; test: samples 2, 3, 4;
2. training: sample 2; test: samples 1, 3, 4;
3. training: sample 3; test: samples 1, 2, 4;
4. training: sample 4; test: samples 1, 2, 3.
The evaluation is performed on each single set independently and results are reported as average over the four results.
3.4 ALOT Database
The Amsterdam Library of Textures (ALOT) is a color image collection of 250 rough textures. In order to capture the sensory variation in object recordings, the authors systematically varied viewing angle, illumination angle, and illumination color for each material. This collection is similar in spirit to the CUReT collection [7]. Examples from the 250 classes are displayed in Fig. 5.
The textures were placed on a turntable, and recordings were made at aspect angles of 0\(^\mathrm{\circ }\), 60\(^\mathrm{\circ }\), 120\(^\mathrm{\circ }\), and 180\(^\mathrm{\circ }\). Four cameras were used: three perpendicular to the light bow at 0\(^\mathrm{\circ }\) azimuth and 80\(^\mathrm{\circ }\), 60\(^\mathrm{\circ }\), and 40\(^\mathrm{\circ }\) altitude, and one mounted at 60\(^\mathrm{\circ }\) azimuth and 60\(^\mathrm{\circ }\) altitude. Combined with five illumination directions and one semi-hemispherical illumination, a sparse sampling of the BTF is obtained.
Each object was recorded with only one out of five lights turned on, yielding five different illumination angles. Furthermore, turning on all lights yields a sort of hemispherical illumination, although restricted to a narrower illumination sector than a true hemisphere. Each texture was recorded at a 3075 K illumination color temperature, at which the cameras were white balanced. One image for each camera was also recorded with all lights turned on, at a reddish spectrum of 2175 K color temperature.
For each of the 250 classes we consider 6 patches obtained by dividing the original texture image into 6 non-overlapping squares of size 200 \(\times \) 200 pixels. For each class we have 100 textures acquired under different imaging conditions. For each texture we select three patches for training and three for testing by following a chessboard pattern (white position is indicated as W, black position as B). As a result, we obtain a training set made of 75000 images (W position) and a test set made of 75000 images (B position).
4 Experiments
In all the experiments we use the nearest neighbor classification strategy: given a patch in the test set, its distance with respect to all the training patches is computed. The prediction of the classifier is the class of the closest element in the training set. For this purpose, after some preliminary tests with several descriptors in which we evaluated the most common distance measures, we decided to use the L1 distance: \(d(\mathbf {x},\mathbf {y})=\sum _{i=1}^{\text {N}} | x_i - y_i |\), where \(\mathbf {x}\) and \(\mathbf {y}\) are two feature vectors. All the experiments are conducted under the maximum ignorance assumption, that is, no information about the imaging conditions of the test patches is available to the classification method or to the descriptors. Performance is reported as classification rate (i.e., the ratio between the number of correctly classified images and the number of test images).
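The classification protocol above fits in a few lines of numpy; the two-class "bark"/"fabric" vectors below are toy data for illustration only:

```python
import numpy as np

def l1_distance(x, y):
    """d(x, y) = sum_i |x_i - y_i|."""
    return np.abs(x - y).sum()

def nn_classify(test_feature, train_features, train_labels):
    """Nearest neighbor: predict the class of the closest training vector."""
    dists = [l1_distance(test_feature, t) for t in train_features]
    return train_labels[int(np.argmin(dists))]

# Toy feature vectors for two texture classes.
train = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, 0.1]])
labels = ["bark", "fabric", "bark"]
tests = np.array([[0.05, 0.02], [0.9, 1.1]])
truth = ["bark", "fabric"]

preds = [nn_classify(x, train, labels) for x in tests]
rate = np.mean([p == t for p, t in zip(preds, truth)])  # classification rate
print(preds, rate)  # ['bark', 'fabric'] 1.0
```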
Table 2 reports the performance obtained by all the visual descriptors evaluated on each single texture database. The list of the visual descriptors is ordered by their average performance over databases.
On average, learned features are clearly more powerful than hand-crafted ones. In particular, residual nets are the most powerful. Residual nets are expected to be more effective because they contain many more layers than the other networks and are designed to be more accurate on the image/object recognition task. The least powerful networks are the shallowest ones.
Amongst hand-crafted features, the most powerful are SIFT and some variants of LBP. On average, local descriptors perform better than global ones [10]. Moreover, gray-scale hand-crafted descriptors perform better than color ones (in Table 2 gray-scale descriptors are denoted with a black bullet).
Observing the results achieved on each single database, we find that in the case of no variations between training and test textures, the RGB histogram works better than learned descriptors. In the case of simultaneous variation of light temperature and direction, the best performing visual descriptor is LBP + LCC. This approach combines the power of gray-scale LBP with a measure of Local Color Contrast that is invariant with respect to rotations and translations of the image plane, and with respect to several transformations in the color space [20]. In the case of variations of image scale, light temperature, light direction and camera direction, the learned descriptors outperform hand-crafted ones.
5 Conclusions
In this work we focused on texture classification under variable imaging conditions. To this purpose, several hand-crafted and learned descriptors are evaluated on five state-of-the-art texture databases. Results show that learned descriptors, on average, perform better than hand-crafted ones. However, the results also show the limits of learned descriptors when the direction and the temperature of the light change simultaneously: in this case, SIFT and some variants of LBP perform better than learned descriptors. As future work, we plan to evaluate recent hybrid visual descriptors [2, 6, 11] that combine local and learned descriptors [12,13,14,15,16, 35].
Notes
1. Each feature vector is divided by its \(L^2\)-norm.
References
Barilla, M., Spann, M.: Colour-based texture image classification using the complex wavelet transform. In: 2008 5th International Conference on Electrical Engineering, Computing Science and Automatic Control, CCE 2008, pp. 358–363, November 2008
Bianco, S., Ciocca, G., Napoletano, P., Schettini, R., Margherita, R., Marini, G., Pantaleo, G.: Cooking action recognition with iVAT: an Interactive video annotation tool. In: Petrosino, A. (ed.) ICIAP 2013. LNCS, vol. 8157, pp. 631–641. Springer, Heidelberg (2013). doi:10.1007/978-3-642-41184-7_64
Bianco, S., Cusano, C., Napoletano, P., Schettini, R.: On the robustness of color texture descriptors across illuminants. In: Petrosino, A. (ed.) ICIAP 2013. LNCS, vol. 8157, pp. 652–662. Springer, Heidelberg (2013). doi:10.1007/978-3-642-41184-7_66
Bianconi, F., Fernández, A.: Evaluation of the effects of gabor filter parameters on texture classification. Pattern Recogn. 40(12), 3325–3335 (2007)
Bianconi, F., Harvey, R., Southam, P., Fernández, A.: Theoretical and experimental comparison of different approaches for color texture classification. J. Electron. Imaging 20(4), 043006 (2011)
Boccignone, G., Napoletano, P., Ferraro, M.: Embedding diffusion in variational Bayes: a technique for segmenting images. Int. J. Pattern Recogn. Artif. Intell. 22(05), 811–827 (2008)
Burghouts, G.J., Geusebroek, J.M.: Material-specific adaptation of color invariant features. Pattern Recogn. Lett. 30(3), 306–313 (2009)
Caputo, B., Hayman, E., Mallikarjuna, P.: Class-specific material categorisation. In: 2005 Tenth IEEE International Conference on Computer Vision, ICCV 2005, vol. 2, pp. 1597–1604. IEEE (2005)
Chatfield, K., Simonyan, K., Vedaldi, A., Zisserman, A.: Return of the devil in the details: delving deep into convolutional nets. arXiv preprint arXiv:1405.3531 (2014)
Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3606–3613 (2014)
Cimpoi, M., Maji, S., Kokkinos, I., Vedaldi, A.: Deep filter banks for texture recognition, description and segmentation. arXiv preprint arXiv:1507.02620 (2015)
Colace, F., Casaburi, L., De Santo, M., Greco, L.: Sentiment detection in social networks and in collaborative learning environments. Comput. Hum. Behav. 51, 1061–1067 (2015)
Colace, F., De Santo, M., Greco, L.: An adaptive product configurator based on slow intelligence approach. Int. J. Metadata Seman. Ontol. 9(2), 128–137 (2014)
Colace, F., De Santo, M., Greco, L., Napoletano, P.: Text classification using a graph of terms. In: 2012 Sixth International Conference on Complex, Intelligent and Software Intensive Systems (CISIS), pp. 1030–1035. IEEE (2012)
Colace, F., De Santo, M., Greco, L., Napoletano, P.: A query expansion method based on a weighted word pairs approach. In: Proceedings of the 3rd Italian Information Retrieval (IIR) 964, pp. 17–28 (2013)
Colace, F., De Santo, M., Greco, L., Napoletano, P.: Weighted word pairs for query expansion. Inf. Process. Manage. 51(1), 179–193 (2015)
Csurka, G., Dance, C., Fan, L., Willamowski, J., Bray, C.: Visual categorization with bags of keypoints. In: Workshop on Statistical Learning in Computer Vision, ECCV, Prague, vol. 1, pp. 1–2 (2004)
Cusano, C., Napoletano, P., Schettini, R.: Illuminant invariant descriptors for color texture classification. In: Tominaga, S., Schettini, R., Trémeau, A. (eds.) CCIW 2013. LNCS, vol. 7786, pp. 239–249. Springer, Heidelberg (2013). doi:10.1007/978-3-642-36700-7_19
Cusano, C., Napoletano, P., Schettini, R.: Intensity and color descriptors for texture classification. In: Proceedings of the SPIE Image Processing: Machine Vision Applications VI, SPIE, vol. 8661, pp. 866113–866113-11 (2013)
Cusano, C., Napoletano, P., Schettini, R.: Combining local binary patterns and local color contrast for texture classification under varying illumination. JOSA A 31(7), 1453–1461 (2014)
Cusano, C., Napoletano, P., Schettini, R.: Local angular patterns for color texture classification. In: Murino, V., Puppo, E., Sona, D., Cristani, M., Sansone, C. (eds.) ICIAP 2015. LNCS, vol. 9281, pp. 111–118. Springer, Cham (2015). doi:10.1007/978-3-319-23222-5_14
Cusano, C., Napoletano, P., Schettini, R.: Remote sensing image classification exploiting multiple kernel learning. IEEE Geosci. Remote Sens. Lett. 12(11), 2331–2335 (2015)
Cusano, C., Napoletano, P., Schettini, R.: Combining multiple features for color texture classification. J. Electron. Imaging 25(6), 061410 (2016)
Cusano, C., Napoletano, P., Schettini, R.: Evaluating color texture descriptors under large variations of controlled lighting conditions. J. Opt. Soc. Am. A 33(1), 17–30 (2016)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
Grauman, K., Leibe, B.: Visual Object Recognition, No. 11. Morgan & Claypool Publishers (2010)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Junior, O.L., Delgado, D., Gonçalves, V., Nunes, U.: Trainable classifier-fusion schemes: an application to pedestrian detection. In: Intelligent Transportation Systems (2009)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp. 1097–1105 (2012)
Lowe, D.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60(2), 91–110 (2004)
Mäenpää, T., Pietikäinen, M.: Classification with color and texture: jointly or separately? Pattern Recogn. 37(8), 1629–1640 (2004)
Manjunath, B.S., Ma, W.Y.: Texture features for browsing and retrieval of image data. IEEE Trans. Pattern Anal. Mach. Intell. 18(8), 837–842 (1996)
Mirmehdi, M., Xie, X., Suri, J.: Handbook of Texture Analysis. Imperial College Press, London (2008)
Napoletano, P.: Visual descriptors for content-based retrieval of remote sensing images. arXiv preprint arXiv:1602.00970 (2016)
Napoletano, P., Boccignone, G., Tisato, F.: Attentive monitoring of multiple video streams driven by a Bayesian foraging strategy. IEEE Trans. Image Process. 24(11), 3266–3281 (2015)
Novak, C.L., Shafer, S., et al.: Anatomy of a color histogram. In: 1992 Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 1992, pp. 599–605. IEEE (1992)
Ojala, T., Mäenpää, T., Pietikäinen, M., Viertola, J., Kyllönen, J., Huovinen, S.: Outex-new framework for empirical evaluation of texture analysis algorithms. In: 16th International Conference on Pattern Recognition, vol. 1, pp. 701–706 (2002)
Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 971–987 (2002)
Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Comput. Vision 42(3), 145–175 (2001)
Pietikainen, M., Nieminen, S., Marszalec, E., Ojala, T.: Accurate color discrimination with classification based on feature distributions. In: 1996 Proceedings of the 13th International Conference on Pattern Recognition, vol. 3, pp. 833–838, August 1996
Razavian, A.S., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: an astounding baseline for recognition. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 512–519 (2014)
Rui, Y., Huang, T.S., Chang, S.F.: Image retrieval: current techniques, promising directions, and open issues. J. Vis. Commun. Image Represent. 10(1), 39–62 (1999)
Schmidhuber, J.: Deep learning in neural networks: an overview. Neural Netw. 61, 85–117 (2015)
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: OverFeat: integrated recognition, localization and detection using convolutional networks. In: International Conference on Learning Representations (ICLR 2014), CBLS, April 2014
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Sivic, J., Zisserman, A.: Video Google: a text retrieval approach to object matching in videos. In: 2003 Proceedings of the Ninth IEEE International Conference on Computer Vision, pp. 1470–1477. IEEE (2003)
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
Tsai, C.F.: Bag-of-words representation in image annotation: a review. ISRN Artif. Intell. 2012 (2012)
Vedaldi, A., Lenc, K.: MatConvNet - convolutional neural networks for MATLAB. CoRR abs/1412.4564 (2014)
Veltkamp, R., Burkhardt, H., Kriegel, H.P.: State-of-the-Art in Content-based Image and Video Retrieval, vol. 22. Springer Science & Business Media, Heidelberg (2013)
Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 818–833. Springer, Cham (2014). doi:10.1007/978-3-319-10590-1_53
© 2017 Springer International Publishing AG
Napoletano, P. (2017). Hand-Crafted vs Learned Descriptors for Color Texture Classification. In: Bianco, S., Schettini, R., Trémeau, A., Tominaga, S. (eds) Computational Color Imaging. CCIW 2017. Lecture Notes in Computer Science(), vol 10213. Springer, Cham. https://doi.org/10.1007/978-3-319-56010-6_22