Abstract

With the rapid development of computing technologies and data mining methods, image annotation has attracted much attention in smart agriculture. However, the semantic gap between labels and images poses great challenges to image annotation in agriculture, due to label imbalance and the difficulty of understanding the obscure relationships between images and labels. In this paper, an image annotation method based on graph learning is proposed to annotate images accurately. Specifically, inspired by nearest neighbor theory, a semantic neighbor graph is introduced to generate preannotations that balance unbalanced labels. Then, the correlations between labels and images are modeled by the random dot product graph to mine deep semantics. Finally, we perform experiments on two image sets. The experimental results show that our method outperforms previous methods, which verifies the effectiveness of the proposed model.

1. Introduction

With the rapid development of computing technologies and data mining methods, smart agriculture has attracted much attention, since it can greatly increase crop yields by effectively recommending methods to control pests [1, 2]. For example, the internet of vehicles with task scheduling [3, 4] can help farmers harvest crops automatically, and content-based crop image retrieval can help producers keep track of plant growth in real time, which contributes to developing disease control and production plans. Meanwhile, with technological advancement, the form of crop monitoring is also undergoing tremendous changes, posing great challenges to current machine learning-based methods [5–9], because the collected data are of high volume, high velocity, high value, and high variety [10, 11]. Thus, mining patterns from data in smart agriculture requires novel methods.

Image annotation, a typical method for image analysis in agricultural big data, predicts labels that match the content of a given image [12]. In recent years, a large number of researchers have done extensive research on image annotation [13, 14]. For example, to reduce the semantic gap between visual features and text features, some researchers have proposed generative models, which formulate image annotation as a joint likelihood distribution between images and labels. Nevertheless, the generative method uses only image-label correlations, ignoring the relations among images. To exploit relations among images, discriminative models were proposed, focusing on finding differences between images. Typically, such a method trains a classifier to predict image labels, but the balance of sample labels has a large impact on model performance. At the same time, some researchers proposed graph models that utilize all the data to build the intrinsic structure over unlabeled and annotated images. Also, the nearest neighbor model is used to construct a label propagation graph, based on the theory that similar images share common labels [15, 16]. However, this method pays too much attention to correlations between images, ignoring image differences.

To solve these problems, a nearest neighbor graph model is proposed in this paper, which combines the advantages of graph theory and nearest neighbor theory. Specifically, the semantic neighbors of a test image under each label are first searched to construct the semantic neighbor graph. Then, a preannotation score is obtained by graph learning over the semantic neighbor graph, considering the relationships between images. The preannotation of the semantic neighbor graph can effectively solve the label imbalance problem, increasing the annotation probability of rare labels and suppressing high-frequency labels.

Next, the relationships between labels are used to improve the accuracy of image annotation. Previous work simply calculated the cooccurrence probability between labels without considering the imbalance of cooccurrence between labels. For example, "sea" and "ship" are likely to appear in the same picture, and the two labels are strongly related. However, the probability of "sea" appearing in "ship" images may be greater than that of "ship" appearing in "sea" images, because "sea" is associated with more things, such as "fish" and "coral." To address this label imbalance, the random dot product graph is used to mine deep associations between labels. After that, the visual differences that lower the similarity between semantically similar images are used to further improve the performance of the proposed method. Then, the naive Bayes nearest neighbor (NBNN) classifier is used to establish a joint likelihood between images and labels because of its simplicity and efficiency. Finally, the proposed method is evaluated on Corel 5K and IAPR TC12, and the results show an obvious improvement in terms of label recall. The main contributions of this paper are as follows:

(i) To effectively solve the label imbalance problem, semantic neighbor graph learning is proposed to generate preannotations based on nearest neighbors, where all labels are included in the initial label candidates.

(ii) To mine deep associations between labels, the random dot product graph is proposed, balancing the distributions of cooccurrence of paired labels.

The remainder of this paper is organized as follows: Section 2 introduces related work on image annotation. Section 3 presents our image annotation framework and its concrete implementation. The datasets, experimental settings, and results are described in Section 4. The paper is concluded in Section 5.

2. Related Work

Image annotation has been a research hotspot attracting increasing attention. Many fields are related to it, and they can benefit from each other's progress. For example, the internet of vehicles [17, 18] can provide many images to be annotated, and better-annotated images can in turn be used to train recognition models for better driving. Thus, a large number of researchers have applied many kinds of methods to image annotation in recent years. These methods can be divided into four classes: generating models, discriminating models, graph learning models, and nearest neighbor models.

2.1. Generating Model

To solve the image annotation problem, some scholars proposed mixture models, which are one kind of generating model. For example, Jeon et al. proposed the cross-media relevance model (CMRM) [19]. In this method, an image is segmented into several blobs, which are then clustered, and the probability between words and images is calculated by maximum likelihood estimation. However, this method is affected by the clustering of image features. Therefore, the continuous relevance model (CRM) [20] was proposed by Lavrenko et al., which uses continuous image features. This method calculates the probability of a label word using a multinomial distribution, but it needs to store a large kernel matrix, resulting in a heavy computational burden.

To address the mixture model's "visual ambiguity" problem, that visual similarity does not imply semantic similarity, researchers proposed topic models. A topic model can be thought of as a mixture model in which a particular topic is used to portray the relationship between the image and the label. For example, Barnard et al. proposed a method modeling multimodal cooccurrence [21]. This method introduces several topic variables and attempts to find the relation between labels and visual features through probability, but it is sensitive to model initialization. Blei et al. presented the LDA method [22], which uses the Dirichlet distribution when choosing topics and words. However, the topic model is complex and has too many parameters, so it is not suitable for large-scale datasets.

2.2. Discriminating Model

To solve the problems of the generating model, some researchers proposed the discriminating model, which treats image annotation as multilabel classification. This approach trains a classifier for each label, then determines which labels an image belongs to using the classifiers. For instance, Carneiro et al. proposed SML [23], which establishes a relationship between semantic labels and semantic classes. This method does not need to segment the image in advance, but it requires well-balanced classes and does not consider relationships between labels. Sun et al. [24] used sparse factor representation to derive a sparse structure based on label dependency, weakening the negative effect caused by label imbalance. But this method considers neither the potential relationships between images and labels nor the lack of high-quality image datasets.

2.3. Graph-Based Learning Model

To address the issue of insufficient labeled images, some investigators put forward the graph learning model. The graph learning model is a semisupervised learning model, which uses labeled and unlabeled images to create a graph, then uses the Laplacian matrix to transfer labels. Liu et al. proposed the nearest spanning chain (NSC) [25]. In this method, a graph algorithm is used to transfer labels, but the relationship between images and labels is not taken into account. Therefore, Su and Xue proposed GLKNN [26], which considers the cooccurrence relationship between labels when initializing graph weights. However, they disregard the fact that the cooccurrence relationship is unbalanced. This graph model only considers visual features and neglects the problem of "visual ambiguity." Meanwhile, on a large image dataset, this model has high time complexity and poor annotation performance.

2.4. Nearest Neighbor Model

Because the nearest neighbor model performs better under big data conditions, it has attracted more and more researchers. This model transforms the image annotation problem into an image retrieval problem. First, it searches for images that are highly similar to the unlabeled image, then labels the unlabeled image by means of label transmission. For example, Guillaumin et al. proposed a method based on weighted KNN called TagProp [27], in which the labeled probability of lower-frequency labels is increased and that of higher-frequency labels is suppressed. Verma and Jawahar put forward 2PKNN [28], which uses image distance metric learning to adjust the weights of different visual features so that relationships between visual features are more consistent with relationships between image semantics. CCAKNN [29] aims to obtain the image subset of each semantic label, mapping two features into the same subspace and modeling the visual features with a Bayesian probability model. However, the nearest neighbor model only uses the similarity between images and ignores the differences between image samples.

3. Our Approach

A new image annotation framework is proposed on the basis of graph learning, which is composed of three steps. First, we propose the nearest neighbor graph based on the principle that similar images share labels, to obtain preannotation results. Next, the association between labels is used to improve the accuracy of image annotation by the random dot product graph, which deeply mines the internal association of labels to increase probabilities of labeling weak labels. Finally, the naive Bayes nearest neighbor classifier is used to calculate the distance between images and labels. The main process of the proposed method is shown in Algorithm 1:

Framework of the proposed method.
Input: images
Output: predicted labels
1: Find the best nearest neighbor images using the improved (common-neighbor) nearest neighbor selection.
2: Construct a similarity matrix $W$ through Eq. (1).
3: Mine the deep relationships between images, using the random dot product graph (RDPG) to reconstruct $W$ as $P = XX^T$.
4: Iterate Eq. (3) to convergence to obtain the image-based score $s_1$.
5: Build the semantic (label transfer) matrix $T$ through Eq. (10).
6: Consider the effect of label-label associations on the annotation results, obtaining the word-based score $s_2$.
7: Consider the relationship between images and labels, obtaining the image-to-word score $s_3$ through Eq. (11).
8: Return the final score of each label via Eq. (13).
Algorithm 1.
3.1. RDPG-Based Image Graph for Image Annotation

Let $\mathcal{I}$ be a collection of images, $\mathcal{L} = \{w_1, \dots, w_m\}$ be a set of labels, and the training set be denoted by $T = \{(x_i, y_i)\}$, composed of each labeled image $x_i$ and its corresponding label set $y_i$, represented as a binary vector: $y_{ij} = 1$ if the $i$th image is labeled with the $j$th label; otherwise, $y_{ij} = 0$. To solve the problem of label imbalance, the nearest neighbor graph is constructed based on the neighbor image sets.

For a given image $x$ to be labeled, its neighbor images constitute a set $N(x)$. We select a set of nearest neighbor images for each label based on the visual distance between images. The main idea is that similar images have a high probability of passing labels to each other. The traditional approach finds semantic nearest neighbors using a weighted combination of multiple visual distances, without considering that the probability of two images being neighbors of each other differs. As a result, the nearest neighbors of an unlabeled image may contain noise images, which introduces noisy labels and decreases annotation accuracy. Due to the complex distribution of visual features, some images in the dataset have a higher probability of being selected as neighbor images, some are less likely to be selected, and others may never be selected. Moreover, in practice, the nearest neighbor relationship between images is not symmetric: image $x_a$ may be a nearest neighbor of image $x_b$, while $x_b$ is not a nearest neighbor of $x_a$, which degrades the accuracy of conventional nearest neighbor selection. Therefore, we propose a novel way to select the nearest neighbor images.

We propose a method based on common neighbor images and use it to select the nearest neighbors of the test image, reducing noisy labels. Our method first sorts the images of each label according to visual distance and selects the first $K$ images as candidates. It then finds the $K$ nearest neighbors of each candidate. Sorting the candidates by the number of neighbors they share with the test image, the top candidates are selected as the neighbor images of the test image, as shown in the sketch below. Nearest neighbor images selected in this way are more consistent with actual image similarity, and the number of images semantically related to the test image is increased. As a result, the possibility of introducing noise images is reduced and annotation accuracy improves.
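To make the selection concrete, here is a minimal Python sketch that re-ranks the visually nearest candidates of a test image by the number of $K$-nearest-neighbor list entries they share with it. The function name, $K$, and `top` are illustrative assumptions, not the paper's exact parameters:

```python
import numpy as np

def common_neighbor_select(dist, test_idx, k=20, top=5):
    """Re-rank the k visually closest candidates of the test image by
    the number of k-nearest neighbors they share with it (a sketch)."""
    # k nearest neighbors of every image by visual distance (skip self at column 0).
    knn = np.argsort(dist, axis=1)[:, 1:k + 1]
    candidates = knn[test_idx]
    test_nn = set(knn[test_idx])
    # Count common neighbors between the test image and each candidate.
    common = np.array([len(test_nn & set(knn[c])) for c in candidates])
    order = np.argsort(common)[::-1]
    return candidates[order[:top]]
```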

We assume a simple graph $G = (V, E)$, where $V$ is the vertex set representing the images in $N(x)$ and $E$ is the edge set representing relationships between pairs of images. The weight of an edge is the similarity of the two images it connects.

The principle of the graph-based learning method is semisupervised learning. This method uses the image features and annotation information of the training data. It then iterates over the similarity matrix of the training data and passes appropriate semantic labels from labeled images to unlabeled images based on this similarity, yielding the preliminary result of the first step.

The details of this method are as follows:

Step 1. Construct a similarity matrix $W$ over the node set as
$$W_{ij} = \exp\left(-\frac{d(x_i, x_j)}{\sigma}\right), \quad (1)$$
where $d(\cdot, \cdot)$ is a distance measure, and $W_{ii} = 0$ because there is no self-loop in the graph.

Step 2. Symmetrically normalize $W$ by
$$S = D^{-1/2} W D^{-1/2}, \quad (2)$$
where $D$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$.

Step 3. Iterate according to Eq. (3) until convergence:
$$F(t+1) = \alpha S F(t) + (1 - \alpha) Y, \quad (3)$$
where $t$ is the iteration index and $\alpha$ is the propagation parameter.

Step 4. Label the unlabeled images according to the convergence matrix $F^*$.
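A minimal NumPy sketch of Steps 1–4, assuming the standard label propagation iteration as reconstructed above; the parameter values (`alpha`, `sigma`) and function names are illustrative only:

```python
import numpy as np

def label_propagation(dist, Y, alpha=0.5, sigma=1.0, tol=1e-6, max_iter=200):
    """Graph-based label propagation over the neighbor set (Steps 1-4).
    dist : (n, n) pairwise visual distances; Y : (n, m) binary label matrix."""
    # Step 1: similarity matrix with no self-loops (Eq. (1)).
    W = np.exp(-dist / sigma)
    np.fill_diagonal(W, 0.0)

    # Step 2: symmetric normalization S = D^{-1/2} W D^{-1/2} (Eq. (2)).
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_inv_sqrt @ W @ D_inv_sqrt

    # Step 3: iterate F(t+1) = alpha * S * F(t) + (1 - alpha) * Y (Eq. (3)).
    F = Y.astype(float).copy()
    for _ in range(max_iter):
        F_next = alpha * (S @ F) + (1.0 - alpha) * Y
        if np.abs(F_next - F).max() < tol:
            return F_next
        F = F_next

    # Step 4: rows of F give the label scores for the unlabeled images.
    return F
```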

Through the above steps, we finally obtain the label scores and their ranking. From the above discussion, there are two key parts of the graph-based learning method: the similarity graph $W$ and the initial state representation $Y$. $W$ describes the similarity between the test image and its nearest neighbor images, which provides the basis for label transmission.

Thus, the construction of the similarity graph is very important. In traditional graph-based image annotation methods, the weight of the edge between two vertices (images) directly uses the visual distance. However, because of "visual ambiguity," this may ignore hidden relationships between images. Therefore, unlike previous work, we use the random dot product graph to discover hidden relationships.

The random dot product graph is a node-edge random graph model. For each node $v_i$ in the node set $V$, a $d$-dimensional vector $x_i$ is selected randomly and uniformly from the $d$-dimensional unit space as the assignment of $v_i$. The probability of an edge between each pair of nodes $v_i$, $v_j$ is
$$p_{ij} = \langle x_i, x_j \rangle = x_i^T x_j. \quad (4)$$

This probability is used for generating a random dot product graph under the assignment $X = [x_1, \dots, x_n]^T$.
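For illustration, the following sketch samples a small random dot product graph under the reconstructed Eq. (4); the scaling of the latent vectors (so that all inner products stay in $[0, 1]$) is our assumption:

```python
import numpy as np

def sample_rdpg(n, d, seed=None):
    """Sample an undirected random dot product graph (a minimal illustration).
    Each node gets a d-dimensional latent vector x_i; an edge between nodes
    i and j appears with probability p_ij = <x_i, x_j> (Eq. (4))."""
    rng = np.random.default_rng(seed)
    # Latent positions scaled so every inner product lies in [0, 1].
    X = rng.uniform(0, 1, size=(n, d)) / np.sqrt(d)
    P = X @ X.T                          # edge probabilities
    A = (rng.uniform(size=(n, n)) < P).astype(int)
    A = np.triu(A, 1)
    A = A + A.T                          # undirected, no self-loops
    return X, P, A
```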

The two main properties of random dot product graph are as follows:

Property 1. Clustering: the edges of a random dot product graph appear with unequal probabilities, exhibiting obvious clustering characteristics.

Property 2. Transitivity: if two nodes both have strong connections with a third node, then the two nodes should also be strongly correlated directly. Conversely, if two nodes share no associated third node, then the probability that the two nodes are related should be small.

Each edge in the random graph appears randomly and independently. Following the Bernoulli distribution, the random dot product graph generates the edge set according to the probabilities $p_{ij}$ to obtain an observation graph. If the observation graph is an undirected weighted graph with adjacency matrix $A$, then
$$P(A \mid X) = \prod_{i < j} p_{ij}^{A_{ij}} (1 - p_{ij})^{1 - A_{ij}}. \quad (5)$$

Its log likelihood function is
$$\log P(A \mid X) = \sum_{i < j} \left[ A_{ij} \log p_{ij} + (1 - A_{ij}) \log (1 - p_{ij}) \right]. \quad (6)$$

In the observation graph, the probability of an edge reflects the correlation between its nodes. It can be seen from Equation (6) that when the likelihood is maximal, the probability of each edge matches the corresponding weight as closely as possible. By duality, maximizing the likelihood is equivalent to minimizing
$$\sum_{i < j} (A_{ij} - p_{ij})^2, \quad (7)$$
where $p_{ij} = x_i^T x_j$.

Therefore, the objective function is expressed as
$$\hat{X} = \arg\min_X \sum_{i < j} \left(A_{ij} - x_i^T x_j\right)^2, \quad (8)$$
where $X$ is the random assignment of nodes and the edge probability takes the inner product form; the right-hand side of Equation (8) is the Frobenius norm of the residual matrix, so it can be written as $\|A - XX^T\|_F^2$.

Based on the above principles, we have the following algorithm.

Random dot product method for simple graphs.
Input: the weight matrix $W$ of the image data graph.
Output: the weight matrix $P$ of the random dot product reconstruction.
1: Take an all-zero matrix $P$.
2: Find the spectral decomposition of $W$ (with its diagonal entries filled from the current $P$): $W = Q \Lambda Q^T$.
3: $Q_d$ is the matrix of the $d$ largest eigenvectors, and $\Lambda_d$ is the diagonal matrix composed of the $d$ largest eigenvalues, where each negative eigenvalue is changed to 0.
4: $X = Q_d \Lambda_d^{1/2}$.
5: Return to step 2 until $X$ converges.
6: Calculate $P = XX^T$; return to step 1 until $P$ converges. $P$ is the edge probability matrix after random reconstruction, where $P_{ij} = x_i^T x_j$.
Algorithm 2.
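The following Python sketch implements one plausible reading of Algorithm 2, assuming the standard diagonal-augmentation iteration for fitting dot product representations; the convergence tolerance and iteration cap are illustrative:

```python
import numpy as np

def rdpg_reconstruct(W, d, n_iter=50, tol=1e-6):
    """Spectral RDPG reconstruction of a weight matrix (Algorithm 2 sketch),
    minimizing ||W - X X^T||_F over rank-d nonnegative-spectrum factors."""
    A = W.astype(float).copy()
    P = np.zeros_like(A)                   # step 1: all-zero starting matrix
    for _ in range(n_iter):
        np.fill_diagonal(A, np.diag(P))    # refill diagonal from current P (assumed)
        vals, vecs = np.linalg.eigh(A)     # step 2: spectral decomposition (ascending)
        vals, vecs = vals[-d:], vecs[:, -d:]  # step 3: keep the d largest eigenpairs
        vals = np.maximum(vals, 0.0)          # negative eigenvalues -> 0
        X = vecs * np.sqrt(vals)              # step 4: X = Q_d * Lambda_d^{1/2}
        P_new = X @ X.T                       # step 6: reconstructed edge probabilities
        if np.abs(P_new - P).max() < tol:     # steps 5-6: iterate until convergence
            return P_new
        P = P_new
    return P
```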

Based on the above method, for given nodes $v_i$ and $v_j$, the improved edge weight is expressed through the reconstructed edge probability:
$$\tilde{w}_{ij} = P_{ij} = x_i^T x_j. \quad (9)$$

The random dot product graph improves the weights of the similarity matrix. With this improvement of the nearest neighbor graph, we pay more attention to the internal relations between images. In this way, the weak label problem can be effectively alleviated.

3.2. Word-Based Graph Learning

The frequencies of labels in an image dataset differ. Low-frequency labels are easily ignored during annotation, which decreases annotation accuracy. In previous work, the semantic cooccurrence between labels was usually used to address this problem. However, there is a cooccurrence imbalance between labels, which prevents a significant improvement for low-frequency labels. Using the transitivity of the random dot product graph, we reconstruct the association graph of label words and find the inherent hidden relationships between labels. The random dot product graph can obtain the relationship between any pair of annotation words: the probability of common semantic relations is large and that of uncommon semantic relations is small, which is consistent with real semantic relationships.

In the label set $\mathcal{L}$, we record the transition probability from label $w_i$ to label $w_j$ as
$$P(w_j \mid w_i) = \frac{n_{ij}}{n_i}, \quad (10)$$
where $n_{ij}$ represents the number of cooccurrences of labels $w_i$ and $w_j$ and $n_i$ is the frequency of $w_i$. In this paper, we abbreviate $P(w_j \mid w_i)$ to $t_{ij}$. Because of semantic cooccurrence imbalance, $t_{ij}$ is not equal to $t_{ji}$.

We first obtain the transfer matrix $T = (t_{ij})$ between labels according to Equation (10). $T$ is reconstructed by the random dot product graph to obtain $\tilde{T}$. Substituting the transfer matrix $\tilde{T}$ and the score matrix obtained by the graph learning of Section 3.1 into Equation (3), we iterate to obtain the word-based result $s_2$.
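As an illustration of the asymmetric transfer matrix of Eq. (10), the following sketch builds $T$ from a binary annotation matrix; the notation follows our reconstruction, not necessarily the authors' exact implementation:

```python
import numpy as np

def label_transfer_matrix(Y):
    """Asymmetric label-to-label transfer matrix (Eq. (10) as reconstructed).
    Y : (n_images, n_labels) binary annotation matrix.
    T[i, j] = P(w_j | w_i) = n_ij / n_i, so in general T[i, j] != T[j, i]:
    P(sea | ship) need not equal P(ship | sea)."""
    cooc = Y.T @ Y                         # n_ij: cooccurrence counts
    n_i = np.maximum(np.diag(cooc), 1)     # n_i: frequency of each label
    T = cooc / n_i[:, None]
    np.fill_diagonal(T, 0.0)               # ignore self-transitions
    return T
```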

3.3. Image to Word Relation

This relationship can be regarded as the possibility of an image producing a label. In most cases, it can be estimated on a training set under some hypothetical distribution. In many methods, the image is clustered and divided into several "blobs," with each "blob" corresponding to a label word. However, during clustering, errors arise when low-level features are similar but the actual contents differ, making the blobs themselves wrong. In this paper, the image-to-word distance is calculated with the naive Bayes nearest neighbor (NBNN) classifier [30] for image classification. This method is simple and performs well. At the same time, it calculates the association between the whole image and the annotation label, avoiding wrong correspondences between "blobs" and labels.

The features of an image are recorded as $x$, and $I_w$ represents the collection of training images annotated with label $w$. The image-to-word score is defined as
$$s_3(x, w) = \frac{1}{|I_w|} \sum_{x_j \in I_w} K(x, x_j), \quad (11)$$
where $|I_w|$ is the number of images in $I_w$ and $K$ is the Gaussian kernel function:
$$K(x, x_j) = \exp\left(-\frac{\|x - x_j\|^2}{2\sigma^2}\right). \quad (12)$$
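A minimal sketch of the kernel-based image-to-word score as reconstructed in Eqs. (11)–(12); the dictionary layout and the value of `sigma` are assumptions:

```python
import numpy as np

def image_to_word_scores(x, features_by_label, sigma=1.0):
    """Gaussian-kernel image-to-word score for every label (a sketch).
    x : feature vector of the test image.
    features_by_label : dict mapping each label w to the (k, dim) array of
    feature vectors of the training images annotated with w."""
    scores = {}
    for w, feats in features_by_label.items():
        diff = feats - x                                        # (k, dim)
        k_vals = np.exp(-np.sum(diff**2, axis=1) / (2 * sigma**2))
        scores[w] = k_vals.mean()                               # average over I_w
    return scores
```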

3.4. Combination of Three Scores

Finally, we combine the two graph-learning-based scores with the image-to-word score to get the final score, which is the basis for the final labels:
$$\mathrm{score}(x, w) = \lambda_1 s_1(x, w) + \lambda_2 s_2(x, w) + \lambda_3 s_3(x, w), \quad (13)$$
where $s_1$ is the score based on associations between images, $s_2$ is the probability that the image is labeled with the label based on associations between labels, and $s_3$ is the image-to-word score. In addition, $\lambda_1 + \lambda_2 + \lambda_3 = 1$.
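A sketch of the final combination under the reconstructed Eq. (13); the weights in `lambdas` are placeholders, since the paper does not state their values here:

```python
import numpy as np

def combine_scores(s1, s2, s3, lambdas=(0.4, 0.3, 0.3), top=5):
    """Weighted combination of the three scores (Eq. (13) as reconstructed).
    s1, s2, s3 : (n_labels,) arrays -- image-graph, word-graph, and
    image-to-word scores; lambdas must sum to 1."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    final = l1 * s1 + l2 * s2 + l3 * s3
    return np.argsort(final)[::-1][:top]   # indices of the top-5 labels
```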

4. Experiment

In this section, we introduce the two datasets used in the experiments and the feature extraction applied to them. The evaluation metrics for the image annotation methods are also given.

4.1. Datasets

During the experiment, we used two datasets. Table 1 shows the statistics of these datasets.

Corel 5K [31]. This dataset contains 4,500 training images and 499 test images. It is divided into 50 themes, each with 100 images except the last. The dataset contains 260 labels, and each image is manually labeled with 1-5 different labels, with an average of 3.4.

IAPR TC12 [32]. This dataset contains 19,627 images, of which 17,665 are training images and 1,962 are test images. It contains a total of 291 tags, and each image has an average of 5.836 tags.

4.2. Feature

The first step in our approach is feature extraction, which is very important: it has a profound impact on the performance of image annotation systems. Recently, CNNs have been widely applied to image feature extraction. Compared with using 15 handcrafted features, it is not necessary to use metric learning to determine the optimal weight of each feature, so parameters are easier to determine. We use a CNN to extract a single feature representation instead of handcrafted features, which effectively reduces the number of features and improves system accuracy.

4.3. Evaluation Metrics

In our experiments, we use the same evaluation protocol as [33] to effectively evaluate and compare our method with previous methods. In our approach, we give each image five labels. Then, we calculate the labeling precision and recall for each label over the test set. Suppose a label $w$ marks $N_g$ images in the ground truth, the number of images marked as $w$ during testing is $N_p$, and the number of correct marks is $N_c$. The precision of the label is $P(w) = N_c / N_p$ and its recall is $R(w) = N_c / N_g$. These values are computed for each label and then averaged to obtain the average precision $\bar{P}$ and average recall $\bar{R}$. We define the score combining $\bar{P}$ and $\bar{R}$ as $F_1 = 2\bar{P}\bar{R} / (\bar{P} + \bar{R})$, and $N^+$ as the number of labels that are correctly assigned at least once, which indicates the ability of our proposed method to solve the class imbalance and weak label problems.
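The metrics above can be computed as in the following sketch, assuming binary prediction and ground-truth matrices with five predicted labels per image; the clamping of empty denominators to 1 is our choice:

```python
import numpy as np

def evaluate(pred, truth):
    """Per-label precision/recall averaged over labels, plus F1 and N+.
    pred, truth : (n_images, n_labels) binary 0/1 matrices."""
    tp = np.sum(pred & truth, axis=0).astype(float)   # correct marks per label
    n_pred = np.maximum(pred.sum(axis=0), 1)          # times each label was predicted
    n_true = np.maximum(truth.sum(axis=0), 1)         # ground-truth frequency per label
    P = (tp / n_pred).mean()                          # average precision
    R = (tp / n_true).mean()                          # average recall
    F1 = 2 * P * R / (P + R) if (P + R) > 0 else 0.0
    n_plus = int(np.sum(tp > 0))                      # labels recalled at least once
    return P, R, F1, n_plus
```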

5. Result

In this section, we describe the performance of the proposed method compared with previously proposed methods. Table 2 gives the experimental results on the Corel 5K and IAPR TC12 datasets. The table shows that our method outperforms previous work. On Corel 5K, our accuracy is the second highest, and our number of recalled tags is the highest. Detailed results and analysis are presented in the following paragraphs.

It is worth noting that we selected several nearest neighbor-based methods for comparison. As shown in Table 2, our method performs better than JEC in all respects. Compared with 2PKNN, our recall and $F_1$ values are also much higher on the Corel 5K dataset, and our RDPGKNN is superior to TagProp. The comparison with these methods shows that the graph learning method has unique advantages in the field of image annotation and proves the validity and rationality of propagating labels by graph learning.

We also compare RDPGKNN with graph-based learning algorithms, and the results show that our approach is generally better than previous work. Since most graph learning algorithms are applied to small vocabularies, few graph-based image annotation methods have been evaluated on Corel 5K and similar datasets, so we mainly choose TGLM and GLKNN. In comparison with TGLM, the experimental results show that our method is clearly superior on Corel 5K. This shows the advantage of the nearest neighbor method, which effectively solves the label imbalance problem so that each annotation word has the opportunity of being selected. At the same time, compared with GLKNN, our $N^+$ shows a significant improvement, because we consider label cooccurrence asymmetry. Using graph-based learning to calculate the label transition probability maximizes the selection probability of low-frequency tags, provides more appropriate weights for transfer between tags, and improves the performance of the image tagging system.

On IAPR TC12, our algorithm also performs excellently. Compared with previous work, the RDPGKNN method recalls the most labels. Our recall rate is second only to that of the CCAKNN method, and the recall rate is greatly improved while the precision does not drop much. Compared with the graph-based GLKNN, the recall rate of our method also increases by 2%. This further confirms the need to consider the cooccurrence imbalance problem. Figure 1 shows some annotation examples of our method on the two datasets. Black marks indicate labels that appear both in the ground truth and in the RDPGKNN annotations, and red marks indicate labels that do not appear in the ground truth. It should be noted that some images in the dataset have fewer than five ground-truth labels, whereas our method always outputs five labels.

After comparing with all methods, we find that our method effectively increases the value of $N^+$. This shows that, compared with traditional methods, our method has strong recall performance while the other metrics remain almost unchanged. The problem that some labels cannot be selected due to unbalanced label cooccurrence is thereby solved.

6. Conclusion

In this paper, a reconstruction graph learning model is proposed for image annotation in smart agriculture. To solve the weak label problem, a nearest neighbor graph learning model is proposed to obtain prelabels. Meanwhile, for the cooccurrence imbalance between labels, the random dot product graph is used to explore the intrinsic links between labels. Extensive experiments on Corel 5K and IAPR TC12 are conducted, and the results show that the recall of our method is much higher than that of previous graph-based learning methods, while our precision and recall are basically on par with the latest methods. In the future, we will focus on the computational complexity of the proposed method and the deep correlation between labels and images in the annotation process.

Data Availability

The datasets used in this paper are public datasets, which can be accessed at the following websites: Corel 5K: https://rdrr.io/cran/mldr.datasets/man/corel5k.html and IAPR TC12: http://www-i6.informatik.rwth-aachen.de/imageclef/resources/iaprtc12.tgz.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This work was supported by “the Fundamental Research Funds for the Central Universities”, No. DUT20LAB136.