1 Introduction

Recent advances in Information and Communication Technology (ICT) aim at tackling some of the most important challenges facing agriculture today [5]. Supporting the world’s current food needs through sustainable agriculture, without compromising future generations, is a great challenge. Among the topics surrounding sustainable agriculture, how to reduce the use and impact of pesticides without losing yield quantity or quality, so as to meet the needs of a growing population, is increasingly important [6].

Researchers have applied a wide range of technologies to tackle specific goals. These include: climate prediction in agriculture using simulation models [7]; more efficient and effective production of certain types of grains with computer vision and Artificial Intelligence [11]; soil assessment with drones [14]; and the IoT paradigm, in which connected devices such as sensors capture real-time data at the field level that, combined with Cloud Computing, can be used to monitor agricultural components such as soil, plants, animals, weather and other environmental conditions [16]. The use of such ICTs to improve farming processes is known as smart farming [18].

In the context of smart farming, IoT devices are both data producers and data consumers, and they produce highly structured data; however, these devices and the technologies presented above are far from being the only data sources. Important information related to agriculture can also come from official periodic reports and journals such as the French Plants Health Bulletins (BSV, from the French Bulletin de Santé du Végétal), from social media such as Twitter, and from farmers’ experiences. The goal of the BSV is to: i) present a report of crop health, including stages of development, observations of pests and diseases, and the presence of symptoms related to them; and ii) provide an evaluation of the phytosanitary risk, according to the periods of crop sensitivity and the pest and disease thresholds. The BSV and other formal reports are semi-structured data.

In the agricultural context, Twitter, like other social media, can be used as a platform for knowledge exchange about sustainable soil management [10]; it can also help the public understand agricultural issues and support risk and crisis communication in agriculture [1]. Farmers’ experiences (also known as old farming practices or ancestral knowledge) may be collected through interviews and participatory processes. Social media posts and farmers’ experiences are non-structured data.

Fig. 1. Heterogeneous sources of agricultural data: non-structured data from Twitter and from farmers’ experiences (www.bio-centre.org); semi-structured data from the French Plants Health Bulletins; and structured data from a weather sensor (www.data.gouv.fr).

Figure 1 illustrates what such heterogeneous data from different sources may look like to farmers: information is not always explicit or timely. Our objective is to integrate such heterogeneous data into knowledge bases that can support farmers in their activities, and to present global, real-time and comprehensive information to researchers and interested parties. We present related work in Sect. 2, our initial approach in Sect. 3, and conclusions and perspectives in Sect. 4.

2 Previous Works

We classify existing work into two categories: information access and management in the plant health domain, and data integration in agriculture. In the first category, work on the semantic annotation of BSV focuses on extracting information from the traditional bulletins. For more than 50 years, printed plant health bulletins have been distributed by region and by crop in France, giving information about the arrival and evolution of pests, pathogens, and weeds, together with advice on preventive actions. These bulletins serve not only as agricultural alerts for farmers but also as documentation for those who want to study historical data. The French National Institute For Agricultural Research (INRA) has been working towards publishing the bulletins as Linked Open Data [12], where BSV from different regions are centralized, tagged with crop type, region and date, and published on the Internet. To organize the bulletins by crop usage in France, an ontology with 272 concepts was manually constructed. As the number of concepts and relations grows, manual construction of ontologies becomes too expensive [3]. Ontology learning methods that automatically extract concepts and relationships should therefore be studied.

INRA has also introduced a method to model an ontology for crop observation [13]. The process is as follows: 1) collect competency questions from researchers in agronomy; 2) construct the ontology corresponding to the requirements in the competency questions; 3) ask semantic experts who have not participated in the design of the ontology to translate the competency questions into SPARQL queries in order to validate the ontology design. In this exercise, a model describing the appearance of pests was given but not instantiated; nevertheless, it could serve as a reference for the design of our future crop-pest ontology.
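The validation idea in step 3 can be sketched in a few lines: a competency question is answerable only if the model exposes the classes and properties the query needs. The sketch below is illustrative only. It uses plain Python pattern matching over a set of triples as a stand-in for SPARQL, and every identifier in it (`onCrop`, `hasPest`, `PestObservation`, the sample observations) is hypothetical rather than taken from the INRA ontology.

```python
# Hypothetical instance data; property and class names are assumptions.
TRIPLES = {
    ("obs1", "type", "PestObservation"),
    ("obs1", "onCrop", "blé"),
    ("obs1", "hasPest", "puceron"),
    ("obs2", "type", "PestObservation"),
    ("obs2", "onCrop", "colza"),
    ("obs2", "hasPest", "altise"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def pests_on_crop(triples, crop):
    """Competency question: which pests were observed on a given crop?"""
    # Step 1: find the observations made on the crop.
    observations = [s for s, _, _ in match(triples, p="onCrop", o=crop)]
    # Step 2: collect the pests attached to those observations.
    return sorted(
        o for obs in observations
        for _, _, o in match(triples, s=obs, p="hasPest")
    )


print(pests_on_crop(TRIPLES, "blé"))
```

If a question cannot be expressed with the available properties, that is a sign that the ontology design is missing a concept or relation.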

Finally, Pest observer (http://www.pestobserver.eu/) is a web portal [15] which enables users to explore BSV with a combination of the following filters: crop, disease and pest; however, crop-pest relationships are not included. It relies on text-mining techniques to index BSV documents.

Regarding data integration in agriculture, AGRIS, the International System for Agricultural Science and Technology, reports that many initiatives are being developed to return more meaningful data to users [4]. Some of these initiatives are: extracting keywords by crawling the Web to build the AGROVOC vocabulary, which covers all areas of interest of the Food and Agriculture Organization of the United Nations; and SemaGrow [9], an open-source infrastructure for linked open data (LOD) integration that federates SPARQL endpoints from different providers. To extract pest- and insecticide-related relations, SemaGrow uses the Computer-aided Ontology Development Architecture (CODA) for RDF triplification of Unstructured Information Management Architecture (UIMA) results obtained from the analysis of unstructured content.

Although INRA kick-started the categorization of the French crop bulletins using linked open data, and the SemaGrow project shed light on heterogeneous data integration using ontologies, both projects focused on processing formal and technical documents. Moreover, in the CODA application case, the IsPestOf rule was defined but not instantiated. A global knowledge base that covers crops, natural hazards (including pests, diseases, and climate variations) and the relations between them is therefore still missing. There is also an increasing need for a comprehensive and automatic approach to integrating knowledge from a wider variety of heterogeneous sources.

Fig. 2. Our approach for building a phytosanitary knowledge

3 Proposed Design

Figure 2 illustrates our initial design for managing phytosanitary knowledge from heterogeneous data sources. It consists of a first phase based on ontology learning and a second phase based on ontology-based information extraction, and involves the following steps:

  • Linguistic preprocessing: Unstructured and semi-structured textual data are passed through a linguistic preprocessing pipeline (sentence segmentation, tokenization, part-of-speech (POS) tagging, lemmatization) built with existing natural language processing (NLP) tools such as Stanford NLP (https://nlp.stanford.edu/), GATE (https://gate.ac.uk/) and UIMA (https://uima.apache.org/).

  • Term/concept detection: To the best of our knowledge, and based on our study of the state of the art, there is no ontology in French that models natural hazards and their relations with crops. Existing French thesauri, such as the French crop usage thesaurus and AGROVOC, can be applied to filter collected data and can serve as gazetteers. Linguistic rules represented by regular expressions can be used to extract temporal data. Recurrent neural network (RNN), conditional random field (CRF) and bidirectional long short-term memory (BiLSTM) models have been applied to health-related named entity recognition from Twitter messages with remarkable results [2]. Once the ontology is populated, it could provide knowledge and constraints for the extraction of terms [17].

  • Relation detection: As with term/concept detection, there is initially no ontology. A basic strategy could be to use self-supervised methods such as Modified Open Information Extraction (MOIE): i) use WordNet-based semantic similarity and frequency distribution to identify related terms among the terms detected in the previous step; ii) slice the textual patterns between related terms [8]. Once the ontology is populated, it could contribute to the calculation of semantic similarities between detected terms in step i).

  • Ontology generation: The ontology is generated with CODA and PEARL, as in the SemaGrow project presented in Sect. 2.

  • Evaluation: This architecture supports a mutual, application-based evaluation design: ideally, the learned ontology should improve the information extraction. In addition, the Pest observer web portal can be used to validate phytosanitary information extracted from plant health bulletins.
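As a rough illustration of the preprocessing, term detection and relation detection steps, the following minimal sketch chains naive sentence segmentation and tokenization, a gazetteer lookup, a regular expression for dates, and the slicing of the text between detected terms. It is not the proposed system: the gazetteer entries, the date rule and all function names are hypothetical. The lemmatizer is a placeholder for a real NLP tool. Co-occurrence within a sentence stands in for the WordNet-based similarity and frequency filtering of MOIE.

```python
import re
from itertools import combinations

# Hypothetical gazetteer (lemma -> concept type); a real one would be
# loaded from a thesaurus such as AGROVOC.
GAZETTEER = {
    "blé": "Crop",
    "colza": "Crop",
    "puceron": "Pest",
    "septoriose": "Disease",
}

# Illustrative rule for dates such as "12 mai 2019"; real rules would
# cover many more forms.
MONTHS = ("janvier|février|mars|avril|mai|juin|juillet|août|"
          "septembre|octobre|novembre|décembre")
DATE_RE = re.compile(rf"\b(\d{{1,2}})\s+({MONTHS})\s+(\d{{4}})\b", re.IGNORECASE)


def split_sentences(text):
    """Naive sentence segmentation on '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercased word tokens (accented letters are kept)."""
    return re.findall(r"\w+", sentence.lower())


def lemmatize(token):
    """Placeholder lemmatizer: strips a plural 's' if the result is a known term."""
    if token not in GAZETTEER and token.endswith("s") and token[:-1] in GAZETTEER:
        return token[:-1]
    return token


def detect_terms(tokens):
    """Return (position, lemma, concept type) for tokens found in the gazetteer."""
    hits = []
    for i, tok in enumerate(tokens):
        lemma = lemmatize(tok)
        if lemma in GAZETTEER:
            hits.append((i, lemma, GAZETTEER[lemma]))
    return hits


def slice_patterns(tokens, terms):
    """Keep the tokens between each pair of detected terms as a candidate pattern."""
    return [
        (a, " ".join(tokens[i + 1:j]), b)
        for (i, a, _), (j, b, _) in combinations(terms, 2)
    ]


def extract(text):
    """Run the whole sketch and return one record per sentence."""
    results = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        terms = detect_terms(tokens)
        results.append({
            "sentence": sentence,
            "terms": [(lemma, kind) for _, lemma, kind in terms],
            "dates": [" ".join(m.groups()) for m in DATE_RE.finditer(sentence)],
            "patterns": slice_patterns(tokens, terms),
        })
    return results


for record in extract("Des pucerons observés sur le blé le 12 mai 2019. "
                      "Pas de risque pour le colza."):
    print(record)
```

In a fuller pipeline, the candidate patterns would be filtered and generalized before being proposed as relations, and a populated ontology could replace the flat gazetteer.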

4 Conclusions and Perspectives

New digital technologies allow farmers to predict the yield of their fields, to optimize their resources, and to protect their fields from natural hazards, whether these are due to the weather, pests or diseases. This is a recent area in which research is constantly evolving. In this paper, we have introduced work relevant to our problem, namely the integration of several data sources to extract information related to natural hazards in agriculture. We then proposed an architecture based on ontology learning and ontology-based information extraction. In a first phase, we plan to build an ontology from Twitter data that contain vocabulary found in the existing thesauri. To evaluate the constructed ontology, we will extract crops and pests from the learnt ontology and compare them with the tags in Pest observer. In the following iterations, we will work on ontology alignment strategies to update the ontology with data from other sources. To go further, multilingual ontology management that preserves spatio-temporal contexts should be investigated.
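One possible way to quantify the planned comparison with Pest observer tags is with set-based precision, recall and F1. This choice of metrics is our assumption for illustration, not a committed evaluation protocol, and the sample labels below are invented.

```python
def precision_recall_f1(extracted, reference):
    """Compare extracted labels with reference tags, both treated as sets."""
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels: terms from the learnt ontology vs. portal tags.
learnt = {"blé", "puceron", "rouille"}
tags = {"blé", "puceron", "colza", "septoriose"}
print(precision_recall_f1(learnt, tags))
```

The same function applies to crop labels and pest labels taken separately, which would show which concept type the learnt ontology covers less well.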