
1 Introduction

The target audience of our platform consists of knowledge workers who conduct research in specific domains, for example to prepare museum exhibitions or to write news articles. Typically, they only have limited time available to accomplish their tasks, ranging from several hours to one or two weeks at most. Owing to the diversity of tasks, the domains are often new to them. The output of their work is typically used in online or traditional media (e.g., newspapers, news agencies, TV stations) or by a museum for an exhibition. In the project Digitale Kuratierungstechnologien (DKT, Digital Curation Technologies, [7]; Footnote 1), we aim at automating specific parts of these workflows, which consist of looking for information relevant to the domain, learning the key concepts, selecting the most relevant parts and preparing the information for use. We are building a platform that integrates various Natural Language Processing (NLP), Information Retrieval (IR) and Machine Translation (MT) components to retrieve information and recombine it into output that improves the curation processes and makes them more efficient. We work with heterogeneous data sets from public and non-public sources. The output takes the form of a semantically enriched hypertext graph, stored and accessed using linked data technologies. Our goal is to enable knowledge workers to explore and curate document collections more easily and efficiently [6].

2 Related Work

Our digital curation platform integrates individual components by linking content with metadata [2]. The components include open-source tools for NLP tasks such as Named Entity Recognition (NER), Information Extraction (IE), IR and MT. The platform uses an architecture developed in the project FREME [8]. A related platform, focusing on the localisation industry as the main use case, is described by [1]. Our platform targets several other industry sectors, focuses on different digital content formats and takes a different approach to representing curated information.

3 Semantic Layer

For the purpose of this paper we focus on written text documents. Eventually we will also be able to handle the conversion of non-textual data into text (e.g., transcripts for audio, subtitles for video). On top of the text data we generate a Semantic Layer (SL) that contains semantic annotations. The SL provides an interlinked representation connected to external information sources. It is produced by a set of tools that communicate using the NLP Interchange Format (NIF; Footnote 2) and operate in a pipelined workflow in which the output of each service serves as input for the next; a minimal sketch of this service chaining is given after the component list below. The SL can be used for exploratory search: the user query is sent through the same pipeline used to generate the SL over the whole document collection. This allows us to search the index not only for the plain words in the query but also for any entities or temporal expressions that were recognized. The components of the pipelined workflow are:

  • NLP: This component performs NER, combining a model-based and a dictionary-based approach. It works with three types of entities: persons, locations and organizations. Any entities found in the input are annotated using NIF. After NER, we perform Entity Linking: for entities recognized by the model, we use DBpedia Spotlight [4] to retrieve the relevant DBpedia URI, while for entities recognized by the dictionary we rely on the URI stored there directly (each dictionary entry maps a key, the entity, to a value, i.e., a URI in some ontology). Subsequently the NLP workflow performs a temporal expression analysis. This module consists of a language-specific regular expression grammar and currently supports German and English. The expressions are normalized to a machine-readable format and added to the NIF model.

  • Information extraction: We use Lucene (Footnote 3) to create an index for our document collection that enables text-based IR. In addition to the text content, entities and temporal expressions are indexed in dedicated fields so that the SL can be searched as well; a small indexing sketch follows this list. Indexing entities will also allow us to disambiguate based on entity clustering (planned for the next phase of this two-year project).

  • Semantic Storage: The semantic information generated during the NLP processes is stored in the triple store Sesame (Footnote 4). We use an ontology that relates the semantic information extracted from the documents; it relies on Schema.org to describe entities and covers documents and concepts, with the concepts divided into locations, organizations, persons and temporal expressions. A brief storage sketch is given after this list.

  • Multilingual component: This component is based on Moses (Footnote 5), enhanced with pre-/post-processing modules that leverage the information obtained in preceding steps (e.g., NER, temporal analysis); a minimal invocation sketch follows this list. The MT system is capable of translating both segments (sentences, subtitles) and documents, enabling knowledge workers to retrieve information and to present the semantically enriched output in several languages (English, German, Spanish, Arabic). Preliminary experiments show improvements of up to 5% in overall MT system performance for multiple language pairs and domains.
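
To make the pipelined workflow more concrete, the following minimal sketch chains the NLP steps as REST services that exchange NIF (Turtle) payloads. The endpoint URLs, the language parameter and the content-type conventions are illustrative assumptions, not the actual DKT API.

```python
# Minimal sketch of the pipelined Semantic Layer workflow, assuming each
# component is exposed as a REST service that consumes and produces NIF
# (Turtle). Endpoint URLs and parameters are hypothetical placeholders.
import requests

# Hypothetical service endpoints for the individual pipeline steps.
PIPELINE = [
    "http://localhost:8080/e-nlp/ner",       # NER (model + dictionary)
    "http://localhost:8080/e-nlp/linking",   # entity linking (DBpedia URIs)
    "http://localhost:8080/e-nlp/temporal",  # temporal expression analysis
]

def run_pipeline(text: str, language: str = "de") -> str:
    """Send plain text through the pipeline; each step enriches the NIF graph."""
    payload = text
    content_type = "text/plain"
    for endpoint in PIPELINE:
        response = requests.post(
            endpoint,
            data=payload.encode("utf-8"),
            params={"language": language},
            headers={"Content-Type": content_type, "Accept": "text/turtle"},
        )
        response.raise_for_status()
        payload = response.text       # NIF with the annotations added so far
        content_type = "text/turtle"  # subsequent steps consume NIF
    return payload

if __name__ == "__main__":
    nif_graph = run_pipeline("Die Schlacht bei Leipzig fand im Oktober 1813 statt.")
    print(nif_graph)
```

The same function can be applied to a user query for exploratory search, so that query and document collection are annotated consistently.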
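The dedicated index fields for entities and temporal expressions can be sketched as follows. Whoosh is used here as a lightweight, pure-Python stand-in for Lucene, and the field names and example values are illustrative assumptions.

```python
# Sketch of an index with dedicated fields for entities and temporal
# expressions, so that queries can address the Semantic Layer directly.
# Whoosh stands in for Lucene; field names are illustrative assumptions.
import os
from whoosh.fields import Schema, TEXT, ID, KEYWORD
from whoosh.index import create_in
from whoosh.qparser import MultifieldParser

schema = Schema(
    doc_id=ID(stored=True, unique=True),
    content=TEXT(stored=True),                         # plain document text
    entities=KEYWORD(stored=True, commas=True),        # linked DBpedia URIs
    temporals=KEYWORD(stored=True, commas=True),       # normalized temporal expressions
)

os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

writer = ix.writer()
writer.add_document(
    doc_id="doc-001",
    content="The Battle of Leipzig took place in October 1813.",
    entities="http://dbpedia.org/resource/Battle_of_Leipzig",
    temporals="1813-10",
)
writer.commit()

# Search the plain text and the Semantic Layer fields at the same time.
with ix.searcher() as searcher:
    parser = MultifieldParser(["content", "entities", "temporals"], ix.schema)
    for hit in searcher.search(parser.parse("1813-10")):
        print(hit["doc_id"], hit["content"])
```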
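The following sketch shows, under simplified assumptions, how extracted information could be expressed as RDF using Schema.org types and pushed to a Sesame repository over HTTP; the project namespace, the property choices and the repository URL are placeholders, not the platform's actual ontology or deployment.

```python
# Sketch of storing extracted semantic information as RDF, roughly following
# the described ontology (documents and concepts typed via Schema.org).
# The namespace, property choices and repository URL are assumptions.
import requests
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

SCHEMA = Namespace("http://schema.org/")
DKT = Namespace("http://example.org/dkt/")  # placeholder project namespace

g = Graph()
g.bind("schema", SCHEMA)

doc = DKT["doc-001"]
person = DKT["entity/Napoleon"]  # placeholder concept URI

g.add((doc, RDF.type, SCHEMA.CreativeWork))
g.add((doc, SCHEMA.text, Literal("Napoleon was defeated at Leipzig in October 1813.")))
g.add((person, RDF.type, SCHEMA.Person))
g.add((doc, SCHEMA.mentions, person))                    # document mentions a concept
g.add((doc, SCHEMA.temporalCoverage, Literal("1813-10")))  # normalized temporal expression

# Push the triples to a Sesame repository via its HTTP interface
# (host and repository name "dkt" are placeholders).
requests.post(
    "http://localhost:8080/openrdf-sesame/repositories/dkt/statements",
    data=g.serialize(format="turtle"),
    headers={"Content-Type": "text/turtle"},
)
```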
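A minimal sketch of wrapping the Moses decoder with simple pre-/post-processing hooks is given below; the placeholder scheme for protecting recognized entities, as well as the paths to the decoder binary and the model configuration, are assumptions for illustration and do not reproduce the platform's actual modules.

```python
# Minimal sketch of wrapping the Moses decoder with pre-/post-processing
# hooks, e.g. to protect recognized named entities during translation.
# Paths, the placeholder scheme and the hook logic are illustrative assumptions.
import subprocess

MOSES_BIN = "/opt/moses/bin/moses"         # placeholder path to the decoder
MOSES_INI = "/opt/moses/model/moses.ini"   # placeholder path to the model config

def preprocess(segment: str, entities: list[str]) -> tuple[str, dict[str, str]]:
    """Replace recognized entities with placeholders so they survive decoding."""
    mapping = {}
    for i, entity in enumerate(entities):
        token = f"__NE{i}__"
        mapping[token] = entity
        segment = segment.replace(entity, token)
    return segment, mapping

def postprocess(segment: str, mapping: dict[str, str]) -> str:
    """Restore the original entities in the translated output."""
    for token, entity in mapping.items():
        segment = segment.replace(token, entity)
    return segment

def translate(segment: str, entities: list[str]) -> str:
    prepared, mapping = preprocess(segment, entities)
    # The Moses decoder reads source segments from stdin and writes
    # translations to stdout when given a model configuration via -f.
    result = subprocess.run(
        [MOSES_BIN, "-f", MOSES_INI],
        input=prepared,
        capture_output=True,
        text=True,
        check=True,
    )
    return postprocess(result.stdout.strip(), mapping)

print(translate("Die Schlacht bei Leipzig fand 1813 statt.", ["Leipzig"]))
```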

4 Experiments

Our goal is to reduce the time knowledge workers invest in their sector-specific curation processes. A proper evaluation would require measuring the time it takes knowledge workers to get from input to output with and without our platform, which is difficult to measure and to quantify. We are at an early stage of the project and do not yet have access to suitable data for such an evaluation. As the project progresses, we will acquire real-world data (to be provided by the industrial partners involved in the project) and annotate it to construct a gold standard, so that the platform can be evaluated as a whole. For now, we can offer isolated evaluations of individual components. For evaluating the German version of the temporal expression analyzer, we use the German WikiWars corpus [3]. This corpus is a collection of 22 documents sourced from Wikipedia pages about military conflicts and contains 2,240 temporal expressions. Evaluating against this corpus, we report an F-score of 0.83. However, we developed against this same corpus and, since we are mainly interested in the coverage of our regular expressions, the corpus was not divided into training and test sets. We consider this F-score an acceptable baseline and will continue to improve it during the project. For evaluating the German version of the NER module, we selected the German WikiNER corpus [5], which contains NER annotations in CoNLL format. Here we report F-scores of 0.78, 0.87 and 0.76 for locations, persons and organizations, respectively. These numbers will also serve as baselines for future work.
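
The reported F-scores follow the standard precision/recall-based definition; the small helper below computes them for predicted annotation spans against gold spans (the (start, end, label) span representation is an illustrative assumption).

```python
# Standard precision/recall/F1 over annotation spans, as used for the
# temporal expression and NER evaluations. The (start, end, label) span
# representation and the example spans are illustrative assumptions.
def evaluate(gold: set[tuple[int, int, str]],
             predicted: set[tuple[int, int, str]]) -> tuple[float, float, float]:
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

gold = {(0, 14, "DATE"), (20, 27, "LOC")}
predicted = {(0, 14, "DATE"), (30, 35, "PER")}
print(evaluate(gold, predicted))  # (0.5, 0.5, 0.5)
```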

5 Conclusion and Future Work

This article addresses the challenge of combining NLP, IR and MT procedures into a system that enables knowledge workers to explore a collection of documents in an intuitive and efficient way. Our focus is on combining the individual components and linking the output of the methods, rather than on improving the output of individual state-of-the-art procedures. At this early stage of the project, we can aggregate the information contained in multiple documents and present it in a way that allows the knowledge worker to see what the collection contains. In the future we plan to make our tools easily adaptable to new domains. This poses a challenge, since we expect to deal with domains for which only limited amounts of training data are available. We also plan to make more extensive use of linked open data by plugging in new datasets. Further applications are related to the project goals: text summarization of documents will support the curation process, and semantic story-telling will assist in text generation processes by relating individual document components at a semantic level.