ABSTRACT
This tutorial makes the case for developing a unified framework that manages information extraction from unstructured data, focusing in particular on text. We first survey recent research on information extraction in the database, AI, NLP, IR, and Web communities. We then discuss why this is the right time for the database community to participate actively and address the problem of managing information extraction, including the challenges of maintaining and querying the extracted information and of accounting for the imprecision and uncertainty inherent in the extraction process. Finally, we show how interested researchers can take the next step by pointing to open problems, available datasets, applicable standards, and software tools. We do not assume prior knowledge of text management, NLP, extraction techniques, or machine learning.
Index Terms
- Managing information extraction: state of the art and research directions