Distributed Classification of Textual Documents on the Grid

  • Conference paper
High Performance Computing and Communications (HPCC 2006)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4208)

Abstract

Efficient access to information, integration of information from various sources, and the transformation of this information into knowledge are currently major challenges in life science research. However, a large fraction of this information is available only from scientific articles stored in huge document databases in free-text format, or from the Web, where it is available in semi-structured format.

Text mining provides methods (e.g., classification and clustering) that can automatically extract relevant knowledge patterns from free-text data. Including Grid text-mining services in a Grid-based knowledge discovery system can significantly support problem-solving processes based on such a system.

The motivation for the research presented in this paper is to use the Grid's computational, storage, and data access capabilities for text mining tasks, and for text classification in particular. Text classification methods are time-consuming, and utilizing the Grid infrastructure can bring significant benefits. Implementing text mining techniques in a distributed environment allows us to access geographically distributed data collections and to perform text mining tasks in a parallel or distributed fashion.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Janciak, I., Sarnovsky, M., Tjoa, A.M., Brezany, P. (2006). Distributed Classification of Textual Documents on the Grid. In: Gerndt, M., Kranzlmüller, D. (eds) High Performance Computing and Communications. HPCC 2006. Lecture Notes in Computer Science, vol 4208. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11847366_73

  • DOI: https://doi.org/10.1007/11847366_73

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-39368-9

  • Online ISBN: 978-3-540-39372-6

  • eBook Packages: Computer Science, Computer Science (R0)
