Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

  • Conference paper
Theory and Practice of Digital Libraries (TPDL 2012)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7489)

Abstract

Large digital libraries have become available in recent years through digitisation and aggregation projects. These large collections present a challenge to new users who wish to discover what the collections contain. Subject classification can help with this task, but in large collections it is frequently incomplete or inconsistent. Automatic clustering algorithms offer a solution; however, the question remains whether they produce clusters that are sufficiently cohesive and distinct to support discovery and exploration in digital libraries. In this paper we present a novel approach to investigating cluster cohesion that is based on identifying intruders in a cluster. The results from a human-subject experiment show that clustering algorithms produce clusters that are sufficiently cohesive to be used where no (consistent) manual classification exists.
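
The intruder idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' experimental setup: the abstract gives no procedural detail, so the function names (`build_intruder_task`, `detection_rate`), the choice of four cluster members plus one intruder, and the use of a simple detection rate as a cohesion indicator are all assumptions made for illustration. The intuition is that if people can reliably spot an item drawn from a different cluster, the original cluster is likely cohesive.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple


def build_intruder_task(
    clusters: Dict[str, List[str]],
    target: str,
    n_members: int = 4,  # assumed group size, not taken from the paper
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], str]:
    """Build one hypothetical intruder question for the `target` cluster.

    Returns the shuffled items shown to a participant and the true intruder.
    """
    rng = rng or random.Random()
    # Sample items that genuinely belong to the target cluster.
    members = rng.sample(clusters[target], n_members)
    # Draw the intruder from any other cluster.
    other = rng.choice([name for name in clusters if name != target])
    intruder = rng.choice(clusters[other])
    items = members + [intruder]
    rng.shuffle(items)  # hide the intruder's position
    return items, intruder


def detection_rate(answers: Sequence[str], intruder: str) -> float:
    """Fraction of participant answers that picked the true intruder."""
    if not answers:
        return 0.0
    return sum(answer == intruder for answer in answers) / len(answers)
```

Under these assumptions, a detection rate close to 1.0 for a cluster would suggest that participants perceive it as cohesive, while a rate near chance (0.2 for five items) would suggest the opposite.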


Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hall, M., Clough, P., Stevenson, M. (2012). Evaluating the Use of Clustering for Automatically Organising Digital Library Collections. In: Zaphiris, P., Buchanan, G., Rasmussen, E., Loizides, F. (eds) Theory and Practice of Digital Libraries. TPDL 2012. Lecture Notes in Computer Science, vol 7489. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33290-6_35

  • DOI: https://doi.org/10.1007/978-3-642-33290-6_35

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-33289-0

  • Online ISBN: 978-3-642-33290-6

  • eBook Packages: Computer Science (R0)
