Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Hall, Mark; Clough, Paul; Stevenson, Mark

doi:10.1007/978-3-642-33290-6_35

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7489)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

Large digital libraries have become available over the past years through digitisation and aggregation projects. These large collections present a challenge to the new user who wishes to discover what is available in the collections. Subject classification can help in this task; however, in large collections it is frequently incomplete or inconsistent. Automatic clustering algorithms provide a solution to this, but the question remains whether they produce clusters that are sufficiently cohesive and distinct to support discovery and exploration in digital libraries. In this paper we present a novel approach to investigating cluster cohesion that is based on identifying intruders in a cluster. The results from a human-subject experiment show that clustering algorithms produce clusters that are sufficiently cohesive to be used where no (consistent) manual classification exists.
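
The abstract does not spell out the experimental protocol, so the following is only a rough sketch of what an intruder-identification trial could look like. It assumes a trial shows a participant a few items from one cluster plus one item drawn from a different cluster, and that cohesion is summarised as the share of participants who pick the true intruder. The function names, the `n_members` parameter, and the scoring rule are all hypothetical and are not taken from the paper.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple


def build_intruder_task(
    clusters: Dict[str, Sequence[str]],
    target: str,
    n_members: int = 4,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], str]:
    """Build one hypothetical trial: (shuffled items, true intruder)."""
    rng = rng or random.Random()
    # Sample distinct members from the cluster under test.
    members = rng.sample(list(clusters[target]), n_members)
    # Pick the intruder from any other cluster (assumed design choice).
    other = rng.choice([name for name in clusters if name != target])
    intruder = rng.choice(list(clusters[other]))
    items = members + [intruder]
    rng.shuffle(items)  # hide the intruder's position
    return items, intruder


def detection_rate(answers: Sequence[str], intruder: str) -> float:
    """Fraction of participant answers that named the true intruder."""
    if not answers:
        raise ValueError("at least one answer is required")
    return sum(answer == intruder for answer in answers) / len(answers)
```

Under these assumptions, a cluster whose intruder is spotted by most participants would count as cohesive, while a detection rate near chance (one in `n_members + 1`) would suggest the cluster is hard to tell apart from the rest of the collection.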

Author information

Authors and Affiliations

Department for Computer Science, Sheffield University, Sheffield, UK
Mark Hall & Mark Stevenson
Information School, Sheffield University, Sheffield, UK
Mark Hall & Paul Clough

Editor information

Editors and Affiliations

Department of Multimedia and Graphic Arts, Cyprus University of Technology, 3036, Limassol, Cyprus
Panayiotis Zaphiris & Fernando Loizides
School of Informatics, City University of London, Northampton Square, EC1V 0HB, London, UK
George Buchanan
School of Library, Archival and Information Studies, Irving K. Barber Learning Centre, The University of British Columbia, V6T 1Z3, Vancouver, BC, Canada
Edie Rasmussen

About this paper

Cite this paper

Hall, M., Clough, P., Stevenson, M. (2012). Evaluating the Use of Clustering for Automatically Organising Digital Library Collections. In: Zaphiris, P., Buchanan, G., Rasmussen, E., Loizides, F. (eds) Theory and Practice of Digital Libraries. TPDL 2012. Lecture Notes in Computer Science, vol 7489. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33290-6_35

DOI: https://doi.org/10.1007/978-3-642-33290-6_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33289-0
Online ISBN: 978-3-642-33290-6
eBook Packages: Computer Science, Computer Science (R0)