Abstract
The strong trend towards automating many aspects of scientific enquiry and scholarship has started to affect the social sciences and even the humanities. Several recent articles have demonstrated the application of pattern analysis techniques to the discovery of non-trivial relations in various datasets relevant to the social and human sciences, and some have even heralded the advent of “Computational Social Sciences” and “Culturomics”. In this review article I survey the results obtained over the past five years at the Intelligent Systems Laboratory in Bristol in the area of automating the analysis of news media content. This endeavor, which we approach by combining pattern recognition, data mining and language technologies, is traditionally a part of the social sciences and is normally performed by human researchers on small sets of data. The analysis of news content is of crucial importance because of the central role that the global news system plays in shaping public opinion, markets and culture. Today a large part of global news is freely accessible online, which makes it possible to devise automated methods for large-scale, continuous monitoring of patterns in content. The results presented in this survey show how the automatic analysis of millions of documents in dozens of languages can detect non-trivial macro-patterns that could not be observed at a smaller scale, and how the social sciences can benefit from closer interaction with the pattern analysis, artificial intelligence and text mining research communities.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cristianini, N. (2011). Automatic Discovery of Patterns in Media Content. In: Giancarlo, R., Manzini, G. (eds) Combinatorial Pattern Matching. CPM 2011. Lecture Notes in Computer Science, vol 6661. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21458-5_2
DOI: https://doi.org/10.1007/978-3-642-21458-5_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21457-8
Online ISBN: 978-3-642-21458-5
eBook Packages: Computer Science, Computer Science (R0)