ABSTRACT
Social data is widely used in applications such as sentiment analysis and trend prediction, but its sheer size makes it hard to store, share, and process. Data summarization addresses these challenges by transforming the original dataset into a smaller, yet still useful, subset. Existing methods find such subsets with objective functions based on data properties such as representativeness or informativeness, but they do not exploit social contexts, which are distinct characteristics of social data. Further, to date very little work has focused on topic-preserving data summarization, despite the abundant work on topic modeling. This task is challenging for two reasons. First, since topic models are based on latent variables, existing summarization methods are not well suited to capturing latent topics. Second, it is difficult to identify social contexts that provide valuable information for building an effective topic-preserving summarization model. To tackle these challenges, in this paper we focus on exploiting social contexts to summarize social data while preserving the topics in the original dataset. We take Twitter data as a case study. By analyzing Twitter data, we discover two social contexts that are important for topic generation and dissemination, namely (i) the CrowdExp topic score, which captures the influence of both the crowd and the expert users on Twitter, and (ii) the Retweet topic score, which captures the influence of Twitter users' actions. We conduct extensive experiments on two real-world Twitter datasets using two applications. The experimental results show that, by leveraging social contexts, our proposed solution can enhance topic-preserving data summarization and improve application performance by up to 18%.
Index Terms
- Data Summarization with Social Contexts