Tweet-scan-post: a system for analysis of sensitive private data disclosure in online social media

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Social media platforms are open to users who intend to build communities and publish their opinions on recent incidents. Participants in online social networking sites often remain unaware of the risks of disclosing personal data to a public audience. Users’ private data are at high risk, leading to adverse effects such as cyberbullying, identity theft, and job loss. This research work defines user entities or data such as phone number, email address, family details, and health-related information as a user’s sensitive private data (SPD) on a social media platform. The proposed system, Tweet-Scan-Post (TSP), focuses on identifying the presence of SPD in users’ posts under the personal, professional, and health domains. The TSP framework is built on the standards and privacy regulations set out by social networking sites and by sources such as NIST, DHS, and GDPR. The TSP approach addresses the prevailing challenges in determining the presence of sensitive PII and in preserving user privacy within the bounds of confidentiality and trustworthiness. The framework uses a novel layered classification approach with various state-of-the-art machine learning models to classify tweets as sensitive or insensitive. The findings of the TSP system include 201 Sensitive Privacy Keywords, obtained using a boosting strategy, and a sensitivity scaling that measures the degree of sensitivity associated with a tweet. The experimental results revealed that personal tweets were highly related to mother and children, professional tweets to apology, and health tweets to concern over the father’s health condition.
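The layered idea described in the abstract (first decide whether a tweet is sensitive, then assign it to the personal, professional, or health domain, and attach a sensitivity score) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the keyword lists are hypothetical stand-ins for the 201 Sensitive Privacy Keywords, which are not reproduced here, the regular expressions are simple placeholder detectors for email addresses and phone numbers, and plain keyword counting replaces the machine learning models and boosting strategy that the TSP framework actually uses.

```python
import re
from typing import Dict, Optional, Set

# Hypothetical keyword lists for illustration only; these are NOT the
# paper's Sensitive Privacy Keywords.
DOMAIN_KEYWORDS: Dict[str, Set[str]] = {
    "personal": {"mother", "children", "address", "birthday"},
    "professional": {"boss", "salary", "fired", "apology"},
    "health": {"hospital", "surgery", "diagnosed", "medication"},
}

# Simple placeholder detectors for two SPD entities named in the abstract.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def domain_hits(text: str) -> Dict[str, int]:
    """Count keyword matches per domain."""
    tokens = tokenize(text)
    return {
        domain: sum(1 for token in tokens if token in keywords)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }


def classify(text: str) -> Dict[str, Optional[object]]:
    """Two-layer sketch: sensitive vs. insensitive, then domain and score."""
    hits = domain_hits(text)
    has_identifier = bool(EMAIL_RE.search(text) or PHONE_RE.search(text))
    score = sum(hits.values()) + (1 if has_identifier else 0)

    # Layer 1: no keyword and no identifier means the tweet is insensitive.
    if score == 0:
        return {"sensitive": False, "domain": None, "score": 0}

    # Layer 2: pick the domain with the most keyword hits; a bare
    # identifier (email or phone) is treated as personal data.
    if any(hits.values()):
        domain = max(hits, key=hits.get)
    else:
        domain = "personal"
    return {"sensitive": True, "domain": domain, "score": score}


if __name__ == "__main__":
    for tweet in (
        "My mother is taking the children to school",
        "Nice weather today",
        "Call me at 555-123-4567",
    ):
        print(tweet, "->", classify(tweet))
```

The score here is a raw count, which is only a crude proxy for the sensitivity scaling mentioned in the abstract; a closer approximation would weight keywords by how strongly each one signals a disclosure.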

Author information

Corresponding author

Correspondence to R. Geetha.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Geetha, R., Karthika, S. & Kumaraguru, P. Tweet-scan-post: a system for analysis of sensitive private data disclosure in online social media. Knowl Inf Syst 63, 2365–2404 (2021). https://doi.org/10.1007/s10115-021-01592-2

  • DOI: https://doi.org/10.1007/s10115-021-01592-2
