Words are important: A textual content based identity resolution scheme across multiple online social networks

https://doi.org/10.1016/j.knosys.2020.105624

Highlights

  • Propose a scheme which utilizes the contents of OSN users to match their identities.

  • Text Mining and NLP are used to extract the features from users’ contents.

  • On the basis of experiments, we achieved accuracy of 91.2 percent in identity matching.

  • This scheme will enhance the accuracy of identity resolution frameworks.

Abstract

Identity resolution of a person using various online social networks can enable an interested party to have a better and holistic understanding of that person’s behavior and personality. Major challenges in developing a reliable and scalable matching scheme for online identities include the non-availability of required information and contradictory information for the same user across these networks. In this study, we present a scheme for identity matching which utilizes important features extracted from content generated by or shared with users across their online social networks. With the help of natural language processing and text mining techniques, we extract and process parts of speech, symbols, emoticons, numbers, and high-frequency words in users’ posts, tweets, retweets, and URLs. In experiments with real ground-truth Twitter–Facebook datasets, this method achieved 91.2 percent accuracy in matching a user’s identity across the user’s profiles. The main contribution of this paper is a novel method for identity matching that utilizes only the publicly available content information of online social network users. This method can be used alone for identity matching, or alongside other identity resolution frameworks to enhance their accuracy.

Introduction

The immensely popular online social networks (OSNs) have provided analysts with a simple but rich source of data about individuals and society. These networks capture information about individuals’ online behavior, wherein they talk, chat and respond on a plethora of topics and issues [1], as well as their real-world behavior, such as shared locations, check-ins, photos, etc. Diverse OSNs employ distinct means to engage individuals and fulfill their requirements [2]. For instance, people use Facebook to create a network of their acquaintances, LinkedIn to extend their professional contacts and networks, and Twitter to share information. On each OSN of interest, an individual registers an online identity, which includes a set of attributes that portray him uniquely within the given network, distinct from the other users. Such an online identity, while enabling unique identification of users, helps people on the network to connect and interact with each other. A user’s online identity includes profile information like name, location, education, etc., network information containing the links to people he is connected to, and content information made up of the posts created by him or those shared by or with him. The OSNs, being privy to all these data, can allow for better overall understanding and profiling of their users. For example, the methods for community detection in a social network can be improved by meaningful integration of this publicly available information. Similarly, the problem of link prediction, which can also be extended to inferring missing links in social networks, can be solved by utilizing user profile data. Marketers can harness these publicly available data to understand their clients better and extend appropriate services for better customer satisfaction.

In the literature, the process of identifying and coupling the identities of a person across different OSNs, given his identity on one social network, is termed ‘Identity Resolution across OSNs’ [3]. However, the heterogeneity of any given user’s profiles across the networks poses a great challenge to this endeavor. Due to increasing awareness about privacy, many individuals either avoid or restrict disclosing some of their information [4], [5]; this leads to incomplete and sometimes even contradictory data for the identity resolution methods to tackle. This often results in dissimilar online personalities for the same person scattered across the web, with no explicit connection directing one to the other.

There has been a lot of interest in the research community as well as in industry in the potential benefits of matching an individual’s online identity across multiple OSNs. Business organizations, political parties, and non-profit institutions utilize social media data to understand sentiments towards their products, events, ideologies, or institutions. To get easy access to such data, these entities create accounts on these OSN platforms and request individuals to ‘follow’ or ‘like’ their accounts, and also to share their feedback on these accounts. A consistent social media strategy allows these entities to develop a social audience. One measure to evaluate the success of such an endeavor is the number of users discussing or liking their offerings, and the sentiments expressed therein. However, with these entities setting up accounts on multiple social networks, it is difficult to estimate their exact social audience. A single individual can contribute to the same activity through his multiple OSN accounts; for instance, he may follow the official ‘Netflix India’ account on Twitter using his Twitter account, and like the ‘Netflix India’ account on Facebook through his Facebook account. Identity resolution of all their social media audience can allow these entities to estimate the actual number of individuals involved. Similarly, most e-commerce sites use customers’ prior online buying behavior to offer them customized recommendations. Better recommendations can be made based on a customer’s interests, likes and dislikes, purchasing capacity, etc., captured through his activity on various OSNs. Instagram images posted by an individual can highlight the places he has visited or would love to visit; Foursquare check-ins, tips, and reviews provide ideas on the cuisine an individual likes; a LinkedIn profile can provide an insight into the purchasing capacity of the individual, and so on. Thus, the information about an individual is mostly incomplete within the boundary of any single social platform. Matching identities of these online users across diverse social media platforms is therefore imperative to enable comprehensive profiling for viable personalization and targeting. Security practitioners may also benefit from such identity resolution; they often need to identify a person’s characteristics and detect the presence of any deceitful traits in his public profile. Linking an individual’s multiple online identities can immensely help in such endeavors.

In the past few years, many researchers and analysts have successfully utilized structured and unstructured social media data to develop predictive models and thereby extract meaningful information [6], [7], [8]. A large proportion of the current identity matching techniques involve two important phases: feature extraction and model construction. In the feature extraction stage, features are first extracted for a few users from their corresponding profiles. Based on an analysis of similarities amongst these features, one can ascertain whether the profiles belong to the same user or not. To compute the similarities between extracted features like words and numbers, one can use string and number matching algorithms, and the similarity scores so obtained can be normalized such that 0 equates to no similarity and 1 to an exact match. The extracted features are then utilized for model development, wherein a supervised, semi-supervised, or unsupervised model is trained. These trained models are further utilized to ascertain whether the identities match for a new set of data. Most current studies on identity resolution techniques focus mainly on the profile information (username, name, education, location, etc.) of users, and overlook the content information. It may be noted that while many desired profile attributes of an individual may not be available due to privacy concerns [9], such concerns do not usually extend to the content attributes, thereby making the posts, tweets, and URLs, which manifest the interests, views and writing style of users, easily accessible to researchers. A user’s content information (e.g., posts, tweets, and URLs), however diverse it may be across the OSNs, can play an important role in the process of identity matching. For example, users show their interests or views on a topic via posts or tweets. This information can help in matching an individual’s identity across the OSNs and thereby increase the accuracy of the identity resolution frameworks.
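
The similarity computation described in this paragraph can be sketched as follows. This is a minimal illustration, assuming Python’s standard `difflib.SequenceMatcher` as the string matcher and a relative-difference measure for numbers; neither choice comes from the paper, and the function names `string_similarity` and `number_similarity` are hypothetical. Both functions return scores in [0, 1], using the convention that 1 means an exact match and 0 means no similarity.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]; 1.0 means identical."""
    if not a and not b:
        return 1.0  # assumption: two empty strings count as an exact match
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def number_similarity(x: float, y: float) -> float:
    """Similarity of two numbers as 1 - |x - y| / max(|x|, |y|), clamped to [0, 1]."""
    largest = max(abs(x), abs(y))
    if largest == 0:
        return 1.0  # both values are zero, hence identical
    return max(0.0, 1.0 - abs(x - y) / largest)


if __name__ == "__main__":
    # Hypothetical feature values taken from two profiles being compared.
    print(string_similarity("John Smith", "john_smith"))
    print(number_similarity(120, 100))
```

Scores on a common [0, 1] scale can then be placed side by side in one feature vector per candidate profile pair, which is the form a classifier in the model construction phase would typically consume.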

As stated above, while privacy concerns may restrict the availability of profile attributes and the inconsistency therein may limit their usefulness, the same is not true for content attributes. With this as the motivation, the current study explores the use of content attributes in matching users’ profiles across OSNs and presents a scheme wherein the important features extracted from posts and tweets are successfully used to match the identities of users across OSNs. This study extracts implicit features from the content attributes of OSN users, like parts of speech, symbols, emoticons, common high-frequency words, etc., for the purpose of identity resolution across OSNs. No research, to the best of our knowledge, has tried to utilize these sets of implicit features extracted from the content attributes of OSN users to link their identities across OSNs. Thus, this paper contributes to the existing literature by introducing a set of implicit features extracted from the content information of users that can be used for identity resolution, and also presents a way to exploit these features towards that endeavor. The analysis presented herein confirms that the use of such features in the matching process can enhance the accuracy of identity resolution frameworks, especially in light of the fact that unique identifiers like a user’s mobile number, email or social security number are not always present in the user’s profile. This study also indicates that the content attributes on their own, without any consideration of the profile or network attributes, can match identities of social network users with substantial accuracy. This approach thus has the promise of liberating analysts from the issue of hidden or missing profile attribute values due to privacy concerns while working towards identity resolution.
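
To make the notion of implicit content features more concrete, the following is a minimal sketch of how emoticons, numbers, symbols, and high-frequency words might be pulled from a user’s raw posts. It assumes simple regular expressions; the patterns, the function name `extract_features`, and the `top_k` parameter are illustrative assumptions rather than the scheme’s actual implementation. Part-of-speech tagging is omitted here because it needs an external tagger.

```python
import re
from collections import Counter

# Hypothetical patterns chosen for illustration only.
# Emoticons must stand alone as a token, e.g. ":)", ";-)", "=D", ":p".
EMOTICON_RE = re.compile(r"(?<!\S)[:;=]-?[)(DPp](?!\S)")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")
SYMBOL_RE = re.compile(r"[#@$%&!?]")
WORD_RE = re.compile(r"[a-z']+")


def extract_features(posts, top_k=3):
    """Count emoticons, numbers, and symbols, and list the top_k frequent words."""
    text = " ".join(posts)
    emoticons = Counter(EMOTICON_RE.findall(text))
    # Drop emoticons first so their characters are not counted as words.
    stripped = EMOTICON_RE.sub(" ", text)
    numbers = Counter(NUMBER_RE.findall(stripped))
    symbols = Counter(SYMBOL_RE.findall(stripped))
    words = Counter(WORD_RE.findall(stripped.lower()))
    return {
        "emoticons": emoticons,
        "numbers": numbers,
        "symbols": symbols,
        "top_words": [word for word, _ in words.most_common(top_k)],
    }


if __name__ == "__main__":
    # Made-up posts, not data from the study.
    sample = ["Great game today :) #win", "Great coffee at 9 :) :D", "game on!"]
    print(extract_features(sample, top_k=2))
```

Running the same extraction on a user’s Twitter content and on each candidate Facebook profile yields comparable feature sets, whose overlap can then be scored for matching.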

The rest of this paper is organized as follows: related work from the literature is presented in Section 2; the description of the methodology and the proposed algorithm are explained in Section 3; Section 4 elucidates the evaluation experiments followed by model development; results are discussed in Section 5; and Section 6 concludes the paper.

Section snippets

Related work

Many individuals communicate over social media platforms using messages, posts, tweets, etc. [10], [11], [12]. Through these, individuals freely share their views and feelings with their network of friends, a bigger group of acquaintances, or even the entire online world. OSNs have accumulated an exceptional amount of written language, vast amounts of which are freely accessible [13]. Twitter users alone write around 500 million messages each day [14]. The written language thus gathered

Methodology

The objective of this proposed scheme for identity resolution is to predict the most probable (MP) profile of a user ‘X’ from Facebook, given the Twitter profile of the same user ‘X’, assuming that the user ‘X’ has accounts on both OSNs. We consider Twitter as the source (S) OSN from which we take a given profile, and Facebook as the target (T) OSN wherein we identify the corresponding matching profile. Towards this, the method uses part-of-speech tagging (noun, proper noun, verb, etc.),

Experimental setup

This section elucidates the experimental setup including details of the dataset and evaluation metrics used for the analysis.

Results and discussion

Figs. 9, 10, and 11 show the comparison of the area under ROC, Precision/Recall, and Sensitivity/Specificity curves, respectively, for all the classifiers. ROC is a probability curve, and the area under the ROC curve (AUC) represents the degree or measure of separability that the classifier achieves for the corresponding data. It tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting matches as matches, and no-matches as no-matches, with

Conclusions and future work

In the real world, a single user does not necessarily have identical profiles on multiple OSNs. The reasons thereof can vary from the heterogeneous nature of social network platforms to the social behavior of users. The purpose of this study is to develop a scheme for matching users’ profiles across OSNs by utilizing the implicit features extracted only from content attributes. The motivation is to avoid all those attributes that the OSN users might restrict from being publicly

CRediT authorship contribution statement

Deepesh Kumar Srivastava: Conceptualization, Methodology, Data curation, Writing - original draft, Visualization, Investigation, Validation, Writing - review & editing. Basav Roychoudhury: Conceptualization, Methodology, Data curation, Supervision, Writing - review & editing.

References (57)

  • Li Y. et al. A deep dive into user display names across social networks. Inform. Sci. (2018)
  • Lewis K. et al. Tastes, ties, and time: A new social network dataset using Facebook.com. Soc. Netw. (2008)
  • Zhao S. et al. Identity construction on Facebook: Digital empowerment in anchored relationships. Comput. Hum. Behav. (2008)
  • Yu X. et al. Modeling user intrinsic characteristic on social media for identity linkage. ACM Trans. Soc. Comput. (2018)
  • Nie Y. et al. Identifying users across social networks based on dynamic core interests. Neurocomputing (2016)
  • Li Y. et al. Matching user accounts based on user generated content across social networks. Future Gener. Comput. Syst. (2018)
  • Cao Q. et al. Exploring determinants of voting for the helpfulness of online user reviews: A text mining approach. Decis. Support Syst. (2011)
  • Gerber M.S. Predicting crime using Twitter and kernel density estimation. Decis. Support Syst. (2014)
  • Li N. et al. Using text mining and sentiment analysis for online forums hotspot detection and forecast. Decis. Support Syst. (2010)
  • Ackland R. et al. Online collective identity: The case of the environmental movement. Soc. Netw. (2011)
  • Yu D. et al. Constrained NMF-based semi-supervised learning for social media spammer detection. Knowl.-Based Syst. (2017)
  • Chen R.G. et al. Unsupervised cluster analyses of character networks in fiction: Community structure and centrality. Knowl.-Based Syst. (2019)
  • Peled O. et al. Matching entities across online social networks. Neurocomputing (2016)
  • Shen K. et al. A research framework on social networking sites usage: Critical review and theoretical extension.
  • Bartunov S. et al. Joint link-attribute user identity resolution in online social networks.
  • Maré S.J. et al. On the protection of social networks user’s information. Knowl.-Based Syst. (2013)
  • Kaur W. et al. Liking, sharing, commenting and reacting on facebook: User behaviors’ impact on sentiment intensity. Telemat. Inform. (2018)
  • Brems C. et al. Personal Branding on Twitter: How employed and freelance journalists stage themselves on social media. Digit. J. (2017)

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2020.105624.
