Words are important: A textual content based identity resolution scheme across multiple online social networks☆
Introduction
The immensely popular online social networks (OSNs) have provided analysts with a simple but rich source of data about individuals and society. These networks capture information about individuals’ online behavior, wherein they talk, chat, and respond on a plethora of topics and issues [1], as well as their real-world behavior, such as shared locations, check-ins, photos, etc. Diverse OSNs employ distinct means to engage individuals and fulfill their requirements [2]. For instance, people use Facebook to create a network of their acquaintances, LinkedIn to extend their professional contacts and networks, and Twitter to share information. On these OSNs of interest, an individual registers his online identity, which includes a set of attributes that portray him uniquely within the given network, distinct from the other users. Such an online identity, while enabling unique identification of users, helps people on the network to connect and interact with each other. A user’s online identity includes profile information like name, location, education, etc., network information containing the links to people he is connected to, and content information made up of the posts created by him or those shared by or with him. The OSNs, being privy to all these data, can allow for better overall understanding and profiling of their users. For example, the methods for community detection in a social network can be improved by meaningful integration of this publicly available information. Similarly, the problem of link prediction, which can also be extended to inferring missing links in social networks, can be solved by utilizing user profile data. Marketers can harness these publicly available data to understand their clients better and extend appropriate services for better customer satisfaction.
In the literature, the process of identifying and coupling the identities of a person across different OSNs, given his identity on one social network, is termed ‘Identity Resolution across OSNs’ [3]. However, the heterogeneity of any given user’s profiles across the networks poses a great challenge to this endeavor. Due to increasing awareness about privacy, many individuals either avoid or restrict disclosing some of their information [4], [5]; this leads to incomplete and sometimes even contradictory data for the identity resolution methods to tackle. This often results in dissimilar online personalities for the same person scattered across the web, with no explicit connection directing one to the other.
There has been a lot of interest in the research community as well as the industry in the potential benefits of matching an individual’s online identity across multiple OSNs. Business organizations, political parties, and non-profit institutions utilize social media data to understand sentiments towards their products, events, ideologies, or institutions. To get easy access to such data, these entities create accounts on these OSN platforms and request individuals to ‘follow’ or ‘like’ their accounts, and also to share their feedback on these accounts. A consistent social media strategy allows these entities to develop a social audience. One measure to evaluate the success of such an endeavor is the number of users discussing or liking their offerings, and the sentiments expressed therein. However, with these entities setting up accounts on multiple social networks, it is difficult to estimate their exact social audience. A single individual can contribute to the same activity through his multiple OSN accounts; for instance, he may follow the ‘Netflix India’ official account on Twitter using his Twitter account, and like the ‘Netflix India’ account on Facebook through his Facebook account. Identity resolution of all their social media audience can allow these entities to estimate the actual number of individuals involved. Similarly, most e-commerce sites use a customer’s prior online buying behavior to offer customized recommendations. Better recommendations can be made based on the customer’s interests, likes and dislikes, purchasing capacity, etc., captured through his activity on various OSNs. Instagram images posted by an individual can highlight the places he has visited or would love to visit; Foursquare check-ins, tips, and reviews provide ideas on the cuisine an individual likes; a LinkedIn profile can provide an insight into the purchasing capacity of the individual, and so on.
Thus, the information about an individual is mostly incomplete within the boundary of any single social platform. Matching identities of these online users across diverse social media platforms is therefore imperative to enable comprehensive profiling for viable personalization and targeting. Security practitioners may also benefit from such identity resolution; they often need to identify a person’s characteristics and detect the presence of any deceitful traits in his public profile. Linking an individual’s multiple online identities can immensely help in such endeavors.
In the past few years, many researchers and analysts have successfully utilized structured and unstructured social media data to develop predictive models and thereby extract meaningful information [6], [7], [8]. A large proportion of the current identity matching techniques involve two important phases: feature extraction and model construction. In the feature extraction stage, features are first extracted for a few users from their corresponding profiles. Based on an analysis of similarities amongst these features extracted from users’ profiles, one can ascertain whether the profiles belong to the same individual or not. To compute the similarities between extracted features like words and numbers, one can use string and number matching algorithms, and the similarity scores so obtained can be normalized such that 0 equates to no similarity and 1 to an exact match. The extracted features are then utilized for model development, wherein a supervised, semi-supervised, or unsupervised model is trained. These trained models are further utilized to ascertain whether the identities match for a new set of data. Most current studies on identity resolution techniques focus mainly on the profile information (username, name, education, location, etc.) of users, and overlook the content information. It may be noted that while much of the desired profile attribute information about an individual may not be available due to privacy concerns [9], such concerns do not usually extend to the content attribute, thereby making the posts, tweets, and URLs, which manifest the interests, views, and writing style of users, easily accessible to researchers. Users’ content information (e.g., posts, tweets, and URLs), however diverse it may be across the OSNs, can play an important role in the process of identity matching. For example, users show their interests or views on a topic via posts or tweets. This information can help in matching an individual’s identity across the OSNs and thereby increase the accuracy of the identity resolution frameworks.
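The similarity computation described above can be sketched as follows. This is only an illustrative sketch, not the scheme's actual implementation: the helper names `string_similarity` and `number_similarity` are hypothetical, Python's standard `difflib.SequenceMatcher` stands in for whichever string matching algorithm is actually used, and scores are assumed to be normalized to the range [0, 1] with 1 denoting an exact match and 0 denoting no similarity.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]; 1 means the strings match exactly (ignoring case)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def number_similarity(x: float, y: float) -> float:
    """Return a score in [0, 1]; 1 means the two numbers are equal."""
    largest = max(abs(x), abs(y))
    if largest == 0:
        # Both values are zero, so they match exactly.
        return 1.0
    # Relative difference, clipped so the score never drops below 0.
    return max(0.0, 1.0 - abs(x - y) / largest)
```

Scores of this kind, computed for each pair of profile attributes, could then form the feature vector passed to a classifier in the model construction phase.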
As stated above, while privacy concerns may restrict the availability of profile attributes and the inconsistency therein may limit their usefulness, the same is not true for content attributes. This being the motivation, the current study explores the use of content attributes in matching users’ profiles across OSNs and presents a scheme wherein the important features extracted from posts and tweets are successfully used to match the identities of users across OSNs. This study extracts implicit features from the content attributes of OSN users, like parts of speech, symbols, emoticons, common high-frequency words, etc., for the purpose of identity resolution across OSNs. No research, to the best of our knowledge, has tried to utilize these sets of implicit features extracted from the content attributes of OSN users to link their identities across OSNs. Thus, this paper contributes to the existing literature by introducing a set of implicit features extracted from the content information of users that can be used for identity resolution, and also presents a way to exploit these features towards that endeavor. The analysis presented herein confirms that the use of such features in the matching process can enhance the accuracy of identity resolution frameworks, especially in light of the fact that unique identifiers like a user’s mobile number, email, or social security number are not always present in the user’s profile. This study also indicates that the content attributes on their own, without any consideration of the profile or network attributes, can match identities of social network users with substantial accuracy. This approach thus has the promise of liberating analysts from the issue of hidden or missing profile attribute values due to privacy concerns while working towards identity resolution.
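To make the notion of implicit content features concrete, the following minimal sketch counts emoticons, symbols, and high-frequency words in a user's posts. It is an assumption-laden illustration rather than the paper's feature extractor: the function name `content_features`, the regular expressions, and the `top_k` parameter are all hypothetical, and part-of-speech tagging is omitted because it would require an external tagger.

```python
import re
from collections import Counter

# Hypothetical patterns chosen for illustration only.
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]")   # e.g. :) ;-) :D
SYMBOL_RE = re.compile(r"[#@!?$%&]")
WORD_RE = re.compile(r"[a-z']+")


def content_features(posts, top_k=3):
    """Extract simple implicit features from a list of post strings."""
    text = " ".join(posts)
    emoticons = Counter(EMOTICON_RE.findall(text))
    # Remove emoticons first so their characters are not counted elsewhere.
    stripped = EMOTICON_RE.sub(" ", text)
    symbols = Counter(SYMBOL_RE.findall(stripped))
    words = Counter(WORD_RE.findall(stripped.lower()))
    return {
        "emoticons": emoticons,
        "symbols": symbols,
        "top_words": [word for word, _ in words.most_common(top_k)],
    }
```

Feature dictionaries of this kind, built separately for a user's posts on each network, could then be compared to judge whether two accounts are likely to belong to the same person.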
The rest of this paper is organized as follows: related work from the literature is presented in Section 2; the description of the methodology and the proposed algorithm are explained in Section 3; Section 4 elucidates the evaluation experiments followed by model development; results are discussed in Section 5; and Section 6 concludes the paper.
Section snippets
Related work
Many individuals communicate over social media platforms using messages, posts, tweets, etc. [10], [11], [12]. Through these, individuals freely share their views and feelings with their network of friends, a bigger group of acquaintances, or even the entire online world. OSNs have accumulated an exceptional amount of written language, tremendous amounts of which are freely accessible [13]. Twitter users alone compose around 500 million messages each day [14]. The written language thus gathered
Methodology
The objective of this proposed scheme for identity resolution is to predict the most probable (MP) profile of a user ‘X’ from Facebook, given the Twitter profile of the same user ‘X’, assuming that the user ‘X’ has accounts in both OSNs. We consider Twitter as the source (S) OSN from which we take a given profile, and Facebook as the target (T) OSN wherein we identify the corresponding matching profile. Towards this, the method uses part-of-speech tagging (noun, proper noun, verb, etc.),
Experimental setup
This section elucidates the experimental setup including details of the dataset and evaluation metrics used for the analysis.
Results and discussion
Fig. 9, Fig. 10, and Fig. 11 show the comparison of the area under ROC, Precision/Recall, and Sensitivity/Specificity curves, respectively, for all the classifiers. ROC is a probability curve, and the area under the ROC curve (AUC) represents the degree or measure of separability that the classifier achieves for the corresponding data. It indicates how well the model can distinguish between classes. The higher the AUC, the better the model is at predicting matches as matches, and no-matches as no-matches, with
Conclusions and future work
In the real world, it is not always necessary that a single user has exactly similar profiles on multiple OSNs. The reasons thereof can vary from the heterogeneous nature of social network platforms to the social behavior of users. The purpose of this study is to develop a scheme for users’ profiles matching across OSNs, by utilizing the implicit features extracted only from content attributes. The motivation is to avoid all those attributes that the OSN users might restrict from being publicly
CRediT authorship contribution statement
Deepesh Kumar Srivastava: Conceptualization, Methodology, Data curation, Writing - original draft, Visualization, Investigation, Validation, Writing - review & editing. Basav Roychoudhury: Conceptualization, Methodology, Data curation, Supervision, Writing - review & editing.
References (57)
More than words: Social networks’ text mining for consumer brand sentiments. Expert Syst. Appl. (2013)
An empirical analysis of users’ privacy disclosure behaviors on social network sites. Inf. Manag. (2015)
Community extraction and visualization in social networks applied to Twitter. Inform. Sci. (2018)
Improving user recommendation by extracting social topics and interest topics of users in uni-directional social networks. Knowl.-Based Syst. (2018)
Efficient incremental dynamic link prediction algorithms in social network. Knowl.-Based Syst. (2017)
Factors mediating disclosure in social network sites. Comput. Hum. Behav. (2011)
An empirical study of the factors affecting social network service use. Comput. Hum. Behav. (2010)
Who is talking? An ontology-based opinion leader identification framework for word-of-mouth marketing in online social blogs. Decis. Support Syst. (2011)
Frameworks for entity matching: A comparison. Data Knowl. Eng. (2010)
Social big data: Recent achievements and new challenges. Inf. Fusion (2016)
A deep dive into user display names across social networks. Inform. Sci.
Tastes, ties, and time: A new social network dataset using Facebook.com. Soc. Netw.
Identity construction on Facebook: Digital empowerment in anchored relationships. Comput. Hum. Behav.
Modeling user intrinsic characteristic on social media for identity linkage. ACM Trans. Soc. Comput.
Identifying users across social networks based on dynamic core interests. Neurocomputing
Matching user accounts based on user generated content across social networks. Future Gener. Comput. Syst.
Exploring determinants of voting for the helpfulness of online user reviews: A text mining approach. Decis. Support Syst.
Predicting crime using Twitter and kernel density estimation. Decis. Support Syst.
Using text mining and sentiment analysis for online forums hotspot detection and forecast. Decis. Support Syst.
Online collective identity: The case of the environmental movement. Soc. Netw.
Constrained NMF-based semi-supervised learning for social media spammer detection. Knowl.-Based Syst.
Unsupervised cluster analyses of character networks in fiction: Community structure and centrality. Knowl.-Based Syst.
Matching entities across online social networks. Neurocomputing
A research framework on social networking sites usage: Critical review and theoretical extension.
Joint link-attribute user identity resolution in online social networks.
On the protection of social networks user’s information. Knowl.-Based Syst.
Liking, sharing, commenting and reacting on Facebook: User behaviors’ impact on sentiment intensity. Telemat. Inform.
Personal Branding on Twitter: How employed and freelance journalists stage themselves on social media. Digit. J.
☆ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2020.105624.