PRESS: A personalised approach for mining top-k groups of objects with subspace similarity

https://doi.org/10.1016/j.datak.2020.101833

Abstract

Personalised analytics is a powerful technology that can be used to improve the career, lifestyle, and health of individuals by providing them with an in-depth analysis of their characteristics as compared to other people. Existing research has often focused on mining general patterns or clusters, but without the facility for customisation to an individual’s needs. It is challenging to adapt such approaches to the personalised case, due to the high computational overhead they require for discovering patterns that are good across an entire dataset, rather than with respect to an individual. In this paper, we tackle the challenge of personalised pattern mining and propose a query-driven approach to mine objects with subspace similarity. Given a query object in a categorical dataset, our proposed algorithm, PRESS (Personalised Subspace Similarity), determines the top-k groups of objects, where each group has high similarity to the query for some particular subspace. We evaluate the efficiency and effectiveness of our approach on both synthetic and real datasets.

Introduction

The study of similarities among people (or, more generally, among objects) enables us to gain a broader and deeper understanding of them. It can provide individuals with personalised feedback and guidance for building their career, identifying their role in social networks and diagnosing rare or unusual diseases. For example, a newly graduated student might select a career pathway by identifying employees with a similar profile. Alternatively, a personalised property recommendation system could help a first-time home buyer find a villa unit with a spacious lounge near a city and a supermarket, while remaining flexible on having an open yard or a car space. In this paper, we aim to identify groups of objects that share similarities in subspaces (subsets of attributes) with respect to a given query object in a categorical dataset. Our approach is unsupervised, as it does not require any prior knowledge of the class/label information of the objects. The inherent groupings among objects are discovered by exploiting the subspace similarity between the objects and the query.

Most existing query ranking approaches [1], [2], [3] address similarity search over a fixed subspace that is supplied by the user. These works assume that the query user has some knowledge of the subspace he/she is looking for. However, in some scenarios the user may have no domain knowledge, or little to no preference about the subspace (e.g., a non-expert laptop buyer), and simply wants to identify the objects/persons similar to a given query object/person. It is very unusual to find an object that matches all the characteristics/features of the query object. Hence, subspace similarity search is essential, especially in the presence of a large number of attributes. Nevertheless, it is computationally very expensive to enumerate all possible subspaces, since their number is exponential in the number of attributes. Our proposed approach is suitable for these cases, as we dynamically determine the groups of objects, with their corresponding subspaces, that are similar to the given user query.
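To make the cost of exhaustive enumeration concrete, the following minimal sketch counts and lists the non-empty subspaces of a small attribute set. It is a brute-force illustration only, not part of PRESS; the function names count_subspaces and enumerate_subspaces are hypothetical.

```python
from itertools import combinations
from typing import Iterator, Sequence, Tuple


def count_subspaces(m: int) -> int:
    """Number of non-empty subsets of m attributes, i.e. 2^m - 1."""
    return 2 ** m - 1


def enumerate_subspaces(attributes: Sequence[str]) -> Iterator[Tuple[str, ...]]:
    """Yield every non-empty subspace, smallest first (brute force)."""
    for size in range(1, len(attributes) + 1):
        yield from combinations(attributes, size)


if __name__ == "__main__":
    # The count doubles with every additional attribute.
    for m in (6, 20, 40):
        print(m, count_subspaces(m))
```

With only six attributes there are 63 candidate subspaces, but the count doubles with each added attribute, which is why a naive scan over all subspaces does not scale.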

We present two scenarios illustrating the significance of the proposed research problem. The first scenario illustrates the challenges of providing treatment to a patient with unusual symptoms. Our proposed approach can play an effective role in this case by providing insights about the query patient’s characteristics. In the second scenario, we describe the difficulties faced by a novice customer with flexible needs in the electronics or real estate market. We then discuss how inexperienced customers can be guided to make their own decisions by using our proposed approach. We label the first and second scenarios as Query Insights and Query Exploration, respectively, throughout the paper.

Scenario-1 (Query Insights): Consider the sample medical records of eight patients (in Table 1) and a female patient named Lisa, who recovered from a broken arm three months ago but suffers from high blood pressure. Lisa also has light chest pain with a dry cough. The medical practitioners find it difficult to diagnose Lisa due to the lack of a clear medical practice guideline for her treatment. In such a scenario, treatment options can be provided to the medical practitioners by identifying patients with health problems similar to Lisa’s. In the medical literature, this decision-making scenario is known as the “green-button”, referring to the wish of a clinician to have available a magic green button that, once pressed, could identify cohorts of patients who are similar to an individual patient who is proving difficult to diagnose [4].

In this scenario, our approach would suggest a group of two female patients, Mary and Jeny, who are suffering from chest pain and dry cough, as most similar to Lisa. However, we would also recommend a second group that adds another female patient, Rose, to Mary and Jeny by relaxing the value of the pain attribute (Rose has muscle pain, in contrast to Lisa’s chest pain). Rose could be a relevant match for Lisa because Lisa’s chest pain might be an effect of the accident that injured her arm. Hence, the practitioners can be guided with more information by identifying “groups of patients” that combine the matching and non-matching symptoms with respect to the given query. For instance, if the practitioners are aware of the medicines used for the identified “groups of patients”, they can apply this knowledge when diagnosing the query patient. In this example, Lisa can be treated according to the medicines, i.e., ‘M-A’ and ‘M-D’, which were prescribed to Mary, Jeny and Rose, respectively. Sections 5.3.3 (Query insights) and 5.4 (AMiner case study) provide more discussion about achieving “Query Insights” on two real datasets, i.e., Laptop and Academic Citation Network, respectively.

Scenario-2 (Query Exploration): Our approach is also applicable to first-time home buyers, e.g., those buying homes on Zillow, who find it difficult to determine suitable homes with their desired features. In this scenario, the user might have strict requirements for some attributes while being flexible on others. Our approach is able to identify “groups of properties” satisfying all such strict requirements, where each group is the best option for the user within a given subspace of flexible attributes. Moreover, non-expert users of electronic products, e.g., laptops and mobiles, may find our approach helpful. We construct the reference query from a sample product supplied by the user and provide them with a personalised ranked list of products with interpretable subsets of features, which can help customers decide on their most desirable products. Section 5.3.1 provides a real-world example of this scenario on the Laptop dataset.

To the best of our knowledge, we are the first to introduce a query-oriented approach for mining object clusters in subspaces for categorical data. Our approach requires no knowledge about the type/class of the objects in the dataset while identifying the clusters of objects; hence, it is unsupervised. Existing (unsupervised) query ranking techniques [1], [2], [3] identify objects from the whole dataset for a fixed subspace, whereas our approach dynamically determines the groups and subspaces by maximising the similarity with respect to the query characteristics. Inlying or outlying aspects mining approaches [5], [6], [7] also focus on a query, but they identify a subspace that makes the query most inlying or outlying with respect to all objects in the dataset. In contrast, we identify the subset of attributes for a cluster of objects such that the objects become most similar (inlying) to a specific query object.

Subspace clustering [8], [9], [10] is closely related but not directly applicable, as it identifies clusters of objects such that the intra-cluster similarity is maximised while the inter-cluster similarity is minimised. In contrast, our approach groups objects such that they share similar characteristics in subspaces with respect to a particular query. Thus, if one were to apply subspace clustering to solve our proposed problem, it would likely result in poor-quality clusters in terms of similarity with the given query. Bi-clustering [11], [12] is another related research topic, where simultaneous clustering of the rows/objects and columns/attributes of a data matrix is performed. However, this technique is also not suitable for our task, as it mines general patterns across the matrix instead of discovering patterns significant for a particular (query) object.

Our proposed algorithm efficiently enumerates the subspaces to mine personalised clusters, where the objects are homogeneous with respect to the query subspaces. We exploit the strictness (if required by the user) and flexibility of the query subspaces, and effectively reduce the search space by eliminating redundant groups that can never be part of the answer. Our pruning strategies adopt a ‘top-k’ enumeration approach. We further provide a guideline for determining divergent objects from the discovered personalised groups, to make the results more user friendly. As there is no ground truth available for the query-focused object-clusters, we cannot use cluster validity metrics (e.g., NMI, purity and F-score) for evaluating the performance of PRESS. Instead, we conduct qualitative case studies to assess the effectiveness and efficiency of our approach. The key contributions can be summarised as follows:

  • Novelty: We define a novel personalisation framework, where we aim to maximise subspace similarity for a given query and a group of objects.

  • Effectiveness: We develop effective strategies to mine subspaces for datasets possessing categorical attributes.

  • Scalability: Our experimental study demonstrates that our proposed algorithm is highly scalable with respect to the number of attributes and objects in the dataset.

The rest of the paper is organised as follows. Section 2 reviews existing works related to this problem. We introduce new terminologies and formally define the problem in Section 3. In Section 4, we propose our algorithms to solve the problem and in Section 5, we present extensive experiments and perform case-studies to verify the effectiveness of our algorithms. Section 6 concludes the paper with future research directions.

Section snippets

Query oriented similarity search

kNN classification or regression techniques [13], [14], [15], [16] predict the class (the value of the target attribute) of a given test/query object using its k nearest neighbours. However, in our approach we neither intend to predict the class of the given query object, nor do we use the class information of the objects. We determine the inherent groupings of objects in an unsupervised manner such that the objects in a group share the highest similarity with the query object.

Ranking

Terminologies and problem definition

Let $O_f = \{o_1, o_2, \ldots, o_n\}$ be a set of $n$ objects and $A_f = \{a_1, a_2, \ldots, a_m\}$ be a set of $m$ categorical attributes. Each object $o \in O_f$, including the query $q$, consists of $m$ attributes. For each attribute $a \in A_f$, there are $t$ possible categorical values, e.g., attribute cough in Table 1 has $t = 3$ values. In Table 1, the query $q$ is {gender = female, age = teenager, high blood-pressure = yes, pain = chest, cough = dry, skin-condition = normal}.
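As a minimal sketch of these definitions, a categorical object can be stored as a mapping from attribute names to values, and the attributes on which an object agrees with the query can be read off directly. The query below is the one stated in the text; the three objects are hypothetical records invented for illustration and are not the contents of Table 1. The helper name matched_attributes is likewise an assumption.

```python
from typing import Dict, FrozenSet

Record = Dict[str, str]

# Query q as stated in the text.
QUERY: Record = {
    "gender": "female",
    "age": "teenager",
    "high blood-pressure": "yes",
    "pain": "chest",
    "cough": "dry",
    "skin-condition": "normal",
}

# Hypothetical objects for illustration only (NOT the records of Table 1).
OBJECTS: Dict[str, Record] = {
    "o1": {"gender": "female", "age": "adult", "high blood-pressure": "yes",
           "pain": "chest", "cough": "dry", "skin-condition": "normal"},
    "o2": {"gender": "male", "age": "teenager", "high blood-pressure": "no",
           "pain": "none", "cough": "wet", "skin-condition": "rash"},
    "o3": {"gender": "female", "age": "teenager", "high blood-pressure": "no",
           "pain": "muscle", "cough": "dry", "skin-condition": "normal"},
}


def matched_attributes(obj: Record, query: Record) -> FrozenSet[str]:
    """Return the attributes on which obj takes the same value as the query."""
    return frozenset(a for a, v in query.items() if obj.get(a) == v)


if __name__ == "__main__":
    for name, obj in OBJECTS.items():
        print(name, sorted(matched_attributes(obj, QUERY)))
```

Each object's set of matched attributes is the largest subspace in which it is identical to the query, which is the raw material for grouping objects by subspace.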

Assume $v_a$ represents the value of attribute $a$. For an attribute set A

The framework

We present a framework of our approach in Algorithm 1 to mine k groups of objects illustrating similarity with the query q in their corresponding subspaces. There are three key steps.

    Step 1:

    We determine the groups of objects with respect to the completely matched attributes ($A_c$) given $q$. Function FindCG retrieves the non-redundant groups of objects in terms of $A_c$ using the closure mining strategy (Lines 1.1–1.2) as described in Section 4.2.

    Step 2:

    The candidate groups with both completely ($A_c$)
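The idea behind Step 1 can be sketched as follows, assuming each object has already been reduced to the set of attributes on which it matches the query exactly. A subspace is kept only if it is closed, i.e., it equals the intersection of the matched sets of all objects in its group, so no two kept subspaces describe the same group. This is a generic closure computation written for illustration, not the authors' FindCG; the name closed_groups and the sample data are hypothetical.

```python
from typing import Dict, FrozenSet, Set

Subspace = FrozenSet[str]

# Hypothetical matched-attribute sets (attributes equal to the query's value).
MATCHED: Dict[str, Subspace] = {
    "o1": frozenset({"gender", "pain", "cough"}),
    "o2": frozenset({"gender", "pain", "cough"}),
    "o3": frozenset({"gender", "cough"}),
    "o4": frozenset({"age"}),
}


def closed_groups(matched: Dict[str, Subspace]) -> Dict[Subspace, FrozenSet[str]]:
    """Map each non-empty closed subspace to the objects matching it fully."""
    closed: Set[Subspace] = set()
    for attrs in matched.values():
        # Closed subspaces are exactly the intersections of matched sets,
        # so extend the family with attrs and its intersections so far.
        closed |= {attrs} | {c & attrs for c in closed}
    closed.discard(frozenset())  # the empty subspace carries no similarity
    return {
        s: frozenset(o for o, m in matched.items() if s <= m)
        for s in closed
    }


if __name__ == "__main__":
    groups = closed_groups(MATCHED)
    # Show larger subspaces first.
    for subspace, members in sorted(groups.items(), key=lambda kv: -len(kv[0])):
        print(sorted(subspace), sorted(members))
```

In this toy input, the subspace {gender} is not reported because it selects the same objects as the larger subspace {gender, cough}; dropping such redundant subspaces is what keeps the candidate set small. The sketch does not rank groups or handle partially matched attributes.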

Experimental results

Since PRESS mines query-specific object-clusters (not the general object-clusters in the literature) and there is no ground truth available for these personalised clusters, we are unable to apply cluster validation techniques, e.g., Normalised Mutual Information (NMI), purity, F-measure and precision/recall, to evaluate the performance of PRESS. Instead, we perform qualitative case studies on two different datasets, i.e., AMiner (Academic Citation Network) and

Conclusions and future work

We introduce a novel problem: finding objects with high subspace similarity with respect to a query. We propose a query-driven approach, PRESS, for identifying the top-k most similar groups of objects and subspaces in categorical datasets. We mine closures to enumerate the query subspaces and develop dynamic pruning strategies to address the problem efficiently. We also provide a summarisation strategy that gives the user more interpretable results, where we offer the most divergent set of

CRediT authorship contribution statement

Tahrima Hashem: Conceptualization, Methodology, Software, Formal analysis, Validation, Investigation, Data curation, Writing - original draft, Visualization. Lida Rashidi: Conceptualization, Methodology, Formal analysis, Validation, Investigation, Writing - review & editing, Supervision. Lars Kulik: Conceptualization, Resources, Writing - review & editing, Validation, Investigation, Supervision, Project administration, Funding acquisition. James Bailey: Conceptualization, Resources, Writing -

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This research work is supported by the Australian Research Council (ARC) Discovery Grant DP170102472.

Tahrima Hashem is a Ph.D. student in the Department of Computing and Information Systems at the University of Melbourne. She is also a lecturer in the Department of Computer Science and Engineering, University of Dhaka, Bangladesh. Currently, she is on study leave for doing Ph.D. Her research interests include pattern mining, similarity search and personalised data mining.

References (40)

  • R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan, Automatic subspace clustering of high dimensional data for data...
  • K. Kailing, H.-P. Kriegel, P. Kröger, Density-connected subspace clustering for high-dimensional data, in: SDM, 2004,...
  • I. Assent, R. Krieger, E. Müller, T. Seidl, DUSC: Dimensionality unbiased subspace clustering, in: Seventh IEEE...
  • R. Martinez, C. Pasquier, N. Pasquier, GenMiner: Mining informative association rules from genomic data, in: BIBM, 2007,...
  • R. Henriques et al., BicPAM: pattern-based biclustering for biomedical data analysis, Algorithms Mol. Biol. (2014)
  • S. Zhang et al., Learning k for kNN classification, ACM Trans. Intell. Syst. Technol. (2017)
  • X. Wu et al., Top 10 algorithms in data mining, Knowl. Inf. Syst. (2008)
  • P. Tan et al., Introduction to Data Mining (2005)
  • F.H. Al-Qahtani, S.F. Crone, Multivariate k-nearest neighbour regression for time series data — A novel algorithm for...
  • D. Xin, J. Han, H. Cheng, X. Li, Answering Top-k queries with multi-dimensional selections: The ranking cube approach,...

    Lida Rashidi is a postdoctoral research fellow in the Department of Computing and Information Systems at the University of Melbourne. She received her Ph.D. in computer science from Melbourne University in 2017. Her main research interests are graph theory, social network analysis, and anomaly detection.

    Lars Kulik is a Professor in the School of Computing and Information Systems at the University of Melbourne. His research focuses on computational approaches for protecting privacy, efficient algorithms for intelligent traffic systems, spatiotemporal data mining, personalised data analytics, and algorithms for mobile and wearable computing environments.

    James Bailey is a Professor in the School of Computing and Information Systems at the University of Melbourne. He was an Australian Research Council Future Fellow from 2012–2015. His research interests are in the area of data mining and machine learning, particularly perturbation analysis, clustering, correlation assessment and anomaly detection and explanation. His research has been translated to systems in the area of health, partnering with both hospitals (real time medical emergency prediction for patients) and industry (cognitive systems for immersive simulation training). He has received the best paper award at conferences such as IEEE ICDM, PAKDD and SIAM SDM. He was co-PC Chair of PAKDD 2016 and co-General Chair of ACM CIKM 2015. He is a member of several Editorial Boards, including ACM Transactions on Data Science, IEEE Transactions on Big Data, and Knowledge and Information Systems.
