Abstract
An increasing number of applications are modeled and analyzed in network form, where nodes represent entities of interest and edges represent interactions or relationships between entities. Commonly, such relationship analysis tools assume homogeneity in both node type and edge type. Recent research has sought to redress the assumption of homogeneity and focused on mining heterogeneous information networks (HINs) where both nodes and edges can be of different types. Building on such efforts, in this work, we articulate a novel approach for mining relationships across entities in such networks while accounting for user preference over relationship type and interestingness metric. We formalize the problem as a top-k lightest paths problem, contextualized in a real-world communication network, and seek to find the k most interesting path instances matching the preferred relationship type. Our solution, PROphetic HEuristic Algorithm for Path Searching (PRO-HEAPS), leverages a combination of novel graph preprocessing techniques, well-designed heuristics and the venerable A* search algorithm. We run our algorithm on real-world large-scale graphs and show that our algorithm significantly outperforms a wide variety of baseline approaches with speedups as large as 100X.
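To make the problem statement concrete, the following is a minimal sketch of a top-k lightest paths search over a typed graph. It is not the PRO-HEAPS implementation: the function name, the graph encoding (`node_type`, `adj`), and the heuristic are all assumptions made for illustration. The heuristic here is simply the exact remaining cost, obtained by a backward pass over the pattern positions, which stands in for the preprocessing and heuristics of the paper. Paths are restricted to simple paths, which is also an assumption.

```python
import heapq
from itertools import count


def top_k_lightest_paths(node_type, adj, pattern, source, target, k):
    """Return up to k (cost, path) pairs, lightest first.

    node_type: dict mapping node -> type label (hypothetical encoding)
    adj:       dict mapping node -> list of (neighbor, non-negative weight)
    pattern:   list of type labels the path must match position by position
    """
    last = len(pattern) - 1
    if node_type[source] != pattern[0] or node_type[target] != pattern[last]:
        return []

    # Backward pass: h[i][v] is the lightest cost of completing the pattern
    # from node v at position i. Nodes that cannot reach the target are absent.
    inf = float("inf")
    h = [dict() for _ in pattern]
    h[last][target] = 0.0
    for i in range(last - 1, -1, -1):
        for v, t in node_type.items():
            if t != pattern[i]:
                continue
            best = min(
                (w + h[i + 1][u] for u, w in adj.get(v, []) if u in h[i + 1]),
                default=inf,
            )
            if best < inf:
                h[i][v] = best
    if source not in h[0]:
        return []

    # A*-style best-first search over partial paths, ordered by g + h.
    # Because h never overestimates, complete paths pop in cost order.
    tie = count()  # tie-breaker so the heap never compares paths
    heap = [(h[0][source], next(tie), 0.0, (source,))]
    results = []
    while heap and len(results) < k:
        _, _, g, path = heapq.heappop(heap)
        i = len(path) - 1
        if i == last:
            results.append((g, list(path)))
            continue
        for u, w in adj.get(path[-1], []):
            if u in path or u not in h[i + 1]:
                continue  # keep paths simple; prune dead ends
            heapq.heappush(heap, (g + w + h[i + 1][u], next(tie), g + w, path + (u,)))
    return results


# Toy communication network (made-up data): two people linked via phones or email.
node_type = {"a": "person", "b": "person", "p1": "phone",
             "p2": "phone", "p3": "phone", "e1": "email"}
adj = {
    "a": [("p1", 1), ("p2", 2), ("p3", 5), ("e1", 0.5)],
    "p1": [("b", 3)], "p2": [("b", 1)], "p3": [("b", 1)], "e1": [("b", 0.5)],
}
top2 = top_k_lightest_paths(node_type, adj, ["person", "phone", "person"], "a", "b", 2)
```

In this toy example the email path is the lightest overall, but it does not match the person-phone-person pattern and is never expanded, so only phone paths are returned.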
To widen the range of applications, we also extend PRO-HEAPS to (i) support relationship analysis between two groups of entities and (ii) allow the pattern path in the query to contain logical statements with the operators AND, OR, and NOT and the wildcard “.”. Experiments with this generalized version of PRO-HEAPS demonstrate that its advantage becomes even more pronounced in these general cases. Furthermore, we conduct a comprehensive analysis of how the performance of PRO-HEAPS varies with respect to various attributes of the input HIN. Finally, we present a case study that demonstrates valuable applications of our algorithm.
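One way to picture a pattern path with logical statements is to treat each position as a Boolean predicate over a node's type. The sketch below illustrates that idea only. The combinator names (`is_type`, `AND`, `OR`, `NOT`, `ANY`) and the `matches` helper are hypothetical, and the abstract does not specify the query syntax or how PRO-HEAPS evaluates such patterns.

```python
def is_type(label):
    """Predicate that accepts exactly one type label."""
    return lambda t: t == label


def AND(*preds):
    return lambda t: all(p(t) for p in preds)


def OR(*preds):
    return lambda t: any(p(t) for p in preds)


def NOT(pred):
    return lambda t: not pred(t)


# Wildcard ".": accepts a node of any type.
ANY = lambda t: True


def matches(path_types, pattern):
    """True if the sequence of node types satisfies the pattern position by position."""
    return len(path_types) == len(pattern) and all(
        pred(t) for pred, t in zip(pattern, path_types)
    )


# person -> (phone OR email) -> anything that is neither phone nor email -> any node
pattern = [
    is_type("person"),
    OR(is_type("phone"), is_type("email")),
    AND(NOT(is_type("phone")), NOT(is_type("email"))),
    ANY,
]
```

Under this encoding, a plain type pattern is the special case where every position is an `is_type` predicate, so a search procedure only needs to swap its equality check for a predicate call.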
Index Terms
- Prioritized Relationship Analysis in Heterogeneous Information Networks