ABSTRACT
Data discovery is a multi-dimensional field encompassing information extraction, information retrieval, exploratory data analysis, visualization, and recommendations, among other things. Data marketplaces are platforms where users discover and shop for data products. These products are themselves produced by modern data stacks governed by frameworks such as Data Fabric. Knowledge graphs and semantic technologies already form a core part of Data Fabric and can therefore be leveraged for data discovery. In this tutorial, we’ll present state-of-the-art semantic technologies that enable the automation of various data discovery tasks. In particular, we’ll focus on data enrichment, dataset search and recommendation, and exploration within a dataset.
Index Terms
- Tutorial on Semantic Automation for Data Discovery