ABSTRACT
The computer science community is paying increasing attention to data because of its crucial role in analysis and prediction. Researchers have proposed many data containers, such as files, databases, data warehouses, cloud systems, and, within the last decade, data lakes. A data lake holds data in its native format, which makes it suitable for massive data prediction, particularly in real-time application development. Although data lakes are well adopted in industry, their acceptance by the research community is still in its infancy. This paper sheds light on existing work on analysis and prediction over data stored in data lakes. Our study identifies the data management steps that must be followed in a decision process and the requirements to be respected, namely curation, quality evaluation, privacy preservation, and prediction. This study aims to categorize and analyze the proposals related to each of these steps.