DOI: 10.1145/3472163.3472173
short-paper

Data Management in the Data Lake: A Systematic Mapping

Published: 07 September 2021

ABSTRACT

The computer science community is paying increasing attention to data because of its crucial role in analysis and prediction. Over the last decade, researchers have proposed many data containers, such as files, databases, data warehouses, cloud systems, and, most recently, data lakes. A data lake holds data in its native format, making it suitable for massive data prediction, particularly for real-time application development. Although the data lake is well adopted in industry, its acceptance by the research community is still in its infancy. This paper sheds light on existing work on performing analysis and prediction on data stored in data lakes. Our study reveals the data management steps that must be followed in a decision process and the requirements to be respected, namely curation, quality evaluation, privacy preservation, and prediction. This study aims to categorize and analyze the proposals related to each of these steps.


  • Published in

    ACM Other conferences
    IDEAS '21: Proceedings of the 25th International Database Engineering & Applications Symposium
    July 2021
    308 pages
    ISBN: 9781450389914
    DOI: 10.1145/3472163

    Copyright © 2021 ACM


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • short-paper
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 74 of 210 submissions, 35%