short-paper

Data Management in the Data Lake: A Systematic Mapping

Authors:
Firas Zouari, Universite Jean Moulin (Lyon III), France
Nadia Kabachi, Universite Claude Bernard (Lyon I), France
Khouloud Boukadi, Universite de Sfax, Tunisia
Chirine Ghedira Guegan, Universite Jean Moulin (Lyon III), France

IDEAS '21: Proceedings of the 25th International Database Engineering & Applications Symposium, July 2021, Pages 280–284. https://doi.org/10.1145/3472163.3472173

Published: 07 September 2021

ABSTRACT

The computer science community is paying increasing attention to data because of its crucial role in analysis and prediction. Researchers have proposed many data containers, such as files, databases, data warehouses, cloud systems, and, in the last decade, data lakes. A data lake holds data in its native format, which makes it suitable for large-scale data prediction, particularly in real-time application development. Although the data lake is well adopted in industry, its acceptance by the research community is still in its infancy. This paper sheds light on existing work on analysis and prediction over data stored in data lakes. Our study identifies the data management steps that a decision process needs to follow and the requirements it must respect, namely curation, quality evaluation, privacy preservation, and prediction. The study aims to categorize and analyze the proposals related to each of these steps.

Published in
IDEAS '21: Proceedings of the 25th International Database Engineering & Applications Symposium
July 2021
308 pages
ISBN: 9781450389914
DOI: 10.1145/3472163
Editors:
Bipin C. Desai, Concordia University, Canada
Jeffrey Ullman, Stanford University, USA
Richard McClatchey, Univ. Of West England, U.K
Motomichi Toyoma, Keio University, Japan
Copyright © 2021 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Data lake
Data management
Systematic mapping
Qualifiers
- short-paper
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 74 of 210 submissions, 35%