Abstract
Over the past decade, the data lake concept has emerged as an alternative to data warehouses for storing and analyzing big data. A data lake allows data to be stored without any predefined schema. Data querying and analysis therefore depend on a metadata system that must be efficient and comprehensive. However, metadata management in data lakes remains an open issue, and criteria for evaluating its effectiveness are largely nonexistent.
In this paper, we introduce MEDAL, a generic, graph-based model for metadata management in data lakes. We also propose evaluation criteria for data lake metadata systems in the form of a list of expected features. Finally, we show that our approach is more comprehensive than existing metadata systems.
Acknowledgments
Part of the research presented in this article is funded by the Auvergne-Rhône-Alpes Region, as part of the AURA-PMI project that finances Pegdwendé Nicolas Sawadogo’s PhD thesis.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Sawadogo, P.N., Scholly, É., Favre, C., Ferey, É., Loudcher, S., Darmont, J. (2019). Metadata Systems for Data Lakes: Models and Features. In: Welzer, T., et al. New Trends in Databases and Information Systems. ADBIS 2019. Communications in Computer and Information Science, vol 1064. Springer, Cham. https://doi.org/10.1007/978-3-030-30278-8_43
DOI: https://doi.org/10.1007/978-3-030-30278-8_43
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30277-1
Online ISBN: 978-3-030-30278-8
eBook Packages: Computer Science, Computer Science (R0)