Metadata Systems for Data Lakes: Models and Features

  • Conference paper
New Trends in Databases and Information Systems (ADBIS 2019)

Abstract

Over the past decade, the data lake concept has emerged as an alternative to data warehouses for storing and analyzing big data. A data lake allows storing data without any predefined schema. Therefore, data querying and analysis depend on a metadata system that must be efficient and comprehensive. However, metadata management in data lakes remains an open issue, and criteria for evaluating its effectiveness are largely nonexistent.

In this paper, we introduce MEDAL, a generic, graph-based model for metadata management in data lakes. We also propose evaluation criteria for data lake metadata systems in the form of a list of expected features. Finally, we show that our approach is more comprehensive than existing metadata systems.
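To make the two ideas in the abstract concrete, the following is a minimal sketch and not MEDAL itself. It assumes a toy metadata graph in which data objects are nodes and relationships are labeled edges, and it assumes that a metadata system can be scored by the share of expected features it supports. All class, function, feature, and file names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataGraph:
    """Toy metadata graph: nodes are data objects, edges are labeled links."""
    nodes: dict = field(default_factory=dict)  # object id -> property dict
    edges: list = field(default_factory=list)  # (source id, label, target id)

    def add_object(self, obj_id, **properties):
        self.nodes[obj_id] = properties

    def link(self, source, label, target):
        # Refuse links to unknown objects so the graph stays consistent.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be registered first")
        self.edges.append((source, label, target))

    def neighbors(self, obj_id, label=None):
        """Objects one hop away from obj_id, optionally filtered by edge label."""
        return [t for s, l, t in self.edges
                if s == obj_id and (label is None or l == label)]


def feature_coverage(supported, expected):
    """Fraction of the expected features that a metadata system supports."""
    expected = set(expected)
    return len(expected & set(supported)) / len(expected)


# Hypothetical feature names, for illustration only (assumption).
EXPECTED = ["indexing", "versioning", "link_tracking", "usage_tracking"]

graph = MetadataGraph()
graph.add_object("sales.csv", kind="raw file")
graph.add_object("sales_clean.parquet", kind="derived table")
graph.link("sales.csv", "derived_into", "sales_clean.parquet")

print(graph.neighbors("sales.csv"))
print(feature_coverage(["indexing", "versioning"], EXPECTED))
```

Scoring by plain feature count treats every feature as equally important; a weighted score would be an equally reasonable choice under different assumptions.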

Acknowledgments

Part of the research presented in this article is funded by the Auvergne-Rhône-Alpes Region, as part of the AURA-PMI project that finances Pegdwendé Nicolas Sawadogo’s PhD thesis.

Corresponding author

Correspondence to Pegdwendé N. Sawadogo.

Copyright information

© 2019 Springer Nature Switzerland AG

Cite this paper

Sawadogo, P.N., Scholly, É., Favre, C., Ferey, É., Loudcher, S., Darmont, J. (2019). Metadata Systems for Data Lakes: Models and Features. In: Welzer, T., et al. New Trends in Databases and Information Systems. ADBIS 2019. Communications in Computer and Information Science, vol 1064. Springer, Cham. https://doi.org/10.1007/978-3-030-30278-8_43

  • DOI: https://doi.org/10.1007/978-3-030-30278-8_43

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-30277-1

  • Online ISBN: 978-3-030-30278-8

  • eBook Packages: Computer Science, Computer Science (R0)
