Leveraging the Data Lake: Current State and Challenges

  • Conference paper
Big Data Analytics and Knowledge Discovery (DaWaK 2019)

Abstract

Digital transformation produces massive amounts of heterogeneous data that challenge traditional data warehouse solutions in enterprises. To exploit these complex data for competitive advantage, the data lake recently emerged as a concept for more flexible and powerful data analytics. However, existing literature on data lakes is rather vague and incomplete, and the realization approaches proposed so far neither cover all aspects of data lakes nor provide a comprehensive design and realization strategy. Hence, enterprises face multiple challenges when building data lakes. To address these shortcomings, we investigate existing data lake literature and discuss various design and realization aspects for data lakes, such as governance or data models. Based on these insights, we identify challenges and research gaps concerning (1) data lake architecture, (2) data lake governance, and (3) a comprehensive strategy to realize data lakes. These challenges still need to be addressed to successfully leverage the data lake in practice.

Author information

Corresponding author

Correspondence to Corinna Giebler.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Giebler, C., Gröger, C., Hoos, E., Schwarz, H., Mitschang, B. (2019). Leveraging the Data Lake: Current State and Challenges. In: Ordonez, C., Song, IY., Anderst-Kotsis, G., Tjoa, A., Khalil, I. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2019. Lecture Notes in Computer Science, vol 11708. Springer, Cham. https://doi.org/10.1007/978-3-030-27520-4_13

  • DOI: https://doi.org/10.1007/978-3-030-27520-4_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-27519-8

  • Online ISBN: 978-3-030-27520-4

  • eBook Packages: Computer Science (R0)
