Abstract
Digital transformation produces massive amounts of heterogeneous data that challenge traditional data warehouse solutions in enterprises. To exploit these complex data for competitive advantage, the data lake has recently emerged as a concept for more flexible and powerful data analytics. However, existing literature on data lakes is rather vague and incomplete, and the various realization approaches that have been proposed neither cover all aspects of data lakes nor provide a comprehensive design and realization strategy. Hence, enterprises face multiple challenges when building data lakes. To address these shortcomings, we investigate existing data lake literature and discuss various design and realization aspects for data lakes, such as governance or data models. Based on these insights, we identify challenges and research gaps concerning (1) data lake architecture, (2) data lake governance, and (3) a comprehensive strategy to realize data lakes. These challenges still need to be addressed to successfully leverage the data lake in practice.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Giebler, C., Gröger, C., Hoos, E., Schwarz, H., Mitschang, B. (2019). Leveraging the Data Lake: Current State and Challenges. In: Ordonez, C., Song, I.-Y., Anderst-Kotsis, G., Tjoa, A., Khalil, I. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2019. Lecture Notes in Computer Science, vol 11708. Springer, Cham. https://doi.org/10.1007/978-3-030-27520-4_13
DOI: https://doi.org/10.1007/978-3-030-27520-4_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27519-8
Online ISBN: 978-3-030-27520-4
eBook Packages: Computer Science, Computer Science (R0)