Abstract
Smart processing of Big Data has recently emerged as a field that poses numerous challenges regarding how to handle multiple heterogeneous data sources that produce massive amounts of structured, semi-structured and unstructured data. One solution to this problem is to manage this fusion of disparate data sources through Data Lakes. Data Lakes, though, suffer from the lack of a disciplined approach to collecting, storing and retrieving data to support predictive and prescriptive analytics. This chapter tackles this challenge by introducing a novel standardization framework for managing data in Data Lakes that mainly combines the 5Vs Big Data characteristics and blueprint ontologies. It organizes a Data Lake using a ponds architecture and describes a metadata semantic enrichment mechanism that enables fast storage and efficient retrieval. The mechanism supports Visual Querying and offers increased security via Blockchain and Non-Fungible Tokens. The proposed approach is compared against other known metadata systems using a set of functional properties, with very encouraging results.
Acknowledgement
This chapter is part of the outcomes of the CSA Twinning project DESTINI. This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No. 945357. Special thanks are due to Stelios Mappouras for his valuable suggestions on the Blockchain application development.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Pingos, M., Andreou, A.S. (2023). Exploiting Metadata Semantics in Data Lakes Using Blueprints. In: Kaindl, H., Mannion, M., Maciaszek, L.A. (eds) Evaluation of Novel Approaches to Software Engineering. ENASE 2022. Communications in Computer and Information Science, vol 1829. Springer, Cham. https://doi.org/10.1007/978-3-031-36597-3_11
DOI: https://doi.org/10.1007/978-3-031-36597-3_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-36596-6
Online ISBN: 978-3-031-36597-3
eBook Packages: Computer Science, Computer Science (R0)