Exploiting Metadata Semantics in Data Lakes Using Blueprints

  • Conference paper
Evaluation of Novel Approaches to Software Engineering (ENASE 2022)

Abstract

Smart processing of Big Data has recently emerged as a field that poses several challenges concerning how to handle multiple heterogeneous data sources that produce massive amounts of structured, semi-structured and unstructured data. One solution to this problem is to manage this fusion of disparate data sources through Data Lakes. Data Lakes, though, suffer from the lack of a disciplined approach to collecting, storing and retrieving data to support predictive and prescriptive analytics. This chapter tackles this challenge by introducing a novel standardization framework for managing data in Data Lakes that mainly combines the 5Vs Big Data characteristics and blueprint ontologies. It organizes a Data Lake using a ponds architecture and describes a metadata semantic enrichment mechanism that enables fast storage and efficient retrieval. The mechanism supports Visual Querying and offers increased security via Blockchain and Non-Fungible Tokens. The proposed approach is compared against other known metadata systems using a set of functional properties, with very encouraging results.
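
The abstract gives no implementation detail, so the following is only a minimal sketch of the general idea: a metadata record carrying 5Vs descriptors (Volume, Velocity, Variety, Veracity, Value) is attached to each source and used to route it to a pond and to retrieve it later. The names `Blueprint`, `DataLake`, `PONDS`, the choice of one pond per structure type, and all field values are assumptions made for illustration; they are not taken from the chapter.

```python
from dataclasses import dataclass, field

# Hypothetical pond names: one pond per structure type (an assumption).
PONDS = ("structured", "semi-structured", "unstructured")


@dataclass(frozen=True)
class Blueprint:
    """Illustrative metadata record for one data source (all fields assumed)."""
    source_id: str
    structure: str   # doubles as the Variety descriptor in this sketch
    volume: str
    velocity: str
    veracity: str
    value: str
    keywords: frozenset = field(default_factory=frozenset)

    def __post_init__(self):
        # Reject sources whose structure type has no matching pond.
        if self.structure not in PONDS:
            raise ValueError(f"unknown structure type: {self.structure!r}")


class DataLake:
    """Toy data lake that keeps blueprints grouped by pond."""

    def __init__(self):
        self.ponds = {name: [] for name in PONDS}

    def store(self, bp):
        # Routing by metadata: the blueprint alone decides the target pond.
        self.ponds[bp.structure].append(bp)

    def find(self, structure=None, keyword=None):
        # Restrict the search to one pond when the structure type is known.
        names = [structure] if structure else PONDS
        return [
            bp
            for name in names
            for bp in self.ponds[name]
            if keyword is None or keyword in bp.keywords
        ]
```

In this sketch, retrieval narrows to a single pond whenever the structure type is given, which is one plausible reading of how metadata could support efficient retrieval; the chapter's actual mechanism may differ.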

Acknowledgement

This chapter is part of the outcomes of the CSA Twinning project DESTINI. This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No. 945357. Special thanks are due to Stelios Mappouras for his valuable suggestions on the Blockchain application development.

Author information

Corresponding author

Correspondence to Michalis Pingos.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Pingos, M., Andreou, A.S. (2023). Exploiting Metadata Semantics in Data Lakes Using Blueprints. In: Kaindl, H., Mannion, M., Maciaszek, L.A. (eds) Evaluation of Novel Approaches to Software Engineering. ENASE 2022. Communications in Computer and Information Science, vol 1829. Springer, Cham. https://doi.org/10.1007/978-3-031-36597-3_11

  • DOI: https://doi.org/10.1007/978-3-031-36597-3_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-36596-6

  • Online ISBN: 978-3-031-36597-3

  • eBook Packages: Computer Science, Computer Science (R0)
