DOI: 10.1145/2882903.2903730
research-article
Open Access

Goods: Organizing Google's Datasets

Published: 14 June 2016

ABSTRACT

Enterprises increasingly rely on structured datasets to run their businesses. These datasets take a variety of forms, such as structured files, databases, spreadsheets, or even services that provide access to the data. The datasets often reside in different storage systems, may vary in their formats, and may change every day. In this paper, we present GOODS, a project to rethink how we organize structured datasets at scale, in a setting where teams use diverse and often idiosyncratic ways to produce the datasets and where there is no centralized system for storing and querying them. GOODS extracts metadata ranging from salient information about each dataset (owners, timestamps, schema) to relationships among datasets, such as similarity and provenance. It then exposes this metadata through services that allow engineers to find datasets within the company, to monitor datasets, to annotate them in order to enable others to use their datasets, and to analyze relationships between them. We discuss the technical challenges that we had to overcome in order to crawl and infer the metadata for billions of datasets, to maintain the consistency of our metadata catalog at scale, and to expose the metadata to users. We believe that many of the lessons that we learned are applicable to building large-scale enterprise-level data-management systems in general.

