research-article

Open Access

Goods: Organizing Google's Datasets

Authors:
Alon Halevy

Recruit Institute of Technology, Mountain View, CA, USA

Recruit Institute of Technology, Mountain View, CA, USA
View Profile

,
Flip Korn

Google Inc., New York City, NY, USA

Google Inc., New York City, NY, USA
View Profile

,
Natalya F. Noy

Google Inc., Mountain View, CA, USA

Google Inc., Mountain View, CA, USA
View Profile

,
Christopher Olston

Google Inc., Mountain View, CA, USA

Google Inc., Mountain View, CA, USA
View Profile

,
Neoklis Polyzotis

Google Inc., Mountain View, CA, USA

Google Inc., Mountain View, CA, USA
View Profile

,
Sudip Roy

Google Inc., Mountain View, CA, USA

Google Inc., Mountain View, CA, USA
View Profile

,
Steven Euijong Whang

Google Inc., Mountain View, CA, USA

Google Inc., Mountain View, CA, USA
View Profile

SIGMOD '16: Proceedings of the 2016 International Conference on Management of DataJune 2016Pages 795–806https://doi.org/10.1145/2882903.2903730

Published:14 June 2016Publication History

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Pages 795–806

ABSTRACT

Enterprises increasingly rely on structured datasets to run their businesses. These datasets take a variety of forms, such as structured files, databases, spreadsheets, or even services that provide access to the data. The datasets often reside in different storage systems, may vary in their formats, may change every day. In this paper, we present GOODS, a project to rethink how we organize structured datasets at scale, in a setting where teams use diverse and often idiosyncratic ways to produce the datasets and where there is no centralized system for storing and querying them. GOODS extracts metadata ranging from salient information about each dataset (owners, timestamps, schema) to relationships among datasets, such as similarity and provenance. It then exposes this metadata through services that allow engineers to find datasets within the company, to monitor datasets, to annotate them in order to enable others to use their datasets, and to analyze relationships between them. We discuss the technical challenges that we had to overcome in order to crawl and infer the metadata for billions of datasets, to maintain the consistency of our metadata catalog at scale, and to expose the metadata to users. We believe that many of the lessons that we learned are applicable to building large-scale enterprise-level data-management systems in general.

References

Azure data lake. https://azure.microsoft.com/en-us/solutions/data-lake/.Google Scholar
Azure marketplace. http://datamarket.azure.com/browse/data.Google Scholar
CKAN. http://ckan.org.Google Scholar
Data lakes and the promise of unsiloed data. http://www.pwc.com/us/en/technology-forecast/2014/cloud-computing/features/data-lakes.html.Google Scholar
Quandl. https://www.quandl.com.Google Scholar
A universally unique identifier (uuid) urn namespace. https://www.ietf.org/rfc/rfc4122.txt.Google Scholar
S. Balakrishnan, A. Y. Halevy, B. Harb, H. Lee, J. Madhavan, A. Rostamizadeh, W. Shen, K. Wilder, F. Wu, and C. Yu. Applying webtables in practice. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, 2015.Google Scholar
A. P. Bhardwaj, S. Bhattacherjee, A. Chavan, A. Deshpande, A. J. Elmore, S. Madden, and A. G. Parameswaran. DataHub: Collaborative data science & dataset version management at scale. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, 2015.Google Scholar
A. P. Bhardwaj, A. Deshpande, A. J. Elmore, D. R. Karger, S. Madden, A. G. Parameswaran, H. Subramanyam, E. Wu, and R. Zhang. Collaborative data analytics with DataHub. PVLDB, 8(12):1916--1927, 2015. Google ScholarDigital Library
S. Bhattacherjee, A. Chavan, S. Huang, A. Deshpande, and A. G. Parameswaran. Principles of dataset versioning: Exploring the recreation/storage tradeoff. PVLDB, 8(12):1346--1357, 2015. Google ScholarDigital Library
A. Brown. Get smarter answers from the knowledge graph. http://insidesearch.blogspot.com/2012/12/get-smarter-answers-from-knowledge_4.html, 2012.Google Scholar
M. J. Cafarella, A. Y. Halevy, D. Z. Wang, E. Wu, and Y. Zhang. Webtables: exploring the power of tables on the web. PVLDB, 1(1):538--549, 2008. Google ScholarDigital Library
F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2):4:1--4:26, June 2008. Google ScholarDigital Library
J. Cheney, L. Chiticariu, and W.-C. Tan. Provenance in databases: Why, how, and where. Found. Trends databases, 1(4):379--474, Apr. 2009. Google ScholarDigital Library
P. Flajolet, E. Fusy, G. O., and F. Meunier. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. Analysis of Algorithms (AOFA), 2007.Google ScholarCross Ref
M. Franklin, A. Halevy, and D. Maier. From databases to dataspaces: A new abstraction for information management. SIGMOD Rec., 34(4):27--33, Dec. 2005. Google ScholarDigital Library
I. Konstantinou, E. Angelou, D. Tsoumakos, and N. Koziris. Distributed indexing of web scale datasets for the cloud. In Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud, MDAC '10, pages 1:1--1:6, 2010. Google ScholarDigital Library
A. W. Leung, M. Shao, T. Bisson, S. Pasupathy, and E. L. Miller. Spyglass: Fast, scalable metadata search for large-scale storage systems. In M. I. Seltzer and R. Wheeler, editors, FAST, pages 153--166. USENIX, 2009. Google ScholarDigital Library
S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: interactive analysis of web-scale datasets. Commun. ACM, 54(6):114--123, 2011. Google ScholarDigital Library
K.-K. Muniswamy-Reddy, D. A. Holland, U. Braun, and M. Seltzer. Provenance-aware storage systems. In Proceedings of the Annual Conference on USENIX '06 Annual Technical Conference, pages 43--56, 2006. Google ScholarDigital Library
P. Rao and B. Moon. An internet-scale service for publishing and locating xml documents. In Proceedings of the 2009 Int'l Conference on Data Engineering (ICDE), pages 1459--1462, 2009. Google ScholarDigital Library
I. Terrizzano, P. M. Schwarz, M. Roth, and J. E. Colino. Data wrangling: The challenging journey from the wild to the lake. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, 2015.Google Scholar
K. Varda. Protocol buffers: Google's data interchange format. Google Open Source Blog, Accessed July, 2008.Google Scholar
J. Widom. Trio: A system for integrated management of data, accuracy, and lineage. In CIDR, pages 262--276, 2005.Google Scholar
L. Xu, H. Jiang, X. Liu, L. Tian, Y. Hua, and J. Hu. Propeller: A scalable metadata organization for a versatile searchable file system. Technical Report 119, Department of Computer Science and Engineering, University of Nebraska-Lincoln, 2011.Google Scholar
M. Yakout, K. Ganjam, K. Chakrabarti, and S. Chaudhuri. InfoGather: entity augmentation and attribute discovery by holistic matching with web tables. In K. S. Candan, Y. Chen, R. T. Snodgrass, L. Gravano, and A. Fuxman, editors, SIGMOD Conference, pages 97--108. ACM, 2012. Google ScholarDigital Library

Index Terms

Goods: Organizing Google's Datasets
1. Information systems
  1. Data management systems
    1. Database management system engines
    2. Information integration

Recommendations

IndeGx: A model and a framework for indexing RDF knowledge graphs with SPARQL-based test suits
Abstract
In recent years, a large number of RDF datasets have been built and published on the Web in fields as diverse as linguistics or life sciences, as well as general datasets such as DBpedia or Wikidata. The joint exploitation of these ...
Read More
Serverless Workflows for Indexing Large Scientific Data
WOSC '19: Proceedings of the 5th International Workshop on Serverless Computing

The use and reuse of scientific data is ultimately dependent on the ability to understand what those data represent, how they were captured, and how they can be used. In many ways, data are only as useful as the metadata available to describe them. ...
Read More
Accountable Data: The Politics and Pragmatics of Disclosure Datasets
FAccT '22: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency

This paper attends specifically to what I call “disclosure datasets” - tabular datasets produced in accordance with laws requiring various kinds of disclosure. For the purposes of this paper, the most significant defining feature of disclosure datasets ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
June 2016
2300 pages
ISBN:9781450335317
DOI:10.1145/2882903
General Chairs:
Fatma Özcan
IBM Research, USA
,
Georgia Koutrika
HP Labs, USA
,
Program Chair:
Sam Madden
Massachusetts Institute of Technology, USA
Copyright © 2016 Owner/Author
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives International 4.0 License.
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 14 June 2016
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
data culture
data flow
data lakes
data monitoring
data organization
data provenance
data search
enterprise data management
metadata extraction
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate785of4,003submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 114
  Total Citations
  View Citations
- 7,462
  Total Downloads
- Downloads (Last 12 months)784
- Downloads (Last 6 weeks)114
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Goods: Organizing Google's Datasets

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

IndeGx: A model and a framework for indexing RDF knowledge graphs with SPARQL-based test suits

Serverless Workflows for Indexing Large Scientific Data

Accountable Data: The Politics and Pragmatics of Disclosure Datasets

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Goods: Organizing Google's Datasets

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

IndeGx: A model and a framework for indexing RDF knowledge graphs with SPARQL-based test suits

Serverless Workflows for Indexing Large Scientific Data

Accountable Data: The Politics and Pragmatics of Disclosure Datasets

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media