Learning to create data-integrating queries

Authors:
Partha Pratim Talukdar, University of Pennsylvania, Philadelphia, PA
Marie Jacob, University of Pennsylvania, Philadelphia, PA
Muhammad Salman Mehmood, University of Pennsylvania, Philadelphia, PA
Koby Crammer, University of Pennsylvania, Philadelphia, PA
Zachary G. Ives, University of Pennsylvania, Philadelphia, PA
Fernando Pereira, University of Pennsylvania, Philadelphia, PA
Sudipto Guha, University of Pennsylvania, Philadelphia, PA

Proceedings of the VLDB Endowment, Volume 1, Issue 1, pp. 785–796
https://doi.org/10.14778/1453856.1453941

Published: 01 August 2008

Abstract

The number of potentially related data resources available for querying (databases, data warehouses, virtual integrated schemas) continues to grow rapidly. Perhaps no area has seen this problem as acutely as the life sciences, where hundreds of large, complex, interlinked data resources are available on fields like proteomics, genomics, disease studies, and pharmacology. The schemas of individual databases are often large on their own, but users also need to pose queries across multiple sources, exploiting foreign keys and schema mappings. Since the users are not experts, they typically rely on the existence of pre-defined Web forms and associated query templates, developed by programmers to meet the particular scientists' needs. Unfortunately, such forms are scarce commodities, often limited to a single database, and mismatched with biologists' information needs that are often context-sensitive and span multiple databases.

We present a system with which a non-expert user can author new query templates and Web forms, to be reused by anyone with related information needs. The user poses keyword queries that are matched against source relations and their attributes; the system uses sequences of associations (e.g., foreign keys, links, schema mappings, synonyms, and taxonomies) to create multiple ranked queries linking the matches to keywords; the set of queries is attached to a Web query form. The user and his or her associates may then pose specific queries by filling in parameters in the form. Importantly, the answers to these queries are ranked and annotated with data provenance, and the user provides feedback on the utility of the answers. From this feedback the system learns to assign costs to sources and associations according to the user's specific information need, which in turn changes the ranking of the queries used to generate results. We evaluate the effectiveness of our method against "gold standard" costs from domain experts and demonstrate the method's scalability.
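
The idea of ranking candidate queries by association cost and re-ranking them after user feedback can be illustrated with a small sketch. This is not the authors' implementation: the relation names, the costs, and the simple additive update rule below are all hypothetical, and the abstract does not specify the learning algorithm. The sketch only shows how edge costs on a schema graph determine the order of candidate join paths, and how feedback on a preferred result can change that order.

```python
# Toy schema graph: nodes stand for relations, edges for associations
# (foreign keys, synonyms, ...). All names and costs are hypothetical.

def edge_key(a, b):
    """Canonical key for an undirected edge."""
    return tuple(sorted((a, b)))

def simple_paths(graph, src, dst, path=None):
    """Yield every cycle-free path from src to dst."""
    path = (path or []) + [src]
    if src == dst:
        yield path
        return
    for nxt in graph[src]:
        if nxt not in path:
            yield from simple_paths(graph, nxt, dst, path)

def path_edges(path):
    return [edge_key(a, b) for a, b in zip(path, path[1:])]

def path_cost(path, costs):
    return sum(costs[e] for e in path_edges(path))

def ranked_paths(graph, costs, src, dst):
    """Candidate join paths, cheapest first (ties broken by node names)."""
    return sorted(simple_paths(graph, src, dst),
                  key=lambda p: (path_cost(p, costs), p))

def feedback_update(costs, preferred, rejected, step=0.5, floor=0.1):
    """Assumed additive update: make edges used only by the preferred
    path cheaper and edges used only by the rejected path costlier."""
    new_costs = dict(costs)
    pref, rej = set(path_edges(preferred)), set(path_edges(rejected))
    for e in pref - rej:
        new_costs[e] = max(floor, new_costs[e] - step)
    for e in rej - pref:
        new_costs[e] = new_costs[e] + step
    return new_costs

graph = {
    "gene": ["protein", "synonym"],
    "protein": ["gene", "disease"],
    "synonym": ["gene", "disease"],
    "disease": ["protein", "synonym"],
}
costs = {
    edge_key("gene", "protein"): 1.0,
    edge_key("protein", "disease"): 1.0,
    edge_key("gene", "synonym"): 1.0,
    edge_key("synonym", "disease"): 1.5,
}

# Keywords matched "gene" and "disease"; rank the ways to connect them.
before = ranked_paths(graph, costs, "gene", "disease")

# The user marks answers from the second-ranked path as more useful.
after_costs = feedback_update(costs, preferred=before[1], rejected=before[0])
after = ranked_paths(graph, after_costs, "gene", "disease")

print("before:", before[0])
print("after: ", after[0])
```

In this toy setup the path through the protein relation is ranked first initially, and the path through the synonym relation takes over after one round of feedback. A real system would connect more than two keyword matches (a tree rather than a path) and would need a principled update rule rather than a fixed step.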

Published in
Proceedings of the VLDB Endowment, Volume 1, Issue 1
August 2008
1216 pages
ISSN: 2150-8097
Editors: Peter Buneman, Beng Chin Ooi, Kenneth Ross, Gerald Weber
Publisher: VLDB Endowment