Abstract
The number of potentially-related data resources available for querying --- databases, data warehouses, virtual integrated schemas --- continues to grow rapidly. Perhaps no area has seen this problem as acutely as the life sciences, where hundreds of large, complex, interlinked data resources are available on fields like proteomics, genomics, disease studies, and pharmacology. The schemas of individual databases are often large on their own, but users also need to pose queries across multiple sources, exploiting foreign keys and schema mappings. Since the users are not experts, they typically rely on the existence of pre-defined Web forms and associated query templates, developed by programmers to meet the particular scientists' needs. Unfortunately, such forms are scarce commodities, often limited to a single database, and mismatched with biologists' information needs that are often context-sensitive and span multiple databases.
We present a system with which a non-expert user can author new query templates and Web forms, to be reused by anyone with related information needs. The user poses keyword queries that are matched against source relations and their attributes; the system uses sequences of associations (e.g., foreign keys, links, schema mappings, synonyms, and taxonomies) to create multiple ranked queries linking the matches to keywords; the set of queries is attached to a Web query form. Now the user and his or her associates may pose specific queries by filling in parameters in the form. Importantly, the answers to this query are ranked and annotated with data provenance, and the user provides feedback on the utility of the answers, from which the system ultimately learns to assign costs to sources and associations according to the user's specific information need, as a result changing the ranking of the queries used to generate results. We evaluate the effectiveness of our method against "gold standard" costs from domain experts and demonstrate the method's scalability.
- R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern Information Retrieval. ACM Press/Addison-Wesley, 1999. Google ScholarDigital Library
- A. Balmin, V. Hristidis, and Y. Papakonstantinou. Objectrank: Authority-based keyword search in databases. In VLDB, 2004. Google ScholarDigital Library
- A. Bernal, K. Crammer, A. Hatzigeorgiou, and F. Pereira. Global discriminative training for higher-accuracy computational gene prediction. PLoS Computational Biology, 3, 2007.Google Scholar
- G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In ICDE, 2002. Google ScholarDigital Library
- C. Botev and J. Shanmugasundaram. Context-sensitive keyword search and ranking for XML. In WebDB, 2005.Google Scholar
- S. C. Boulakia, O. Biton, S. B. Davidson, and C. Froidevaux. BioGuideSRS: querying multiple sources with a user-centric perspective. Bioinformatics, 23(10), 2007. Google ScholarDigital Library
- P. Buneman, S. Khanna, and W. C. Tan. Why and where: A characterization of data provenance. In ICDT, 2001. Google ScholarDigital Library
- M. J. Carey. BEA Liquid Data for WebLogic: XML-based enterprise information integration. In ICDE, 2004. Google ScholarDigital Library
- M. J. Carey, D. Florescu, Z. G. Ives, Y. Lu, J. Shanmugasundaram, E. Shekita, and S. Subramanian. XPERANTO: Publishing object-relational data as XML. In WebDB '00, 2000.Google Scholar
- S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman, F. Reiss, and M. A. Shah. TelegraphCQ: Continuous dataflow processing for an uncertain world. In CIDR, 2003.Google Scholar
- W. W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In SIGMOD, 1998. Google ScholarDigital Library
- K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551--585, 2006. Google ScholarDigital Library
- Y. Cui. Lineage Tracing in Data Warehouses. PhD thesis, Stanford University, 2001. Google ScholarDigital Library
- N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, 2004. Google ScholarDigital Library
- C. Duin and A. Volgenant. Reduction tests for the steiner problem in graphs. Netw., 19, 1989.Google Scholar
- A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1), 2007. Google ScholarDigital Library
- M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979. Google ScholarDigital Library
- L. Gravano, P. G. Ipeirotis, N. Koudas, and D. Srivastava. Text joins in an RDBMS for web data integration. In WWW, 2003. Google ScholarDigital Library
- T. J. Green, G. Karvounarakis, Z. G. Ives, and V. Tannen. Update exchange with mappings and provenance. In VLDB, 2007. Amended version available as Univ. of Pennsylvania report MS-CIS-07-26. Google ScholarDigital Library
- T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, 2007. Google ScholarDigital Library
- L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram. XRANK: Ranked keyword search over XML documents. In SIGMOD, 2003. Google ScholarDigital Library
- A. Y. Halevy, Z. G. Ives, D. Suciu, and I. Tatarinov. Schema mediation in peer data management systems. In ICDE, March 2003.Google ScholarCross Ref
- V. Hristidis and Y. Papakonstantinou. Discover: Keyword search in relational databases. In VLDB, 2002. Google ScholarDigital Library
- Z. G. Ives, D. Florescu, M. T. Friedman, A. Y. Levy, and D. S. Weld. An adaptive query execution system for data integration. In SIGMOD, 1999. Google ScholarDigital Library
- V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar. Bidirectional expansion for keyword search on graph databases. In VLDB, 2005. Google ScholarDigital Library
- G. Kasneci, F. M. Suchanek, G. Ifrim, M. Ramanath, and G. Weikum. Naga: Searching and ranking knowledge. In ICDE, 2008. Google ScholarDigital Library
- B. Kimelfeld and Y. Sagiv. Finding and approximating top-k answers in keyword proximity search. In PODS, 2006. Google ScholarDigital Library
- J. C. Kissinger, B. P. Brunk, J. Crabtree, M. J. Fraunholz, B. Gajria, A. J. Milgram, D. S. Pearson, J. Schug, A. Bahl, S. J. Diskin, H. Ginsburg, G. R. Grant, D. Gupta, P. Labo, L. Li, M. D. Mailman, S. K. McWeeney, P. Whetzel, C. J. Stoeckert, Jr., and D. S. Roos. The Plasmodium genome database: Designing and mining a eukaryotic genomics resource. Nature, 419, 2002.Google Scholar
- E. L. Lawler. A procedure for computing the k best solutions to discrete optimization problems and its application to the shortest path problem. Management Science, 18, 1972.Google Scholar
- C. Li, K. C.-C. Chang, I. F. Ilyas, and S. Song. RankSQL: Query algebra and optimization for relational top-k queries. In SIGMOD, 2005. Google ScholarDigital Library
- A. Marian, N. Bruno, and L. Gravano. Evaluating top-k queries over web-accessible databases. ACM Trans. Database Syst., 29(2), 2004. Google ScholarDigital Library
- R. McDonald and F. Pereira. Online learning of approximate dependency parsing algorithms. In European Association for Computational Linguistics, 2006.Google Scholar
- P. Mork, R. Shaker, A. Halevy, and P. Tarczy-Hornoch. PQL: A declarative query language over dynamic biological schemata. In AMIA Symposium, November 2002. Google ScholarDigital Library
- A.-M. Popescu, O. Etzioni, and H. Kautz. Towards a theory of natural language interfaces to databases. In IUI '03, 2003. Google ScholarDigital Library
- E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. VLDB J., 10(4), 2001. Google ScholarDigital Library
- R. Tuchinda and C. A. Knoblock. Building mashups by example. In IUI '08, 2008. Google ScholarDigital Library
- R. Varadarajan, V. Hristidis, and L. Raschid. Explaining and reformulating authority flow queries. In ICDE, 2008. Google ScholarDigital Library
- P. Winter. Steiner problem in networks: a survey. Netw., 17(2), 1987. Google ScholarDigital Library
- P. Winter and J. M. Smith. Path-distance heuristics for the steiner problem in undirected networks. Algorithmica, 7(2&3):309--327, 1992.Google ScholarCross Ref
- L. Wolsey. Integer Programming. Wiley-Interscience, 1998.Google Scholar
- R. T. Wong. A dual ascent approach for steiner tree problems on a directed graph. Mathematical Programming, 28(3):271--287, October 1981.Google ScholarCross Ref
- J. Y. Yen. Finding the k shortest loopless paths in a network. Management Science, 18(17), 1971.Google Scholar
Recommendations
Approximating expressive queries on graph-modeled data
We present GeX for the approximate matching of complex queries on graph-modeled data.GeX generalizes existing approaches and allows for querying any graph-based datasets.GeX query language supports queries ranging from keyword-based to complex ones.GeX ...
Ranking queries on uncertain data
Uncertain data is inherent in a few important applications. It is far from trivial to extend ranking queries (also known as top-k queries), a popular type of queries on certain data, to uncertain data. In this paper, we cast ranking queries on uncertain ...
A truly dynamic data structure for top-k queries on uncertain data
SSDBM'11: Proceedings of the 23rd international conference on Scientific and statistical database managementTop-k queries allow end-users to focus on the most important (top-k) (answers amongst those which satisfy the query. In traditional databases, a user defined score function assigns a score value to each tuple and a top-k query returns k tuples with the ...
Comments