Learning to create data-integrating queries

Published: 01 August 2008

Abstract

The number of potentially related data resources available for querying (databases, data warehouses, virtual integrated schemas) continues to grow rapidly. Perhaps no area has seen this problem as acutely as the life sciences, where hundreds of large, complex, interlinked data resources are available in fields such as proteomics, genomics, disease studies, and pharmacology. The schemas of individual databases are often large on their own, but users also need to pose queries across multiple sources, exploiting foreign keys and schema mappings. Since the users are not experts, they typically rely on the existence of pre-defined Web forms and associated query templates, developed by programmers to meet particular scientists' needs. Unfortunately, such forms are scarce commodities, often limited to a single database, and mismatched with biologists' information needs, which are often context-sensitive and span multiple databases.

We present a system with which a non-expert user can author new query templates and Web forms, to be reused by anyone with related information needs. The user poses keyword queries that are matched against source relations and their attributes; the system uses sequences of associations (e.g., foreign keys, links, schema mappings, synonyms, and taxonomies) to create multiple ranked queries linking the matches to keywords; the set of queries is attached to a Web query form. The user and his or her associates may then pose specific queries by filling in parameters in the form. Importantly, the answers to these queries are ranked and annotated with data provenance. The user provides feedback on the utility of the answers, from which the system learns to assign costs to sources and associations according to the user's specific information need, thereby changing the ranking of the queries used to generate results. We evaluate the effectiveness of our method against "gold standard" costs from domain experts and demonstrate the method's scalability.
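
The abstract does not spell out the ranking or learning procedure, so the following is only a rough sketch of the general idea, not the authors' algorithm. It assumes a tiny hypothetical schema graph (the relation names, the uniform starting costs, and the helpers `rank` and `apply_feedback` are all invented for illustration), treats each candidate query as a path of associations between two keyword-matched relations, ranks candidates by total association cost, and uses a simple additive update in place of whatever learner the paper actually employs.

```python
# Illustrative sketch only: relation names, costs, and the update rule
# are assumptions, not taken from the paper.

# Undirected associations (e.g., foreign keys or mappings) between relations,
# each with an assumed starting cost of 1.0.
INITIAL_COSTS = {
    frozenset(("gene", "protein")): 1.0,
    frozenset(("protein", "disease")): 1.0,
    frozenset(("gene", "publication")): 1.0,
    frozenset(("publication", "disease")): 1.0,
    frozenset(("gene", "disease")): 1.0,
}


def edges(path):
    """Return the set of associations used by a path of relations."""
    return {frozenset(e) for e in zip(path, path[1:])}


def path_cost(path, costs):
    """Total cost of a candidate query: the sum of its association costs."""
    return sum(costs[e] for e in edges(path))


def neighbors(node, costs):
    """Relations directly associated with `node`, in a stable order."""
    return sorted(next(iter(e - {node})) for e in costs if node in e)


def candidate_queries(src, dst, costs, path=None):
    """Enumerate loop-free association paths linking two matched relations."""
    path = path or [src]
    if path[-1] == dst:
        yield tuple(path)
        return
    for nxt in neighbors(path[-1], costs):
        if nxt not in path:
            yield from candidate_queries(src, dst, costs, path + [nxt])


def rank(src, dst, costs):
    """Order candidate queries by ascending cost (ties broken by name)."""
    return sorted(candidate_queries(src, dst, costs),
                  key=lambda p: (path_cost(p, costs), p))


def apply_feedback(costs, preferred, rejected, step=0.6, floor=0.05):
    """Return new costs after the user prefers one query over another.

    Associations used only by the preferred query get cheaper; those used
    only by the rejected query get more expensive. This additive rule is a
    stand-in for a real online learner.
    """
    updated = dict(costs)
    for e in edges(preferred) - edges(rejected):
        updated[e] = max(floor, updated[e] - step)
    for e in edges(rejected) - edges(preferred):
        updated[e] += step
    return updated


if __name__ == "__main__":
    print(rank("gene", "disease", INITIAL_COSTS))
    learned = apply_feedback(INITIAL_COSTS,
                             preferred=("gene", "protein", "disease"),
                             rejected=("gene", "disease"))
    print(rank("gene", "disease", learned))
```

With uniform costs the direct gene-disease association ranks first; after one round of feedback favoring the path through the protein relation, that path moves to the top. Exhaustive path enumeration is used here only because the toy graph is small.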
