Abstract
The dramatic growth of the Internet has created a new problem for users: locating the relevant sources of documents. This article presents a framework for this problem, which we call the text-source discovery problem, and experimentally analyzes a solution to it. Our approach consists of two phases. First, each text source exports its contents to a centralized service. Second, users present queries to the service, which returns an ordered list of promising text sources. This article describes GlOSS (Glossary of Servers Server) in two versions: bGlOSS, which provides a Boolean query retrieval model, and vGlOSS, which provides a vector-space retrieval model. We also present hGlOSS, a decentralized version of the system. We describe in detail the methodology for measuring the retrieval effectiveness of these systems and provide experimental evidence, based on actual data, that all three systems are highly effective in determining promising text sources for a given query.
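The two phases described above can be illustrated with a minimal sketch in Python. Everything in it is an assumption made for illustration and is not taken from the article: the `SourceSummary` structure (here, per-term document counts exported by each source), the function names, and the scoring rule, which estimates the number of documents matching a conjunctive Boolean query by treating query terms as statistically independent. The abstract does not say what a source exports or how the service scores sources, so this should be read as one plausible instance of the idea rather than the authors' method.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class SourceSummary:
    """Hypothetical summary a text source exports to the central service."""
    name: str
    num_docs: int
    doc_freq: dict[str, int]  # term -> number of documents containing it


def estimate_matches(summary: SourceSummary, terms: list[str]) -> float:
    """Estimate how many documents match an AND query over `terms`.

    Assumption: terms occur independently, so the matching fraction is the
    product of the per-term fractions of documents.
    """
    if summary.num_docs == 0 or not terms:
        return 0.0
    fractions = (summary.doc_freq.get(t, 0) / summary.num_docs for t in terms)
    return summary.num_docs * prod(fractions)


def rank_sources(summaries: list[SourceSummary], terms: list[str]) -> list[str]:
    """Phase two: return source names ordered by estimated match count.

    Sources with an estimate of zero are dropped; ties break by name.
    """
    scored = [(estimate_matches(s, terms), s.name) for s in summaries]
    promising = [(score, name) for score, name in scored if score > 0]
    return [name for _, name in sorted(promising, key=lambda p: (-p[0], p[1]))]


# Phase one: each source exports its summary to the service (made-up numbers).
summaries = [
    SourceSummary("A", 100, {"text": 50, "source": 20}),
    SourceSummary("B", 1000, {"text": 100, "source": 50}),
    SourceSummary("C", 10, {"text": 5}),
]

# Phase two: a user query is answered with an ordered list of sources.
ranking = rank_sources(summaries, ["text", "source"])
```

In this sketch the service never sees the documents themselves at query time; it ranks sources from the exported counts alone, which is what lets a single service cover many sources cheaply.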