GlOSS: text-source discovery over the Internet

Published: 01 June 1999
Abstract

The dramatic growth of the Internet has created a new problem for users: location of the relevant sources of documents. This article presents a framework for (and experimentally analyzes a solution to) this problem, which we call the text-source discovery problem. Our approach consists of two phases. First, each text source exports its contents to a centralized service. Second, users present queries to the service, which returns an ordered list of promising text sources. This article describes GlOSS, Glossary of Servers Server, with two versions: bGlOSS, which provides a Boolean query retrieval model, and vGlOSS, which provides a vector-space retrieval model. We also present hGlOSS, which provides a decentralized version of the system. We extensively describe the methodology for measuring the retrieval effectiveness of these systems and provide experimental evidence, based on actual data, that all three systems are highly effective in determining promising text sources for a given query.
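
The two phases described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `SourceSummary`, `export_summary`, `estimate_matches`, and `rank_sources` are hypothetical, the summary format (a document count plus per-word document frequencies) is an assumption, and the scoring assumes that query words occur independently of one another within a source, in the spirit of a Boolean-style estimate.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class SourceSummary:
    """Hypothetical summary a text source exports to the central service."""
    name: str
    num_docs: int
    doc_freq: dict = field(default_factory=dict)  # word -> number of docs containing it


def export_summary(name, documents):
    """Phase 1: summarize a source as per-word document counts."""
    doc_freq = {}
    for text in documents:
        for word in set(text.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return SourceSummary(name, len(documents), doc_freq)


def estimate_matches(summary, terms):
    """Estimate how many documents contain ALL terms.

    Assumption: terms occur independently, so the estimate is
    num_docs * product(doc_freq[t] / num_docs).
    """
    if summary.num_docs == 0:
        return 0.0
    return summary.num_docs * prod(
        summary.doc_freq.get(t, 0) / summary.num_docs for t in terms
    )


def rank_sources(summaries, terms):
    """Phase 2: return promising sources, best estimate first."""
    scored = [(s.name, estimate_matches(s, terms)) for s in summaries]
    promising = [pair for pair in scored if pair[1] > 0]
    return sorted(promising, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    db_a = export_summary("A", ["text retrieval systems",
                                "distributed text databases",
                                "boolean retrieval"])
    db_b = export_summary("B", ["cooking recipes", "text about cooking"])
    print(rank_sources([db_a, db_b], ["text", "retrieval"]))
```

Note that the service only ever sees the summaries, never the documents, which is why the ranking is an estimate rather than an exact answer.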

            Reviews

            Edward Y. Lee

            The idea of a Glossary of Servers Server (GlOSS) has been discussed in several other publications. This paper expands the idea into three versions of the implementation, applied to the discovery of text sources available over the Internet. For GlOSS to work, the text sources must first export their contents to a centralized GlOSS server, which can serve many users submitting queries. GlOSS then presents a list of the sources that the system deems the best sources of the documents the users want; the users must then go to those sources and issue additional queries to retrieve the documents themselves. GlOSS is therefore a database of metadata about other databases, but it also provides a metric indicating that the sources it returns merit a subsequent visit by the users. It should be pointed out that, for GlOSS to work, each collection of source documents must be willing to cooperate by submitting GlOSS-compatible summaries to the metadata server(s). Such cooperation may or may not be viable in a competitive commercial environment.

            The first version of GlOSS is called vGlOSS, which is based on the vector-space retrieval model. In this model, a document is represented as a normalized m-dimensional vector in which each dimension corresponds to a word and holds the weight assigned to that word. This representation is the basis for the goodness (G) measure that is used to rank the databases containing the documents. For a given query, G is computed for each database as a sum of similarities, each ranging from 0 to 1. However, to simplify the collection of the goodness measure, an estimated value is used instead of the ideal value. GlOSS then provides the users with a list of databases ranked by G. The authors use both theoretical analyses and experiments with actual databases to show the validity of the G metric. They show that the list of databases GlOSS derives from G correctly reflects the list obtained by running the actual vector-space retrieval against the databases themselves.

            The second version of GlOSS is called bGlOSS, which is based on the Boolean retrieval model. To arrive at a goodness (G) metric, bGlOSS uses an independence-based estimation scheme to estimate the number of documents in a given database that match a given query. The authors used the United States patents for 1991 as the experimental database. They divided the patents into 500 databases and used actual Boolean queries that real users at Stanford University had issued against the INSPEC database of physics, electrical engineering, and computer science bibliographic records. Their analysis of the experiment shows that the G metric gives a very good measure of the effectiveness of the bGlOSS estimation scheme. They also show that the storage requirement for bGlOSS is only a little over 2 percent of the full index storage of the INSPEC database.

            The third version of GlOSS is called hGlOSS, which is the hierarchical version of vGlOSS. In this case, a number of vGlOSS servers report to a centralized hGlOSS server, which holds information about the servers under its control. The centralized hGlOSS keeps track of the goodness (G) measures using two numbers for each of the databases, as reported by each of the vGlOSS servers. Only those vGlOSS databases reporting to the centralized hGlOSS can be searched or reached by the users. Using these two numbers, hGlOSS is able to return a goodness metric to the users. The list of vGlOSS servers with their associated G values lets the users pick the best vGlOSS server to contact and, in turn, obtain the list of the most effective databases for their target documents. Since the storage requirement for the hGlOSS servers is very small, they can be replicated over the Internet to serve many users. The authors mention that a similar hGlOSS could be constructed for bGlOSS.

            The authors have demonstrated the usefulness of the GlOSS system for text-source discovery. However, it remains to be tested in a real-world environment in which all the text-source databases have to report each updated document, with the related goodness measures, to the proper GlOSS server. In addition, the set of reporting databases must be inclusive enough to cover the full complement of text-source databases. The scheme could be more user friendly if links to the other GlOSS servers or databases were also stored with the returned information, but this would greatly increase the storage requirement. This paper should be of interest to anyone interested in document retrieval. It includes more than 38 references.
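
The vGlOSS goodness measure discussed in the review can be sketched in Python as follows. This is an illustrative reading of the review's description, not the paper's code: it computes the ideal G directly from full document vectors (which GlOSS itself replaces with an estimate built from summaries), the function names are hypothetical, and the `threshold` parameter is an assumption added so that only sufficiently similar documents contribute.

```python
from math import sqrt


def normalize(weights):
    """Scale a sparse word -> weight vector to unit length."""
    norm = sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return {}
    return {term: w / norm for term, w in weights.items()}


def similarity(query, doc):
    """Inner product of two normalized sparse vectors.

    With non-negative weights this lies between 0 and 1.
    """
    return sum(w * doc.get(term, 0.0) for term, w in query.items())


def goodness(query, documents, threshold=0.0):
    """Ideal G for one database: sum of similarities above the threshold."""
    q = normalize(query)
    sims = (similarity(q, normalize(d)) for d in documents)
    return sum(s for s in sims if s > threshold)


def rank_databases(query, databases, threshold=0.0):
    """Return (database name, G) pairs, highest G first."""
    scores = {name: goodness(query, docs, threshold)
              for name, docs in databases.items()}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    databases = {
        "db1": [{"text": 1.0}, {"text": 1.0, "source": 1.0}],
        "db2": [{"cooking": 1.0}],
    }
    for name, g in rank_databases({"text": 1.0}, databases):
        print(name, round(g, 4))
```

Because each similarity is at most 1 while G sums over documents, G for a database can exceed 1; it is the relative order of the databases that matters for the ranked list.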
