GlOSS: text-source discovery over the Internet

Authors:
Luis Gravano (Columbia Univ., New York, NY),
Héctor García-Molina (Stanford Univ., Stanford, CA),
Anthony Tomasic (INRIA Rocquencourt, Le Chesnay, France)

ACM Transactions on Database Systems, Volume 24, Issue 2, pp. 229–264. https://doi.org/10.1145/320248.320252

Published: 01 June 1999

Abstract

The dramatic growth of the Internet has created a new problem for users: location of the relevant sources of documents. This article presents a framework for (and experimentally analyzes a solution to) this problem, which we call the text-source discovery problem. Our approach consists of two phases. First, each text source exports its contents to a centralized service. Second, users present queries to the service, which returns an ordered list of promising text sources. This article describes GlOSS, Glossary of Servers Server, with two versions: bGlOSS, which provides a Boolean query retrieval model, and vGlOSS, which provides a vector-space retrieval model. We also present hGlOSS, which provides a decentralized version of the system. We extensively describe the methodology for measuring the retrieval effectiveness of these systems and provide experimental evidence, based on actual data, that all three systems are highly effective in determining promising text sources for a given query.
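
The two-phase flow described in the abstract can be illustrated with a minimal Python sketch. The class name `SourceDirectory`, its methods, the form of the exported summary (per-word document counts), and the toy scoring rule are all assumptions made for illustration; they are not the GlOSS implementation.

```python
from collections import Counter


class SourceDirectory:
    """Hypothetical central service holding one summary per text source."""

    def __init__(self):
        # source name -> Counter mapping each word to the number of
        # documents in that source that contain it
        self.summaries = {}

    def export(self, source, documents):
        # Phase 1: a source exports a summary of its contents.
        doc_freq = Counter()
        for doc in documents:
            doc_freq.update(set(doc.lower().split()))
        self.summaries[source] = doc_freq

    def rank(self, query):
        # Phase 2: score every source for the query and return the
        # promising ones, best first. Toy score (an assumption): the
        # summed document frequency of the query words.
        words = query.lower().split()
        scores = {
            source: sum(doc_freq[w] for w in words)
            for source, doc_freq in self.summaries.items()
        }
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [source for source, score in ranked if score > 0]


directory = SourceDirectory()
directory.export("db_a", ["distributed text retrieval", "text databases"])
directory.export("db_b", ["image retrieval"])
directory.export("db_c", ["protein folding"])
print(directory.rank("text retrieval"))
```

In this sketch the service never stores the documents themselves, only the exported summaries, which is what keeps the central index small relative to the sources.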

Index Terms

1. Applied computing
  1. Computers in other domains
    1. Digital libraries and archives
2. Information systems

Reviews

Reviewer: Edward Y. Lee

The idea of a Glossary of Servers Server (GlOSS) has been discussed in several other publications. This paper expands the idea into three versions of the implementation, applied to the discovery of text sources available over the Internet. For GlOSS to work, the text sources must first export their contents to a centralized GlOSS server, which can serve many users who submit queries to it. GlOSS then presents a list of the sources that the system deems the best sources of the documents the users want; the users must then go to those sources and issue additional queries to retrieve the documents they need. GlOSS is therefore a database of metadata about other databases, but it also provides a metric indicating that the sources it returns merit a subsequent visit by users looking for the relevant documents.

It should be pointed out that, for GlOSS to work, each collection of source documents must be willing to cooperate by submitting GlOSS-compatible summaries to the metadata server(s). Such cooperation may or may not be viable in a competitive commercial environment.

The first version of GlOSS is called vGlOSS, which is based on the vector-space retrieval model. In this model, a document is represented as a normalized m-dimensional vector in which each dimension corresponds to a word and the weight assigned to that word is the vector's component along that dimension. This is the basis for evaluating the goodness (G) measure that is used to rank the databases containing the documents. For a given vector-space query, G is computed for each database as a sum of similarities, each ranging from 0 to 1. However, to simplify the collection of the goodness measure, an estimated value is used instead of the ideal value. GlOSS then provides the users with a list of databases, ranked by G. The authors use both theoretical analyses and experiments with actual databases to show the validity of the G metric.
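
The goodness measure described above can be sketched in Python as follows. This computes the ideal value directly from the documents, not the estimate that vGlOSS uses in its place. The function names, the toy databases, and the `threshold` parameter are assumptions introduced for illustration.

```python
import math


def normalize(weights):
    """Scale a word -> weight mapping to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def similarity(query, doc):
    """Dot product of two unit vectors; with non-negative weights it lies in [0, 1]."""
    return sum(w * doc.get(t, 0.0) for t, w in query.items())


def goodness(query, database, threshold=0.0):
    """Sum of the query-document similarities that exceed `threshold`.

    The threshold is an assumption of this sketch; with the default of 0,
    every document sharing a word with the query contributes.
    """
    q = normalize(query)
    sims = (similarity(q, normalize(doc)) for doc in database)
    return sum(s for s in sims if s > threshold)


# Hypothetical databases: each document is a word -> weight mapping.
databases = {
    "db_a": [{"text": 1.0, "retrieval": 1.0}, {"text": 2.0}],
    "db_b": [{"image": 1.0, "retrieval": 1.0}],
}
query = {"text": 1.0}
ranking = sorted(databases, key=lambda name: goodness(query, databases[name]), reverse=True)
print(ranking)
```

Because each similarity is at most 1, the goodness of a database is bounded by its number of documents, so larger databases with many moderately similar documents can outrank small ones with a single close match.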
They show that the list of databases GlOSS produces using G is a correct representation of the list obtained by running the actual vector-space retrieval against the databases themselves.

The second version of GlOSS is called bGlOSS, which is based on the Boolean retrieval model. To arrive at a goodness (G) metric, bGlOSS uses an independent estimation scheme to estimate the number of documents in a given database that match a given query. The authors used the United States patents for 1991 as the experimental database. They divided the patents into 500 databases and used actual Boolean queries that real users had issued at Stanford University to the INSPEC database of physics, electrical engineering, and computer science bibliographic records. Their analyses of the experiment show that the G metric gives a very good measure of the effectiveness of the bGlOSS estimation scheme. They also show that the storage requirement for bGlOSS is only a little over 2 percent of the full index storage of the INSPEC database.

The third version of GlOSS is called hGlOSS, which is the hierarchical version of vGlOSS. In this case, a number of vGlOSS servers report to a centralized hGlOSS, which holds information about the servers under its control. The centralized hGlOSS keeps track of the goodness (G) measures using two numbers for each database, as reported by each vGlOSS server. Only those vGlOSS databases that report to the centralized hGlOSS can be searched or reached by users. Using these two numbers, hGlOSS is able to return a goodness metric to the users. The list of vGlOSS servers with their associated G values allows users to pick the best vGlOSS server to contact and, in turn, to obtain the list of the most effective databases for their target documents.
Since the storage requirement for the hGlOSS servers is very small, they can be replicated over the Internet to serve many users. The authors mention that a similar hGlOSS could be constructed for bGlOSS.

The authors have demonstrated the usefulness of the GlOSS system for text-source discovery. However, it remains to be tested in a real-world environment in which all the text-source databases must report each updated document, with the related goodness measures, to the proper GlOSS. In addition, the set of reporting databases must be inclusive, so that a full complement of text-source databases can be used. The scheme could be more user friendly if the links to the other GlOSS servers or databases were also stored with the returned information, but this would greatly increase the storage requirement.

This paper should be of interest to anyone interested in document retrieval. It includes more than 38 references.
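
The estimation scheme the review describes for bGlOSS can be sketched in Python under three stated assumptions: the query is a conjunction (AND) of words, the summary kept for each database holds only its document count and per-word document frequencies, and query words are treated as occurring independently of one another. The function name and the sample numbers are hypothetical.

```python
def estimate_matches(db_size, doc_freq, query_words):
    """Estimate how many documents contain all query words.

    Assumes words occur independently, so the fraction of documents that
    match is the product of the per-word fractions.
    """
    if db_size == 0:
        return 0.0
    estimate = float(db_size)
    for word in query_words:
        estimate *= doc_freq.get(word, 0) / db_size
    return estimate


# Hypothetical summaries: database -> (document count, word -> document frequency).
summaries = {
    "db_a": (100, {"database": 40, "retrieval": 50}),
    "db_b": (1000, {"database": 100, "retrieval": 20}),
    "db_c": (200, {"database": 80}),
}
query = ["database", "retrieval"]
estimates = {
    name: estimate_matches(size, doc_freq, query)
    for name, (size, doc_freq) in summaries.items()
}
ranking = sorted(estimates, key=estimates.get, reverse=True)
print(ranking)
```

A summary of this kind needs one count per distinct word per database rather than a full inverted index, which is consistent with the small storage requirement the review reports.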

Published in

ACM Transactions on Database Systems, Volume 24, Issue 2
June 1999
142 pages
ISSN: 0362-5915
EISSN: 1557-4644
DOI: 10.1145/320248

Copyright © 1999 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Internet search and retrieval
digital libraries
distributed information retrieval
text databases
Qualifiers
- article