Accuracy estimate and optimization techniques for SimRank computation

Published: 01 August 2008

Abstract

Measuring similarity between objects is a useful tool in many areas of computer science, including information retrieval. SimRank is a simple and intuitive measure of this kind, based on a graph-theoretic model. SimRank is typically computed iteratively, in the spirit of PageRank. However, existing work on SimRank lacks an accuracy estimate for the iterative computation and has discouraging time complexity.
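To make the iterative computation concrete, here is a minimal sketch of the naive SimRank iteration as it is commonly stated following Jeh and Widom's definition. It is not the optimized method this paper proposes. The function name `simrank`, the `in_neighbors` dictionary layout, and the default decay factor `C=0.8` are assumptions chosen for illustration.

```python
from itertools import product


def simrank(in_neighbors, C=0.8, iterations=5):
    """Naive iterative SimRank (illustrative sketch, not the paper's optimized algorithm).

    in_neighbors maps every node to the list of nodes that link to it.
    Every node that appears in a list must also be a key of the dict.
    """
    nodes = list(in_neighbors)
    # Initial scores: a node is fully similar to itself, not at all to others.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}

    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            ia, ib = in_neighbors[a], in_neighbors[b]
            if not ia or not ib:
                # A node without in-neighbors has similarity 0 to every other node.
                new_sim[(a, b)] = 0.0
                continue
            # Average the previous-iteration similarity over all in-neighbor pairs.
            total = sum(sim[(x, y)] for x in ia for y in ib)
            new_sim[(a, b)] = C * total / (len(ia) * len(ib))
        sim = new_sim
    return sim


# Hypothetical toy graph: node "u" links to both "a" and "b".
graph = {"u": [], "a": ["u"], "b": ["u"]}
scores = simrank(graph, C=0.8, iterations=3)
print(scores[("a", "b")])  # 0.8
```

Each iteration visits every pair of nodes and every pair of their in-neighbors, which is where the quartic worst-case cost of the naive approach comes from.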

In this paper we present a technique for estimating the accuracy of iterative SimRank computation. This technique gives the number of iterations required to achieve a desired accuracy. We also present optimization techniques that improve the worst-case computational complexity of the iterative algorithm from O(n^4) to O(n^3). Finally, we introduce a threshold-sieving heuristic, together with its accuracy estimate, that further improves the efficiency of the method.
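As a rough illustration of how an accuracy estimate turns into an iteration count, the sketch below assumes a geometric error bound of the form C^(k+1) after k iterations, where C is the SimRank decay factor. The abstract does not state the bound itself, so this form, the function name `iterations_needed`, and the sample values are assumptions of the sketch rather than the paper's stated result.

```python
def iterations_needed(C, eps):
    """Smallest k with C**(k + 1) <= eps, under the ASSUMED bound C**(k + 1).

    C   -- decay factor, strictly between 0 and 1
    eps -- desired accuracy, strictly between 0 and 1
    """
    if not (0.0 < C < 1.0) or not (0.0 < eps < 1.0):
        raise ValueError("C and eps must lie strictly between 0 and 1")
    k = 0
    bound = C  # value of C**(k + 1) for k = 0
    while bound > eps:
        k += 1
        bound *= C
    return k


# Hypothetical values: decay factor 0.8, target accuracy 0.001.
print(iterations_needed(0.8, 0.001))  # 30
```

Under this assumption the required number of iterations depends only on C and the target accuracy, not on the size of the graph, so it can be fixed before the computation starts.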

As a practical illustration of our techniques, we computed SimRank scores on a subset of the English Wikipedia corpus, consisting of the complete set of articles and category links.
