research-article

Semi-supervised ranking on very large graphs with rich metadata

Published: 21 August 2011

ABSTRACT

Graph ranking plays an important role in many applications, such as page ranking on web graphs and entity ranking on social networks. In such applications, rich information on nodes and edges and explicit or implicit human supervision are often available in addition to the graph structure. In contrast, conventional algorithms (e.g., PageRank and HITS) compute ranking scores from graph structure alone. A natural question arises: how can all of this information be leveraged effectively and efficiently to compute more accurate graph ranking scores than the conventional algorithms, given that the graph is also very large? Previous work has tackled the problem only partially, and the proposed solutions are not satisfactory. This paper addresses the problem and proposes a general framework as well as an efficient algorithm for graph ranking. Specifically, we define a semi-supervised learning framework for ranking nodes on a very large graph and derive within this framework an efficient algorithm called Semi-Supervised PageRank. In the algorithm, the objective function is defined based on a Markov random walk on the graph. The transition probability and the reset probability of the Markov model are defined as parametric models based on features on nodes and edges. By minimizing the objective function, subject to a number of constraints derived from supervision information, we simultaneously learn the optimal parameters of the model and the optimal ranking scores of the nodes. Finally, we show that the algorithm can be made efficient enough to handle a billion-node graph by taking advantage of the sparsity of the graph and implementing it in MapReduce. Experiments on real data from a commercial search engine show that the proposed algorithm can outperform previous algorithms on several tasks.
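
The random-walk model described in the abstract can be illustrated with a minimal sketch. The code below is not the paper's Semi-Supervised PageRank: it only runs the forward random walk for fixed parameters, and it omits the learning step (minimizing the objective under supervision constraints) and the MapReduce implementation. The exponential parameterization, the damping value `alpha`, the dangling-node handling, and all names (`parametric_pagerank`, `omega`, `phi`) are assumptions made for illustration.

```python
import math


def parametric_pagerank(edges, edge_feats, node_feats, omega, phi,
                        alpha=0.85, iters=100):
    """Random walk with reset whose probabilities come from features.

    edges:      list of (src, dst) pairs over nodes 0..n-1
    edge_feats: one feature vector per edge, parallel to `edges`
    node_feats: one feature vector per node
    omega, phi: hypothetical parameter vectors for edges and nodes
    alpha:      probability of following an edge instead of resetting
    """
    n = len(node_feats)

    def dot(w, x):
        return sum(wi * xi for wi, xi in zip(w, x))

    # Assumed form: positive edge weights exp(omega . x), later
    # normalized over the outgoing edges of each source node.
    weight = [math.exp(dot(omega, x)) for x in edge_feats]
    out_sum = [0.0] * n
    for (src, _), w in zip(edges, weight):
        out_sum[src] += w

    # Assumed form: reset distribution is a softmax over node scores.
    node_score = [math.exp(dot(phi, x)) for x in node_feats]
    total = sum(node_score)
    reset = [s / total for s in node_score]

    pi = [1.0 / n] * n
    for _ in range(iters):
        # Mass held by nodes without outgoing edges is sent to the
        # reset distribution so that the scores keep summing to 1.
        dangling = sum(pi[i] for i in range(n) if out_sum[i] == 0.0)
        nxt = [(1.0 - alpha + alpha * dangling) * r for r in reset]
        # Only existing edges are visited, so each iteration costs
        # O(number of edges) rather than O(n^2).
        for (src, dst), w in zip(edges, weight):
            nxt[dst] += alpha * pi[src] * w / out_sum[src]
        pi = nxt
    return pi


# Toy graph with made-up features.
edges = [(0, 1), (1, 2), (2, 0), (0, 2)]
edge_feats = [[1.0], [0.0], [0.0], [0.0]]
node_feats = [[0.0], [0.0], [1.0]]
scores = parametric_pagerank(edges, edge_feats, node_feats,
                             omega=[0.5], phi=[1.0])
```

Setting `omega` and `phi` to zero vectors makes every outgoing edge equally likely and the reset distribution uniform, which reduces this sketch to ordinary PageRank. Learning those parameters from supervision is what the abstract's framework adds on top.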

Published in

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2011
1446 pages
ISBN: 9781450308137
DOI: 10.1145/2020408

Copyright © 2011 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
