ABSTRACT
Graph ranking plays an important role in many applications, such as page ranking on web graphs and entity ranking on social networks. In these applications, rich information on nodes and edges, as well as explicit or implicit human supervision, is often available in addition to the graph structure. Conventional algorithms (e.g., PageRank and HITS), in contrast, compute ranking scores from graph structure alone. A natural question is how to leverage all of this information effectively and efficiently to compute more accurate ranking scores than the conventional algorithms, even when the graph is very large. Previous work has tackled the problem only partially, and the proposed solutions are not satisfactory. This paper addresses the problem and proposes a general framework and an efficient algorithm for graph ranking. Specifically, we define a semi-supervised learning framework for ranking nodes on a very large graph and derive within this framework an efficient algorithm called Semi-Supervised PageRank. In the algorithm, the objective function is defined based on a Markov random walk on the graph. The transition probability and the reset probability of the Markov model are defined as parametric models based on node and edge features. By minimizing the objective function subject to a number of constraints derived from supervision information, we simultaneously learn the optimal parameters of the model and the optimal ranking scores of the nodes. Finally, we show that the algorithm can be made efficient enough to handle a billion-node graph by taking advantage of the sparsity of the graph and implementing it in MapReduce. Experiments on real data from a commercial search engine show that the proposed algorithm can outperform previous algorithms on several tasks.
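To make the random-walk component concrete, the following is a minimal sketch of a PageRank-style power iteration in which the transition and reset probabilities are computed from features. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the parametric forms, so the sketch assumes edge weights and reset scores that are linear in the features and then normalized. The names `feature_pagerank`, `omega`, `phi`, and `alpha` are hypothetical. The sketch covers only the forward scoring step for fixed parameters; the learning of parameters from supervision constraints is not shown.

```python
import numpy as np

def feature_pagerank(n, edges, edge_feats, node_feats, omega, phi,
                     alpha=0.85, iters=100, tol=1e-12):
    """Stationary scores of a random walk with feature-based probabilities.

    n          -- number of nodes
    edges      -- list of (source, target) pairs
    edge_feats -- array of shape (num_edges, k_edge), non-negative
    node_feats -- array of shape (n, k_node), non-negative
    omega, phi -- non-negative parameter vectors (assumed linear models)
    """
    src = np.array([s for s, _ in edges])
    dst = np.array([d for _, d in edges])

    # Assumption: the unnormalized weight of an edge is linear in its features.
    w = edge_feats @ omega
    out_weight = np.zeros(n)
    np.add.at(out_weight, src, w)
    # Normalize per source node so outgoing probabilities sum to 1.
    p = w / out_weight[src]

    # Assumption: the reset distribution is linear in node features, normalized.
    r = node_feats @ phi
    r = r / r.sum()

    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = np.zeros(n)
        # One pass over the edge list: cost is proportional to the number of edges.
        np.add.at(new, dst, alpha * pi[src] * p)
        # Teleport mass and mass held by dangling nodes follow the reset distribution.
        new += (1.0 - new.sum()) * r
        converged = np.abs(new - pi).sum() < tol
        pi = new
        if converged:
            break
    return pi

# Example: a 4-edge graph with a single constant feature on every edge and node,
# which reduces to ordinary PageRank.
edges = [(0, 1), (1, 2), (2, 0), (0, 2)]
scores = feature_pagerank(3, edges, np.ones((4, 1)), np.ones((3, 1)),
                          omega=np.array([1.0]), phi=np.array([1.0]))
```

Each iteration touches every edge once and never forms a dense transition matrix, which is the kind of sparsity the abstract refers to. The per-edge contributions followed by a per-target sum also map naturally onto a map step and a reduce step, although the sketch runs on a single machine.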