Article

Combining content and link for classification using matrix factorization

Authors:
Shenghuo Zhu

NEC Laboratories America: Inc., Cupertino, CA

NEC Laboratories America: Inc., Cupertino, CA
View Profile

,
Kai Yu

NEC Laboratories America: Inc., Cupertino, CA

NEC Laboratories America: Inc., Cupertino, CA
View Profile

,
Yun Chi

NEC Laboratories America: Inc., Cupertino, CA

NEC Laboratories America: Inc., Cupertino, CA
View Profile

,
Yihong Gong

NEC Laboratories America: Inc., Cupertino, CA

NEC Laboratories America: Inc., Cupertino, CA
View Profile

SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrievalJuly 2007Pages 487–494https://doi.org/10.1145/1277741.1277825

Published:23 July 2007Publication History

SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval

Pages 487–494

ABSTRACT

The world wide web contains rich textual contents that areinterconnected via complex hyperlinks. This huge database violates the assumption held by most of conventional statistical methods that each web page is considered as an independent and identical sample. It is thus difficult to apply traditional mining or learning methods for solving web mining problems, e.g., web page classification, by exploiting both the content and the link structure. The research in this direction has recently received considerable attention but are still in an early stage. Though a few methods exploit both the link structure or the content information, some of them combine the only authority information with the content information, and the others first decompose the link structure into hub and authority features, then apply them as additional document features. Being practically attractive for its great simplicity, this paper aims to design an algorithm that exploits both the content and linkage information, by carrying out a joint factorization on both the linkage adjacency matrix and the document-term matrix, and derives a new representation for web pages in a low-dimensional factor space, without explicitly separating them as content, hub or authority factors. Further analysis can be performed based on the compact representation of web pages. In the experiments, the proposed method is compared with state-of-the-art methods and demonstrates an excellent accuracy in hypertext classification on the WebKB and Cora benchmarks.

References

CMU world wide knowledge base (WebKB) project. Available at http://www.cs.cmu.edu/?WebKB/.Google Scholar
D. Achlioptas, A. Fiat, A. R. Karlin, and F. McSherry. Web search via hub synthesis. In IEEE Symposium on Foundations of Computer Science, pages 500--509, 2001. Google ScholarDigital Library
S. Chakrabarti, B. E. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In L. M. Haas and A. Tiwary, editors, Proceedings of SIGMOD-98, ACM International Conference on Management of Data, pages 307--318, Seattle, US, 1998. ACM Press, New York, US. Google ScholarDigital Library
C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/?cjlin/libsvm. Google ScholarDigital Library
D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. Proc. ICML 2000. pp.167--174., 2000. Google ScholarDigital Library
D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 430--436. MIT Press, 2001.Google Scholar
C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273, 1995. Google ScholarDigital Library
S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391--407, 1990.Google ScholarCross Ref
X. He, H. Zha, C. Ding, and H. Simon. Web document clustering using hyperlink structures. Computational Statistics and Data Analysis, 41(1):19--45, 2002. Google ScholarDigital Library
T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999. Google ScholarDigital Library
T. Joachims, N. Cristianini, and J. Shawe-Taylor. Composite kernels for hypertext categorisation. In C. Brodley and A. Danyluk, editors, Proceedings of ICML-01, 18th International Conference on Machine Learning, pages 250--257, Williams College, US, 2001. Morgan Kaufmann Publishers, San Francisco, US. Google ScholarDigital Library
J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 48:604--632, 1999. Google ScholarDigital Library
P. Kolari, T. Finin, and A. Joshi. SVMs for the Blogosphere: Blog Identification and Splog Detection. In AAAI Spring Symposium on Computational Approaches to Analysing Weblogs, March 2006.Google Scholar
O. Kurland and L. Lee. Pagerank without hyperlinks: structural re-ranking using links induced by language models. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 306--313, New York, NY, USA, 2005. ACM Press. Google ScholarDigital Library
A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the contruction of internet portals with machine learning. Information Retrieval Journal, 3(127--163), 2000. Google ScholarDigital Library
H.-J. Oh, S. H. Myaeng, and M.-H. Lee. A practical hypertext catergorization method using links and incrementally available class information. In SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pages 264--271, New York, NY, USA, 2000. ACM Press. Google ScholarDigital Library
L. Page, S. Brin, R. Motowani, and T. Winograd. PageRank citation ranking: bring order to the web. Stanford Digital Library working paper 1997--0072, 1997.Google Scholar
C. Spearman. "General Intelligence," objectively determined and measured. The American Journal of Psychology, 15(2):201--292, Apr 1904.Google ScholarCross Ref
B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Proceedings of 18th International UAI Conference, 2002. Google ScholarDigital Library
W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 267--273. ACM Press, 2003. Google ScholarDigital Library
Y. Yang, S. Slattery, and R. Ghani. A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 18(2-3):219--241, 2002. Google ScholarDigital Library
K. Yu, S. Yu, and V. Tresp. Multi-label informed latent semantic indexing. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 258--265, New York, NY, USA, 2005. ACM Press. Google ScholarDigital Library
T. Zhang, A. Popescul, and B. Dom. Linear prediction models with graph regularization for web-page categorization. In KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 821--826, New York, NY, USA, 2006. ACM Press. Google ScholarDigital Library
D. Zhou, J. Huang, and B. Schölkopf. Learning from labeled and unlabeled data on a directed graph. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005. Google ScholarDigital Library
D. Zhou, B. Schölkopf, and T. Hofmann. Semi-supervised learning on directed graphs. Proc. Neural Info. Processing Systems, 2004.Google Scholar

Index Terms

Combining content and link for classification using matrix factorization
1. Information systems
  1. Information retrieval

Recommendations

Co-manifold Matrix Factorization
ICCPR '20: Proceedings of the 2020 9th International Conference on Computing and Pattern Recognition

Matrix factorization plays a fundamental role in collaborative filtering. In collaborative filtering setting, the rating matrix R is very sparse. Thus, infinite number of matrices can fit the observed entries in the rating matrix. Without additional ...
Read More
Two Purposes for Matrix Factorization: A Historical Appraisal

Matrix factorization in numerical linear algebra (NLA) typically serves the purpose of restating some given problem in such a way that it can be solved more readily; for example, one major application is in the solution of a linear system of equations. ...
Read More
A Fast Randomized Algorithm for Computing a Hierarchically Semiseparable Representation of a Matrix

Randomized sampling has recently been proven a highly efficient technique for computing approximate factorizations of matrices that have low numerical rank. This paper describes an extension of such techniques to a wider class of matrices that are not ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
July 2007
946 pages
ISBN:9781595935977
DOI:10.1145/1277741
General Chairs:
Wessel Kraaij
TNO, The Netherlands
,
Arjen P. de Vries
CWI, The Netherlands
,
Program Chairs:
Charles L. A. Clarke
University of Waterloo, Canada
,
Norbert Fuhr
University of Duisburg-Essen, Germany
,
Noriko Kando
National Institute of Informatics, Japan
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 23 July 2007
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
analysis
factor
link structure
matrix factorization
text content
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate792of3,983submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 184
  Total Citations
  View Citations
- 1,901
  Total Downloads
- Downloads (Last 12 months)19
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Combining content and link for classification using matrix factorization

SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval

ABSTRACT

References

Cited By

Index Terms

Recommendations

Co-manifold Matrix Factorization

Two Purposes for Matrix Factorization: A Historical Appraisal

A Fast Randomized Algorithm for Computing a Hierarchically Semiseparable Representation of a Matrix