skip to main content
10.1145/1277741.1277825acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
Article

Combining content and link for classification using matrix factorization

Authors Info & Claims
Published:23 July 2007Publication History

ABSTRACT

The world wide web contains rich textual contents that areinterconnected via complex hyperlinks. This huge database violates the assumption held by most of conventional statistical methods that each web page is considered as an independent and identical sample. It is thus difficult to apply traditional mining or learning methods for solving web mining problems, e.g., web page classification, by exploiting both the content and the link structure. The research in this direction has recently received considerable attention but are still in an early stage. Though a few methods exploit both the link structure or the content information, some of them combine the only authority information with the content information, and the others first decompose the link structure into hub and authority features, then apply them as additional document features. Being practically attractive for its great simplicity, this paper aims to design an algorithm that exploits both the content and linkage information, by carrying out a joint factorization on both the linkage adjacency matrix and the document-term matrix, and derives a new representation for web pages in a low-dimensional factor space, without explicitly separating them as content, hub or authority factors. Further analysis can be performed based on the compact representation of web pages. In the experiments, the proposed method is compared with state-of-the-art methods and demonstrates an excellent accuracy in hypertext classification on the WebKB and Cora benchmarks.

References

  1. CMU world wide knowledge base (WebKB) project. Available at http://www.cs.cmu.edu/?WebKB/.Google ScholarGoogle Scholar
  2. D. Achlioptas, A. Fiat, A. R. Karlin, and F. McSherry. Web search via hub synthesis. In IEEE Symposium on Foundations of Computer Science, pages 500--509, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. S. Chakrabarti, B. E. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In L. M. Haas and A. Tiwary, editors, Proceedings of SIGMOD-98, ACM International Conference on Management of Data, pages 307--318, Seattle, US, 1998. ACM Press, New York, US. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/?cjlin/libsvm. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. Proc. ICML 2000. pp.167--174., 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 430--436. MIT Press, 2001.Google ScholarGoogle Scholar
  7. C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391--407, 1990.Google ScholarGoogle ScholarCross RefCross Ref
  9. X. He, H. Zha, C. Ding, and H. Simon. Web document clustering using hyperlink structures. Computational Statistics and Data Analysis, 41(1):19--45, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. T. Joachims, N. Cristianini, and J. Shawe-Taylor. Composite kernels for hypertext categorisation. In C. Brodley and A. Danyluk, editors, Proceedings of ICML-01, 18th International Conference on Machine Learning, pages 250--257, Williams College, US, 2001. Morgan Kaufmann Publishers, San Francisco, US. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 48:604--632, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. P. Kolari, T. Finin, and A. Joshi. SVMs for the Blogosphere: Blog Identification and Splog Detection. In AAAI Spring Symposium on Computational Approaches to Analysing Weblogs, March 2006.Google ScholarGoogle Scholar
  14. O. Kurland and L. Lee. Pagerank without hyperlinks: structural re-ranking using links induced by language models. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 306--313, New York, NY, USA, 2005. ACM Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the contruction of internet portals with machine learning. Information Retrieval Journal, 3(127--163), 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. H.-J. Oh, S. H. Myaeng, and M.-H. Lee. A practical hypertext catergorization method using links and incrementally available class information. In SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pages 264--271, New York, NY, USA, 2000. ACM Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. L. Page, S. Brin, R. Motowani, and T. Winograd. PageRank citation ranking: bring order to the web. Stanford Digital Library working paper 1997--0072, 1997.Google ScholarGoogle Scholar
  18. C. Spearman. "General Intelligence," objectively determined and measured. The American Journal of Psychology, 15(2):201--292, Apr 1904.Google ScholarGoogle ScholarCross RefCross Ref
  19. B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Proceedings of 18th International UAI Conference, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 267--273. ACM Press, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Y. Yang, S. Slattery, and R. Ghani. A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 18(2-3):219--241, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. K. Yu, S. Yu, and V. Tresp. Multi-label informed latent semantic indexing. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 258--265, New York, NY, USA, 2005. ACM Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. T. Zhang, A. Popescul, and B. Dom. Linear prediction models with graph regularization for web-page categorization. In KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 821--826, New York, NY, USA, 2006. ACM Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. D. Zhou, J. Huang, and B. Schölkopf. Learning from labeled and unlabeled data on a directed graph. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. D. Zhou, B. Schölkopf, and T. Hofmann. Semi-supervised learning on directed graphs. Proc. Neural Info. Processing Systems, 2004.Google ScholarGoogle Scholar

Index Terms

  1. Combining content and link for classification using matrix factorization

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
      July 2007
      946 pages
      ISBN:9781595935977
      DOI:10.1145/1277741

      Copyright © 2007 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 23 July 2007

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • Article

      Acceptance Rates

      Overall Acceptance Rate792of3,983submissions,20%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader