ABSTRACT
A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Logistic Regression, Conditional Random Fields (CRFs), and Lasso, among others. This paper describes the theory and implementation of a highly scalable and modular convex solver that handles all of these estimation problems. It can be parallelized on a cluster of workstations, allows for data locality, and can deal with regularizers such as ℓ1 and ℓ2 penalties. At present, our solver implements 20 different estimation problems, can be easily extended, scales to millions of observations, and is up to 10 times faster than specialized solvers for many applications. The open source code is freely available as part of the ELEFANT toolbox.
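To make the modular design concrete, here is a minimal sketch of the decomposition the abstract describes: a regularized risk J(w) = λΩ(w) + R_emp(w) whose loss and regularizer are interchangeable components. All names below (hinge_loss, l1_reg, solve, ...) are illustrative assumptions, not ELEFANT's actual API, and plain subgradient descent stands in for the paper's solver. Swapping the hinge loss for the logistic loss, or the ℓ2 penalty for ℓ1, recovers a linear SVM, logistic regression, or a Lasso-style sparse estimator from the same loop.

```python
import numpy as np

def hinge_loss(w, X, y):
    """Average hinge loss and a subgradient (linear SVM risk)."""
    margins = 1.0 - y * (X @ w)
    active = margins > 0
    loss = np.maximum(margins, 0.0).mean()
    grad = -(X[active] * y[active, None]).sum(axis=0) / len(y)
    return loss, grad

def logistic_loss(w, X, y):
    """Average logistic loss and gradient (logistic regression risk)."""
    z = -y * (X @ w)                      # z_i = -y_i <w, x_i>
    loss = np.logaddexp(0.0, z).mean()    # log(1 + exp(z_i)), numerically stable
    p = 1.0 / (1.0 + np.exp(-z))          # sigmoid(z_i)
    grad = -(X * (y * p)[:, None]).mean(axis=0)
    return loss, grad

def l2_reg(w):
    """Omega(w) = 0.5 ||w||^2 and its gradient."""
    return 0.5 * (w @ w), w

def l1_reg(w):
    """Omega(w) = ||w||_1 and a subgradient."""
    return np.abs(w).sum(), np.sign(w)

def solve(X, y, loss, reg, lam=0.1, steps=500):
    """Subgradient descent on J(w) = lam * Omega(w) + R_emp(w).
    A stand-in for the paper's actual solver; any convex loss/regularizer
    pair exposing (value, subgradient) plugs in unchanged."""
    w = np.zeros(X.shape[1])
    for t in range(1, steps + 1):
        _, r_grad = loss(w, X, y)
        _, o_grad = reg(w)
        w -= (1.0 / (lam * t)) * (lam * o_grad + r_grad)
    return w

# Example: the same solver yields an SVM or a sparse logistic model.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = np.sign(X @ np.array([1.0, -2.0, 0.0, 0.0, 0.5]) + 0.1 * rng.normal(size=200))
    w_svm = solve(X, y, hinge_loss, l2_reg)        # linear SVM
    w_sparse = solve(X, y, logistic_loss, l1_reg)  # l1-regularized logistic regression
    print(w_svm, w_sparse)
```

The point of the design is that `solve` never inspects which loss or regularizer it was given; each estimation problem is just a new (loss, regularizer) pair, which is how a single solver can cover SVMs, CRFs, Lasso, and the rest.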