
A framework of constraint preserving update schemes for optimization on Stiefel manifold

  • Full Length Paper
  • Series A
  • Published in: Mathematical Programming

Abstract

This paper considers optimization problems on the Stiefel manifold \(X^{\mathsf{T}}X=I_p\), where \(X\in \mathbb {R}^{n \times p}\) is the variable and \(I_p\) is the \(p\)-by-\(p\) identity matrix. A framework of constraint preserving update schemes is proposed by decomposing each feasible point into the range space of \(X\) and the null space of \(X^{\mathsf{T}}\). While this general framework unifies many existing schemes, it also yields a new update scheme with low computational cost. We then study a feasible Barzilai–Borwein-like method under the new update scheme. The global convergence of the method is established with an adaptive nonmonotone line search. Numerical tests on the nearest low-rank correlation matrix problem, Kohn–Sham total energy minimization and a specific problem from statistics demonstrate the efficiency of the new method. In particular, the new method performs remarkably well on the nearest low-rank correlation matrix problem in terms of speed and solution quality, and is highly competitive with the widely used SCF iteration for Kohn–Sham total energy minimization.


Fig. 1

Notes

  1. Manton [35] used the SVD of \(X - \tau D\) to obtain the projection, and the cost is higher than that of the polar decomposition.

  2. It can be downloaded from http://www.math.nus.edu.sg/~matsundf/#Codes.

  3. It can be downloaded from http://optman.blogs.rice.edu/.

  4. Dr. Qingna Li kindly provided us with this matrix.

References

  1. Abrudan, T., Eriksson, J., Koivunen, V.: Steepest descent algorithms for optimization under unitary matrix constraint. IEEE Trans. Signal Process. 56(3), 1134–1147 (2008)

  2. Abrudan, T., Eriksson, J., Koivunen, V.: Conjugate gradient algorithm for optimization under unitary matrix constraint. Sig. Process. 89(9), 1704–1714 (2009)

  3. Absil, P.A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2008)

  4. Absil, P.A., Malick, J.: Projection-like retractions on matrix manifolds. SIAM J. Optim. 22(1), 135–158 (2012)

  5. Andreani, R., Birgin, E.G., Martínez, J.M., Schuverdt, M.L.: On augmented Lagrangian methods with general lower-level constraints. SIAM J. Optim. 18(4), 1286–1309 (2007)

  6. Balogh, J., Csendes, T., Rapcsák, T.: Some global optimization problems on Stiefel manifolds. J. Global Optim. 30(1), 91–101 (2004)

  7. Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8(1), 141–148 (1988)

  8. Birgin, E.G., Martínez, J.M., Raydan, M.: Nonmonotone spectral projected gradient methods on convex sets. SIAM J. Optim. 10(4), 1196–1211 (2000)

  9. Bolla, M., Michaletzky, G., Tusnády, G., Ziermann, M.: Extrema of sums of heterogeneous quadratic forms. Linear Algebra Appl. 269(1), 331–365 (1998)

  10. Borsdorf, R.: An Algorithm for Finding the Optimal Embedding of a Symmetric Matrix Into the Set of Diagonal Matrices. Technical report, University of Manchester (2012)

  11. Dai, Y.H., Fletcher, R.: Projected Barzilai–Borwein methods for large-scale box-constrained quadratic programming. Numer. Math. 100(1), 21–47 (2005)

  12. Dai, Y.H., Fletcher, R.: New algorithms for singly linearly constrained quadratic programs subject to lower and upper bounds. Math. Program. 106(3), 403–421 (2006)

  13. Dai, Y.H., Hager, W.W., Schittkowski, K., Zhang, H.C.: The cyclic Barzilai–Borwein method for unconstrained optimization. IMA J. Numer. Anal. 26(3), 604–627 (2006)

  14. Dai, Y.H., Liao, L.Z.: R-linear convergence of the Barzilai and Borwein gradient method. IMA J. Numer. Anal. 22, 1–10 (2002)

  15. Dai, Y.H., Zhang, H.C.: Adaptive two-point stepsize gradient algorithm. Numer. Algorithms 27(4), 377–385 (2001)

  16. d’Aspremont, A., Ei Ghaoui, L., Jordan, M.I., Lanckriet, G.R.: A direct formulation for sparse PCA using semidefinite programming. SIAM Rev. 49(3), 434–448 (2007)

  17. Davis, T., Hu, Y.F.: University of Florida sparse matrix collection. Technical report, University of Florida (2009)

  18. Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl. 20(2), 303–353 (1998)

  19. Eldén, L., Park, H.: A Procrustes problem on the Stiefel manifold. Numer. Math. 82(4), 599–619 (1999)

  20. Fletcher, R.: On the Barzilai–Borwein method. In: Qi, L.Q., Teo, K., Yang, X.Q. (eds.) Optimization and Control with Applications, Applied Optimization, vol. 96, pp. 235–256. Springer, US (2005)

  21. Flury, B.: Common Principal Components and Related Multivariate Models. Wiley, New York (1988)

  22. Gao, Y., Sun, D.F.: A majorized penalty approach for calibrating rank constrained correlation matrix problems. Technical report, National University of Singapore (2010)

  23. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. Johns Hopkins University Press, Maryland (1996)

  24. Grippo, L., Lampariello, F., Lucidi, S.: A nonmonotone line search technique for Newton’s method. SIAM J. Numer. Anal. 23, 707–716 (1986)

  25. Grubišić, I., Pietersz, R.: Efficient rank reduction of correlation matrices. Linear Algebra Appl. 422(2), 629–653 (2007)

  26. Jiang, B., Dai, Y.H.: A framework of constraint preserving update schemes for optimization on Stiefel manifold. Technical report, Institute of Computational Mathematics and Scientific/Engineering Computing, Academy of Mathematics and Systems Sciences, Chinese Academy of Sciences (2012)

  27. Jiang, B., Dai, Y.H.: Feasible Barzilai–Borwein-like methods for extreme symmetric eigenvalue problems. Optim. Methods Softw. 28(4), 756–784 (2013)

  28. Jiang, K.F., Sun, D.F., Toh, K.C.: An inexact accelerated proximal gradient method for large scale linearly constrained convex SDP. SIAM J. Optim. 22(3), 1042–1064 (2012)

  29. Joho, M., Mathis, H.: Joint diagonalization of correlation matrices by using gradient methods with application to blind signal separation. In: IEEE Proceedings of the Sensor Array and Multichannel Signal Processing Workshop, 2002, pp. 273–277 (2002)

  30. Journée, M., Nesterov, Y., Richtárik, P., Sepulchre, R.: Generalized power method for sparse principal component analysis. J. Mach. Learn. Res. 11, 517–553 (2010)

  31. Lai, R., Wen, Z., Yin, W., Gu, X., Lui, L.M.: Folding-free global conformal mapping for genus-0 surfaces by harmonic energy minimization. J. Sci. Comput. 58(3), 705–725 (2014)

  32. Li, L., Toh, K.C.: An inexact interior point method for L1-regularized sparse covariance selection. Math. Program. Comput. 2, 291–315 (2010)

  33. Li, Q.N., Qi, H.D.: A sequential semismooth Newton method for the nearest low-rank correlation matrix problem. SIAM J. Optim. 21(4), 1–26 (2011)

  34. Liu, Y.F., Dai, Y.H., Luo, Z.Q.: On the complexity of leakage interference minimization for interference alignment. In: 2011 IEEE 12th International Workshop on Signal Processing Advances in Wireless Communications, pp. 471–475 (2011)

  35. Manton, J.H.: Optimization algorithms exploiting unitary constraints. IEEE Trans. Signal Process. 50(3), 635–650 (2002)

  36. Nishimori, Y., Akaho, S.: Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold. Neurocomputing 67, 106–135 (2005)

  37. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer Series in Operations Research and Financial Engineering. Springer, New York (2006)

  38. Peters, S.W., Heath, R.W.: Interference alignment via alternating minimization. In: Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2445–2448. IEEE Computer Society (2009)

  39. Pietersz, R., Groenen, P.J.: Rank reduction of correlation matrices by majorization. Quant. Financ. 4(6), 649–662 (2004)

  40. Qi, H.D., Sun, D.F.: A quadratically convergent Newton method for computing the nearest correlation matrix. SIAM J. Matrix Anal. Appl. 28(2), 360–385 (2007)

  41. Rapcsák, T.: On minimization of sums of heterogeneous quadratic functions on Stiefel manifolds. In: Migdalas, A., Pardalos, P.M., Värbrand, P. (eds.) From Local to Global Optimization, vol. 53, pp. 277–290. Kluwer Academic Publishers (2001)

  42. Rapcsák, T.: On minimization on Stiefel manifolds. Eur. J. Oper. Res. 143(2), 365–376 (2002)

  43. Raydan, M.: On the Barzilai and Borwein choice of steplength for the gradient method. IMA J. Numer. Anal. 13, 321–326 (1993)

  44. Raydan, M.: The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM J. Optim. 7(1), 26–33 (1997)

  45. Rebonato, R., Jäckel, P.: The most general methodology to create a valid correlation matrix for risk management and option pricing purposes. J. Risk 2, 17–27 (1999)

  46. Saad, Y.: Numerical Methods for Large Eigenvalue Problems. Manchester University Press, Manchester (1992)

  47. Savas, B., Lim, L.H.: Quasi-Newton methods on Grassmannians and multilinear approximations of tensors. SIAM J. Sci. Comput. 32(6), 3352–3393 (2010)

  48. Schönemann, P.H.: A generalized solution of the orthogonal Procrustes problem. Psychometrika 31(1), 1–10 (1966)

  49. Stiefel, E.: Richtungsfelder und fernparallelismus in n-dimensionalen mannigfaltigkeiten. Comment. Math. Helv. 8(1), 305–353 (1935)

  50. Sun, W.Y., Yuan, Y.X.: Optimization Theory and Methods, Springer Optimization and Its Applications, vol. 1. Springer, New York (2006)

  51. Theis, F., Cason, T., Absil, P.A.: Soft dimension reduction for ICA by joint diagonalization on the Stiefel manifold. In: Adali, T., Jutten, C., Romano, J.M.T., Barros, A. (eds.) Independent Component Analysis and Signal Separation. Lecture Notes in Computer Science, vol. 5441, pp. 354–361. Springer, Berlin (2009)

  52. Toint, P.L.: Non-monotone trust-region algorithms for nonlinear optimization subject to convex constraints. Math. Program. 77(3), 69–94 (1997)

  53. Wen, Z., Yin, W.: A feasible method for optimization with orthogonality constraints. Math. Program. 142(1–2), 397–434 (2013)

  54. Yang, C., Meza, J.C., Lee, B., Wang, L.W.: KSSOLV—a Matlab toolbox for solving the Kohn–Sham equations. ACM Trans. Math. Softw. 36, 1–35 (2009)

  55. Yang, C., Meza, J.C., Wang, L.W.: A constrained optimization algorithm for total energy minimization in electronic structure calculations. J. Comput. Phys. 217(2), 709–721 (2006)

  56. Yang, C., Meza, J.C., Wang, L.W.: A trust region direct constrained minimization algorithm for the Kohn–Sham equation. SIAM J. Sci. Comput. 29(5), 1854–1875 (2007)

  57. Zhang, H.C., Hager, W.W.: A nonmonotone line search technique and its application to unconstrained optimization. SIAM J. Optim. 14(4), 1043–1056 (2004)

  58. Zhang, L.H., Liao, L.Z.: An alternating variable method for the maximal correlation problem. J. Global Optim. 54(1), 199–218 (2012)

  59. Zhou, B., Gao, L., Dai, Y.H.: Gradient methods with adaptive step-sizes. Comput. Optim. Appl. 35(1), 69–86 (2006)

  60. Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. J. Comput. Graph. Stat. 15(2), 265–286 (2006)

Acknowledgments

We are very grateful to Zaiwen Wen for useful discussions on the optimization on the Stiefel manifold and Qingna Li for discussions on the nearest low-rank correlation matrix problem. We thank Zaiwen Wen and Wotao Yin for sharing their code FOptM, Houduo Qi and Qingna Li for sharing their code SemiNewton, Defeng Sun and Yan Gao for sharing their code PenCorr. We also thank Kaifeng Jiang, Defeng Sun and Kim-Chuan Toh for sharing the gene correlation matrices for the nearest correlation matrix problem. Many thanks are also due to the associate editor, two anonymous referees and the editor, Prof. Alexander Shapiro, whose suggestions and comments greatly improved the quality of this paper.

Author information

Correspondence to Yu-Hong Dai.

Additional information

This work was partly supported by the National Key Basic Research Program of China (Grant No. 2015CB856000), the China National Funds for Distinguished Young Scientists (no. 11125107), the Chinese NSF grants (nos. 10831106 and 81211130105) and the CAS grant (no. kjcx-yw-s7-03).

Appendices

Appendix 1: Details of approaches I and II in Sect. 3

1.1 Details of approach I

Using the conditions \(Z(\tau ) \equiv I_p\) and \(\widetilde{R}'(0) = - X^{\mathsf{T}} E\), and denoting \({\mathcal {A}}(\tau ) = \left[ \begin{array}{c} \widetilde{R}(\tau ) \\ W \widetilde{N}(\tau ) \end{array}\right] \), it follows from (3.4) and (3.6) that

$$\begin{aligned} {\mathcal {A}}'(0) = \left[ \begin{array}{cc} -X^{\mathsf{T}}E &{}~B \\ W &{} ~F \end{array}\right] \left[ \begin{array}{c} I_p \\ 0\end{array}\right] \ \mathrm {and} \ {\mathcal {A}}(0) = \left[ \begin{array}{c} I_p \\ 0\end{array}\right] , \end{aligned}$$

where \(B \in {\mathbb {R}}^{p\times n}\), \(F \in {\mathbb {R}}^{n \times n}\). Solving the above ordinary differential equation, we get that

$$\begin{aligned} {\mathcal {A}}(\tau ) = \exp \left( \tau \left[ \begin{array}{cc} -X^{\mathsf{T}}E &{}~B \\ W &{} ~F \end{array}\right] \right) \left[ \begin{array}{c} I_p \\ 0\end{array}\right] . \end{aligned}$$
(8.1)

Since \(Z(\tau ) \equiv I_p\), we know from (3.3) and the definition of \({\mathcal {A}}(\tau )\) that

$$\begin{aligned} Y(\tau ) = \big [\begin{array}{l} X, \ I_n\end{array} \big ] {\mathcal {A}}(\tau ), \end{aligned}$$

and \({\mathcal {A}}(\tau )^{\mathsf{T}} {\mathcal {A}}(\tau ) = I_p\). It follows that the matrix inside the brackets of (8.1) must be skew-symmetric. This means that \(B = -W^{\mathsf{T}}\), \(F + F^{\mathsf{T}} = 0\) and \(X^{\mathsf{T}}E+E^{\mathsf{T}}X=0\) (i.e., \(E \in {\mathcal {T}}_X\)). Let \(W = QR\) be the unique QR factorization of \(W\) and assume that \(F\) takes the form \(F = Q{\widetilde{F}}Q^{\mathsf{T}}\), where \({\widetilde{F}}\in {\mathbb {R}}^{p\times p}\) is any skew-symmetric matrix. Then we can write

$$\begin{aligned} Y(\tau ) =&\big [\begin{array}{l} X, \ I_n\end{array} \big ] \exp \left( \tau \left[ \begin{array}{cc} I_p &{}~0 \\ 0 &{} ~Q \end{array}\right] \left[ \begin{array}{cc} -X^{\mathsf{T}}E &{}~-R^{\mathsf{T}} \\ R &{} ~{\widetilde{F}}\end{array}\right] \left[ \begin{array}{cc} I_p &{}~0 \\ 0 &{} ~Q^{\mathsf{T}} \end{array}\right] \right) \left[ \begin{array}{c} I_p \\ 0\end{array}\right] \nonumber \\ =&\big [\begin{array}{l} X, \ Q \end{array} \big ] \exp \left( \tau \left[ \begin{array}{cc} -X^{\mathsf{T}}E &{}~-R^{\mathsf{T}} \\ R &{} ~{\widetilde{F}}\end{array}\right] \right) \left[ \begin{array}{c} I_p \\ 0\end{array}\right] , \end{aligned}$$
(8.2)

where the second equality holds because \(Q\) has orthonormal columns. The update scheme (8.2) can be regarded as a generalized geodesic update scheme, since with \({\widetilde{F}}= 0\) it reduces to the geodesic update scheme (2.1).
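As a sanity check, one can verify numerically that (8.2) stays on the manifold. The following NumPy sketch is our own illustration (not the authors' code): it builds \(E\) so that \(X^{\mathsf{T}}E\) is skew-symmetric, takes \(W\) in the null space of \(X^{\mathsf{T}}\) and a random skew-symmetric \({\widetilde{F}}\), and evaluates (8.2) with a small eigendecomposition-based matrix exponential.

```python
import numpy as np

def expm(S):
    # Matrix exponential via eigendecomposition; adequate here since a
    # skew-symmetric matrix is normal and hence diagonalizable.
    w, V = np.linalg.eig(S.astype(complex))
    return (V @ np.diag(np.exp(w)) @ np.linalg.inv(V)).real

rng = np.random.default_rng(0)
n, p, tau = 9, 3, 0.2
X, _ = np.linalg.qr(rng.standard_normal((n, p)))         # X^T X = I_p
G = rng.standard_normal((n, p))
E = G - X @ (G.T @ X)                                    # X^T E skew-symmetric
W = (np.eye(n) - X @ X.T) @ rng.standard_normal((n, p))  # X^T W = 0
Q, R = np.linalg.qr(W)                                   # QR factorization of W
F = rng.standard_normal((p, p))
F_tilde = F - F.T                                        # skew-symmetric p x p
S = np.block([[-X.T @ E, -R.T], [R, F_tilde]])           # skew-symmetric 2p x 2p
Y = np.hstack([X, Q]) @ expm(tau * S)[:, :p]             # update scheme (8.2)
print(np.allclose(Y.T @ Y, np.eye(p)))                   # feasibility preserved
```

Since the middle matrix is skew-symmetric, its exponential is orthogonal, which is exactly why the scheme is constraint preserving.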

1.2 Details of approach II

Based on the choices that \(\widetilde{R}(\tau ) = I_p + \tau \widetilde{R}'(0)\) and \(\widetilde{N}(\tau ) = \tau I_p\), we can get \(Z(\tau )\) from (3.3) by the polar decomposition or the Cholesky factorization.

If the polar decomposition is used, \(Z(\tau )\) and \(Z'(\tau )\) are always symmetric. In this case, it is easy to show that \(Y(\tau ; X) = \mathcal {P}_{{\mathcal {S}}_{n,p}}{{\big (X \widetilde{R}(\tau ) + W \widetilde{N}(\tau ) \big )}}\), which with (3.6), (3.7) and (3.8) further implies that

$$\begin{aligned} Y(\tau ; X) = \mathcal {P}_{{\mathcal {S}}_{n,p}}\big (X - \tau E - \tau X Z'(0) \big ), \end{aligned}$$
(8.3)

where \(Z'(0)\) is any \(p\)-by-\(p\) symmetric matrix. If \(Z'(0) = 0\), (8.3) reduces to the ordinary polar decomposition or Manton’s projection update scheme. If \(Z'(0) = \text{ sym }(X^{\mathsf{T}}G)\) and \(E = G - X \text{ sym }(X^{\mathsf{T}}G),\) it becomes the ordinary gradient projection update scheme.
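To make (8.3) concrete, here is a minimal NumPy sketch (our own illustration, with \(Z'(0) = 0\)) of the polar-decomposition update: the projection \(\mathcal {P}_{{\mathcal {S}}_{n,p}}(A)\) equals \(UV^{\mathsf{T}}\), where \(A = U\Sigma V^{\mathsf{T}}\) is the thin SVD.

```python
import numpy as np

def polar_update(X, E, tau):
    # Update scheme (8.3) with Z'(0) = 0: project X - tau*E back onto
    # the Stiefel manifold via the polar decomposition (thin SVD).
    U, _, Vt = np.linalg.svd(X - tau * E, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(1)
n, p = 8, 3
X, _ = np.linalg.qr(rng.standard_normal((n, p)))  # feasible start
Y = polar_update(X, rng.standard_normal((n, p)), 0.1)
print(np.allclose(Y.T @ Y, np.eye(p)))            # Y^T Y = I_p
```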

If the Cholesky factorization is used, \(Z(\tau )\) and \(Z'(\tau )\) are always upper triangular. Similarly, we can derive

$$\begin{aligned} Y(\tau ; X) = \text{ qr }\big (X - \tau E - \tau X Z'(0)\big ), \end{aligned}$$
(8.4)

where \(Z'(0)\) is any \(p\)-by-\(p\) upper triangular matrix. When \(Z'(0) = 0\), (8.4) reduces to the ordinary QR factorization update scheme.
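Similarly, (8.4) with \(Z'(0) = 0\) can be sketched in NumPy (again our own illustration); the column signs are normalized so that \(R\) has a positive diagonal, which makes the QR factor unique:

```python
import numpy as np

def qr_update(X, E, tau):
    # Update scheme (8.4) with Z'(0) = 0: the Q factor of the QR
    # factorization of X - tau*E, normalized so that diag(R) > 0.
    Q, R = np.linalg.qr(X - tau * E)
    s = np.sign(np.diag(R))
    s[s == 0] = 1.0
    return Q * s  # flip columns whose diagonal entry of R is negative

rng = np.random.default_rng(2)
n, p = 8, 3
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
Y = qr_update(X, rng.standard_normal((n, p)), 0.1)
print(np.allclose(Y.T @ Y, np.eye(p)))
```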

Appendix 2: Proof of Lemma 3.1

The fact that \(X^{\mathsf{T}}D_{\rho }\) is skew-symmetric implies that \(J\) is invertible, thus \(Y(\tau )\) is well-defined. (i) follows from the construction of the update scheme (3.11).

Meanwhile, we know that \(Y'(0) = -D_{\rho }\), which with the chain rule shows that

$$\begin{aligned} {\mathcal {F}}'_{\tau }(Y(0)) = - \langle G, D_{\rho }\rangle = -\langle \nabla {\mathcal {F}}, P_X D_{\rho }\rangle , \end{aligned}$$

where the second equality is due to \(D_{\rho }\in {\mathcal {T}}_X\) and the definition (2.8) of \(\nabla {\mathcal {F}}\). Recall that \(P_X = I_n - \frac{1}{2} XX^{\mathsf{T}}\). Substituting (3.12) into the above equation yields

$$\begin{aligned} {\mathcal {F}}'_{\tau }(Y(0))= -\langle \nabla \mathcal {F}, (I_n + (\rho -1 )XX^{\mathsf{T}})\nabla \mathcal {F}\rangle \le - \min \{\rho ,1\} \Vert \nabla \mathcal {F}\Vert _{\mathsf{F}}^2. \end{aligned}$$

So (ii) is true.

We prove (iii) by contradiction. Left-multiplying both sides of the expression of \(Y(\tau )\) in (3.11) by \(X^{\mathsf{T}}\) and using \(X^{\mathsf{T}}W = 0\), we get that \(2J^{-1} - I_p = X^{\mathsf{T}}Y\). Assume that there exists a \(p\)-by-\(p\) orthogonal matrix \(Q_p\) such that \(Y = XQ_p\). Then we have \(2J^{-1} - I_p = Q_p\); i.e., \(2I_p - J = Q_pJ\). It follows from \((2I_p - J)^{\mathsf{T}} (2I_p-J) = (Q_pJ)^{\mathsf{T}}Q_pJ\) and \(Q_p^\mathsf{T}Q_p = I_p\) that

$$\begin{aligned} J^{\mathsf{T}} + J = 2I_p, \end{aligned}$$

which contradicts the definition of \(J\). This contradiction proves (iii).

For (iv), it is obvious that \(\Vert J\Vert _2 \le 1 + \frac{\tau ^2}{4}\Vert W\Vert _{\mathsf{F}}^2 + \frac{\tau }{2}\Vert X^{\mathsf{T}}D_{\rho }\Vert _{\mathsf{F}}\), where \(\Vert \cdot \Vert _2\) is the spectral norm. It follows from \(-D_{\rho }= W - XX^{\mathsf{T}} D_{\rho }\) and \(X^{\mathsf{T}} W=0\) that \(\Vert D_{\rho }\Vert _{\mathsf{F}}^2 = \Vert W\Vert _{\mathsf{F}}^2 + \Vert X^{\mathsf{T}}D_{\rho }\Vert _{\mathsf{F}}^2.\) Notice that \(\upsilon = \tau \Vert D_{\rho }\Vert _{\mathsf{F}}\). Then we have

$$\begin{aligned} \Vert J\Vert _2&\le 1 + \frac{\upsilon ^2}{4} \cdot \frac{\Vert W\Vert _{\mathsf{F}}^2}{\Vert D_{\rho }\Vert _{\mathsf{F}}^2 } + \frac{\upsilon }{2} \cdot \frac{\Vert X^{\mathsf{T}} D_{\rho }\Vert _{\mathsf{F}}^{}}{\Vert D_{\rho }\Vert _{\mathsf{F}}^{}}\\&\le 1 + \max _{0 \le t \le 1}\left( \frac{\upsilon ^2}{4} (1 - t^2) + \frac{\upsilon }{2} t\right) \le (5 + \upsilon ^2)/4. \end{aligned}$$

On the other hand, the relation \(2J^{-1} = X^{\mathsf{T}} Y + I_p\) indicates that \(\Vert J^{-1}\Vert _2 \le 1\). Thus (iv) is true.

In the case that \(p=1\) or \(p=n\), it is not difficult to simplify the corresponding update schemes, and we omit the details here. This completes the proof.
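The feasibility claim in (i) is easy to test numerically. In the sketch below (our own illustration, with the direction \(D\) constructed so that \(X^{\mathsf{T}}D\) is skew-symmetric), \(W\), \(J\) and \(Y(\tau )\) are formed exactly as in (3.11):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, tau = 10, 4, 0.3
X, _ = np.linalg.qr(rng.standard_normal((n, p)))   # X^T X = I_p
G = rng.standard_normal((n, p))
D = G - X @ (G.T @ X)                              # X^T D is skew-symmetric
W = -(np.eye(n) - X @ X.T) @ D                     # W = -(I - X X^T) D
J = np.eye(p) + tau**2 / 4 * (W.T @ W) + tau / 2 * (X.T @ D)
Y = (2 * X + tau * W) @ np.linalg.inv(J) - X       # update scheme (3.11)
print(np.allclose(Y.T @ Y, np.eye(p)))             # the curve stays feasible
```

The check succeeds because \(X^{\mathsf{T}}W = 0\) and \(J + J^{\mathsf{T}} = 2I_p + \frac{\tau ^2}{2}W^{\mathsf{T}}W\), so \(Y^{\mathsf{T}}Y = I_p\) follows by direct expansion.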

Appendix 3: Proof of Lemma 3.2

The relation \(\langle G, D^{(i)} \rangle = e_i^{\mathsf{T}}\left( G^{\mathsf{T}} \nabla {\mathcal {F}}\right) e_i\) can be easily verified from the definition of \(D^{(i)}\). With the definition of \(D^{(q)}\) and \(X^{\mathsf{T}} X = I_p\), we have that

$$\begin{aligned} {{X^{\mathsf{T}} D^{(q)} = X^{\mathsf{T}} G_{(q)}^{} e_q^{\mathsf{T}} - e_q G_{(q)}^{\mathsf{T}}X = \frac{2}{\tau }\left( be_q^{\mathsf{T}} - e_q b^{\mathsf{T}}\right) \in {\mathcal {T}}_X}}. \end{aligned}$$

Thus \(Y(\tau )\) is well-defined. Moreover, we have that

$$\begin{aligned} W = -(I_n - XX^{\mathsf{T}}) D^{(q)} = -(I_n - XX^{\mathsf{T}}) G_{(q)}^{}e_q^{\mathsf{T}}. \end{aligned}$$

Hence, the matrix \(J = I_p + \frac{\tau ^2}{4} W^{\mathsf{T}}W + \frac{\tau }{2}X^{\mathsf{T}} D^{(q)}\) can be expressed as

$$\begin{aligned} J = I_p + \xi e_q^{}e_q^{\mathsf{T}} + b e_q^{\mathsf{T}} - e_q b^{\mathsf{T}} = {{I_p + \big [\begin{array}{cc} e_q,&\ b \end{array} \big ] \left( \left[ \begin{array}{cc} \xi &{}\ -1 \\ 1 &{} \ 0\end{array} \right] \left[ \begin{array}{c}e_q^{\mathsf{T}}\\ b^{\mathsf{T}}\end{array}\right] \right) }}, \end{aligned}$$
(10.1)

where \(\xi = \frac{\tau ^2}{4}G_{(q)}^{\mathsf{T}}(I_n - XX^{\mathsf{T}}) G_{(q)}\). By the formulations of \(b\) and \(\xi \), we easily see that

$$\begin{aligned} {{e_q^{\mathsf{T}} b = 0, \ \xi =\alpha \, - \, b^{\mathsf{T}}b.}} \end{aligned}$$

Applying the SMW formula to (10.1) and using the above relations, we can obtain (3.15).

Moreover, notice that \(\nabla {\mathcal {F}}= \sum _{i=1}^{p} D^{(i)}\). Thus we have

$$\begin{aligned} p\, {\mathcal {F}}'_{\tau }(Y(0)) = -p\, \langle G, D^{(q)} \rangle \le - \langle G, \nabla {\mathcal {F}}\rangle \le -\frac{1}{2}\Vert \nabla {\mathcal {F}}\Vert _{\mathsf{F}}^2, \end{aligned}$$

where the first inequality follows from the choice of \(q\) in (3.14) and the second one is due to (ii) of Lemma 3.1. The proof is completed.

Appendix 4: Proof of Proposition 3.1

Before going into the details of the proof, we recall the following fact on the inverse of a \(2 \times 2\) block matrix.

Fact 11.1

Assume that \(T = \left[ \begin{array}{cc} T_{11} &{} ~T_{12}\\ T_{21} &{}~T_{22} \end{array} \right] \), where \(T_{22}^{}\) and \(T_{11}^{} - T_{12}^{} T_{22}^{-1}T_{21}^{}\) are invertible. Then \(T\) is invertible and

$$\begin{aligned} T^{-1} =\left[ \begin{array}{ll} ( T_{11}^{} - T_{12}^{} T_{22}^{-1}T_{21}^{})^{-1} &{} - ( T_{11}^{} - T_{12}^{} T_{22}^{-1}T_{21}^{})^{-1} T_{12}^{}T_{22}^{-1} \\ -T_{22}^{-1}T_{21}^{} ( T_{11}^{} - T_{12}^{} T_{22}^{-1}T_{21}^{})^{-1} &{}~ ~ T_{22}^{-1}T_{21}^{} ( T_{11}^{} - T_{12}^{} T_{22}^{-1}T_{21}^{})^{-1} T_{12}^{}T_{22}^{-1} + T_{22}^{-1}\end{array} \right] . \end{aligned}$$

First, we show that the update scheme (2.3) is well-defined, provided that \(I_p+\frac{\tau }{4}X^{\mathsf{T}}D\) is invertible. Consider the update scheme (2.3) with \(U=[P_X D, \,X]\) and \(V=[X,\, -P_XD]\). It follows from \(P_X = I_n - \frac{1}{2}XX^{\mathsf{T}}\) that \(X^{\mathsf{T}}P_X D = \frac{1}{2} X^{\mathsf{T}}D\). Combining this with \(X^{\mathsf{T}}X = I_p\) and \(X^{\mathsf{T}}D\) being skew-symmetric, we can rewrite

$$\begin{aligned} I_{2p}+\frac{\tau }{2}V^{\mathsf{T}}U=\left[ \begin{array}{cc}I_p+\frac{\tau }{4}X^{\mathsf{T}}D &{} \frac{\tau }{2}I_p\\ -\frac{\tau }{2}D^{\mathsf{T}}P_X^{\mathsf{T}}P_X^{}D &{}~ I_p +\frac{\tau }{4}X^{\mathsf{T}}D\end{array}\right] . \end{aligned}$$
(11.1)

Moreover, we derive that

$$\begin{aligned} {{W^{\mathsf{T}}W = D^{\mathsf{T}}(I_n - XX^{\mathsf{T}})D = D^{\mathsf{T}}P_X^{\mathsf{T}} P_X^{} D + \frac{1}{4} \left( X^{\mathsf{T}}D\right) ^2,}} \end{aligned}$$

which with (3.11) implies that

$$\begin{aligned} J = \Big (I_p + \frac{\tau }{4}X^{\mathsf{T}} D \Big )^2 + \frac{\tau ^2}{4}D^{\mathsf{T}}P_X^{\mathsf{T}}P_X^{} D. \end{aligned}$$

Plugging this into (11.1) yields

$$\begin{aligned} I_{2p}+\frac{\tau }{2}V^{\mathsf{T}}U:= \left[ \begin{array}{cc}T_{11} &{} ~T_{12} \\ T_{21} &{}~ T_{22} \end{array}\right] =\left[ \begin{array}{cc}I_p+\frac{\tau }{4}X^{\mathsf{T}}D &{} \frac{\tau }{2}I_p\\ \frac{2}{\tau } \left( \left( I_p +\frac{\tau }{4}X^{\mathsf{T}}D\right) ^2 - J\right) &{}~ I_p +\frac{\tau }{4}X^{\mathsf{T}}D\end{array}\right] , \end{aligned}$$
(11.2)

where \(T_{ij} \in {\mathbb {R}}^{p\times p}\) \((i,j \in \{1,2\})\). By simple calculations, we know that

$$\begin{aligned} {{T_{11}^{} - T_{12}^{} T_{22}^{-1} T_{21}^{} = \left( I_p + \frac{\tau }{4}X^{\mathsf{T}}D\right) ^{-1} J}}. \end{aligned}$$

Thus it follows from Fact 11.1 that \(I_{2p}+\frac{\tau }{2}V^{\mathsf{T}}U\) is invertible and the update scheme (2.3) is well-defined.

We now prove the equivalence. With Fact 11.1 and (11.2), some tedious manipulations yield that \((I_{2p}+\frac{\tau }{2}V^{\mathsf{T}}U)^{-1}=\left[ \begin{array}{cc}M_{11} &{} ~M_{12}\\ M_{21} &{} ~M_{22}\end{array}\right] \), where

$$\begin{aligned}\begin{array}{ll} M_{11} = J^{-1}(I_p + \frac{\tau }{4}X^{\mathsf{T}} D), &{} ~ M_{12} = -\frac{\tau }{2}J^{-1}, \\ M_{21} = \frac{2}{\tau }\big (I_p- {{M_{22}}}(I_p+\frac{\tau }{4}X^{\mathsf{T}}D)\big ), &{}~ M_{22}=(I_p+\frac{\tau }{4}X^{\mathsf{T}}D)J^{-1}. \end{array} \end{aligned}$$

Direct calculations show that

$$\begin{aligned} M_{11}+ \frac{1}{2}M_{12}X^{\mathsf{T}}D = J^{-1}, \quad \tau \Big (M_{21} + \frac{1}{2} M_{22}X^{\mathsf{T}}D\Big ) = 2I_p - 2\Big (I_p+ \frac{\tau }{4}X^{\mathsf{T}}D\Big )J^{-1}. \end{aligned}$$
(11.3)

Finally, we can obtain

$$\begin{aligned} Y_{\mathrm {wy}}(\tau ) ={}&X - \tau \left[ \begin{array}{l}P_XD,\ X\end{array}\right] \Bigg (\left[ \begin{array}{cc}M_{11} &{} ~M_{12}\\ M_{21} &{} ~M_{22}\end{array}\right] \left[ \begin{array}{c} I_p\\ \frac{1}{2}X^{\mathsf{T}}D \end{array}\right] \Bigg ) \nonumber \\ ={}&X-\tau \Big (\frac{1}{2} XX^{\mathsf{T}}D -W\Big )\Big (M_{11}+ \frac{1}{2}M_{12}X^{\mathsf{T}}D\Big ) -\tau X\Big (M_{21}+\frac{1}{2}M_{22}X^{\mathsf{T}}D\Big )\nonumber \\ ={}&(2X + \tau W)J^{-1}- X, \end{aligned}$$

where the first equality uses (2.3) and \(X^{\mathsf{T}}X = I_p\), the second and third equalities are due to \(P_XD = \frac{1}{2} XX^{\mathsf{T}}D -W\) and (11.3), respectively. The proof is completed.

Appendix 5: Some details of Table 1

We first review the computational costs of some basic matrix operations. Given \(A \in {\mathbb {R}}^{n_1 \times n_2}\) and \(B \in {\mathbb {R}}^{n_2 \times n_2}\), computing \(AB\) needs \(2n_1^{}n_2^2\) flops while computing \(A^{\mathsf{T}} A\) only needs \(n_1^{}n_2^2\) flops. Computing \(\text{ qr }(A)\) by the modified Gram–Schmidt algorithm needs \(2n_1^{}n_2^2\) flops (here \(n_1\ge n_2\)). If \(B\) is symmetric, computing the eigenvalue decomposition \(B = P\Sigma P^{\mathsf{T}}\) with orthogonal \( P\in {\mathbb {R}}^{n_2 \times n_2}\) and diagonal \(\Sigma \in {\mathbb {R}}^{n_2 \times n_2}\) by the symmetric QR algorithm needs \(9n_2^3\) flops. If \(B\) is skew-symmetric, computing the exponential of \(B\) needs \(10n_2^3\) flops. If \(B\) is symmetric positive definite, computing \(B^{1/2}\) needs \(10n_2^3\) flops. In addition, if \(B\) is nonsingular, solving the matrix equation \(B T = A^{\mathsf{T}}\) by Gaussian elimination with partial pivoting needs \(2n_1^{}n_2^2 +2n_2^3/3\) flops. It follows that computing \(B^{-1}\) needs \(8n_2^3/3\) flops. We refer interested readers to [23] for more details.

To verify the computational costs of the corresponding update schemes in Table 1, we take the new update scheme (3.11) and the Wen–Yin update scheme (2.3) as examples. Notice that, compared with \(O(np^2)\) or \(O(n^3)\), the \(O(p^2)\) terms are omitted since \(1\le p \le n\). For simplicity, we only consider the case that \(1< p <n\). In the case that \(p=1\) or \(p=n\), the costs for the update schemes (3.11) and (2.3) can be obtained by a similar analysis, and we omit the details here.

To analyze the computational cost of (3.11), note that \(E = D_{\rho }\) and hence \(W = -G + XX^{\mathsf{T}}G\), so the feasible curve can be rewritten as

$$\begin{aligned} Y(\tau ; X) = \left( X\Big (\frac{2}{\tau }I_p + X^{\mathsf{T}}G\Big ) - G\right) {\left( \frac{J(\tau )}{\tau }\right) ^{-1}} - X, \end{aligned}$$

and

$$\begin{aligned} J(\tau ) = I_p + \frac{\tau ^2}{4} (G^{\mathsf{T}}G - G^{\mathsf{T}}XX^{\mathsf{T}}G) + {{\rho \tau }}(X^{\mathsf{T}}G - G^{\mathsf{T}}X). \end{aligned}$$

Forming \(J(\tau )\) needs \(3np^2 + p^3\) flops, which involves computing \(X^{\mathsf{T}}G\), \(G^{\mathsf{T}}G\) and \((G^{\mathsf{T}}X )(X^{\mathsf{T}}G )\). The final assembly of \(Y(\tau )\) involves solving one matrix equation and performing one matrix multiplication and two matrix subtractions, and hence needs \(4np^2 + 2np + 2p^3/3\) flops. Therefore, the total cost of forming \(Y(\tau ; X)\) in (3.11) for any \(\tau \) is \(7np^2 + 2np + 5p^3/3\) flops. During backtracking line searches, we need to update \(Y(\tau ; X)\) for a different \(\tau \). Denote the first trial and the new trial stepsizes by \(\tau _{{{\mathrm {first}}}}\) and \(\tau _{{{\mathrm {new}}}}\), respectively. It is easy to see that

$$\begin{aligned} Y(\tau _{{{\mathrm {new}}}}; X) = \left( X\Big (\frac{2}{\tau _{{{\mathrm {first}}}}}I_p + X^{\mathsf{T}}G\Big ) - G + \Big (\frac{2}{\tau _{{{\mathrm {new}}}}} -\frac{2}{\tau _{{{\mathrm {first}}}}}\Big )X\right) {{\left( \frac{J(\tau _{{{\mathrm {new}}}})}{\tau _{\mathrm {new}}}\right) ^{-1}}} - X. \end{aligned}$$

As \(X\big (\frac{2}{\tau _{{{\mathrm {first}}}}}I_p + X^{\mathsf{T}}G\big ) - G\) is already computed in \(Y(\tau _{{{\mathrm {first}}}}; X)\), we store this matrix and hence need only \(2np^2 + 3np + 2p^3/3\) flops to compute \(Y(\tau _{{{\mathrm {new}}}}; X).\)

To analyze the computational cost of the Wen–Yin update scheme (2.3), we see that it takes \(3np^2\) and \(2np^2 + np\) flops to form \(I_{2p} + \frac{\tau }{2} V^{\mathsf{T}} U\) and \(P_X D_{\rho }\), respectively. The final assembly of \(Y(\tau )\) needs \(4np^2 + np + 40p^3/3\) flops. Hence, the total cost of the Wen–Yin update scheme is \(9np^2 + 2np + 40p^3/3\) flops. When \(\rho = 1/2,\) we see from [53] that \(U = [G, X]\) and \(V = [X, -G]\) in (2.3), so \(P_XD_{1/2}\) no longer needs to be computed, and the total cost of the Wen–Yin update scheme can be reduced to \(7np^2 + np + 40p^3/3\) flops. The cost of updating \(Y(\tau )\) with a new \(\tau \) is \(4np^2 + np + 40p^3/3\) flops, since \(V^{\mathsf{T}}U\) can be stored and the main work involves solving a matrix equation and the final assembly.
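For a rough feel of these totals, the two flop counts can be compared directly; the small helper below (illustrative arithmetic only) evaluates exactly the totals derived above for the first function evaluation with general \(\rho \):

```python
def cost_new(n, p):
    # Total cost of forming Y(tau; X) by the new update scheme (3.11).
    return 7 * n * p**2 + 2 * n * p + 5 * p**3 / 3

def cost_wen_yin(n, p):
    # Total cost of the Wen-Yin update scheme (2.3) with general rho.
    return 9 * n * p**2 + 2 * n * p + 40 * p**3 / 3

for n, p in [(1000, 10), (1000, 100), (2000, 500)]:
    print(n, p, cost_new(n, p) / cost_wen_yin(n, p))
```

Since the difference is \(-2np^2 - 35p^3/3 < 0\), the new scheme is cheaper for every \(1< p < n\), and the ratio approaches \(7/9\) when \(p \ll n\).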

About this article

Cite this article

Jiang, B., Dai, YH. A framework of constraint preserving update schemes for optimization on Stiefel manifold. Math. Program. 153, 535–575 (2015). https://doi.org/10.1007/s10107-014-0816-7
