
A constrained singular value decomposition method that integrates sparsity and orthogonality

Abstract

We propose a new sparsification method for the singular value decomposition—called the constrained singular value decomposition (CSVD)—that can incorporate multiple constraints such as sparsification and orthogonality for the left and right singular vectors. The CSVD can combine different constraints because it implements each constraint as a projection onto a convex set, and because it integrates these constraints as projections onto the intersection of multiple convex sets. We show that, with appropriate sparsification constants, the algorithm is guaranteed to converge to a stable point. We also propose and analyze the convergence of an efficient algorithm for the specific case of the projection onto the balls defined by the norms L1 and L2. We illustrate the CSVD and compare it to the standard singular value decomposition and to a non-orthogonal related sparsification method with: 1) a simulated example, 2) a small set of face images (corresponding to a configuration with a number of variables much larger than the number of observations), and 3) a psychometric application with a large number of observations and a small number of variables. The companion R-package, csvd, which implements the algorithms described in this paper, along with reproducible examples, is available for download from https://github.com/vguillemot/csvd.

Introduction

The singular value decomposition (SVD) [1–3]—the tool “par excellence” of multivariate statistics—constitutes the core of many multivariate methods such as, to name but a few, principal component analysis [4], canonical correlation analysis [5], multiple correspondence analysis [6], and partial least squares methods [7]. To analyze data tables whose rows typically correspond to observations and columns to variables, these statistical methods use the SVD to generate orthogonal optimal linear combinations of the variables—called components or factor scores—that extract the most important information in the original data. In most cases, only the components that explain the largest proportion of the data variance are kept for further investigation. The coefficients—called loadings—of the linear combination used to compute a component are also often used to understand or “interpret” the corresponding components, and this interpretation is greatly facilitated (particularly when the number of variables is large) when, for a given component, only a few variables have large loadings and the other variables have negligible loadings. If this pattern does not naturally hold, several procedures can be used to select the variables that are important for a component. The early psychometric school, for example, would use rotations of the components in a low-dimensional subspace, such as VARIMAX [8]; whereas recent approaches favor computationally based methods such as bootstrap ratios [7], or select important variables using an explicit non-linear optimization method such as the LASSO [9]. Closely related to the current work, in the specific case of principal component analysis (for an extensive review of sparsification for PCA see [10]), Witten et al. (see Section 3.2 in [11]) propose to implement either an orthogonality constraint or a sparsity constraint (but not both simultaneously; see also, for related ideas, [12, 13]). Along the same lines, Benidis et al. [14] recently proposed an algorithm, based on a Procrustes approach, for sparse principal component analysis that includes an orthogonality constraint on the loadings.

Unfortunately, in the more general case where the sparsity and orthogonality constraints are simultaneously active on both the left and right pseudo-singular vectors, the components obtained from the LASSO and its derivatives are not orthogonal, and this often makes their interpretation difficult. To palliate this problem, we present and illustrate a new LASSO-like sparsification method for the SVD, called the constrained singular value decomposition (CSVD), that incorporates orthogonality constraints on both the rows and the columns of a matrix.

1 Notations

Matrices are denoted by uppercase bold letters (e.g., X), vectors by lowercase bold (e.g., x), and their elements by lowercase italic (e.g., xi,j). Matrices, vectors, and elements from the same matrix all use the same letter (e.g., A, a, a). The transpose operation is denoted by the superscript ⊤, and the inverse of a square matrix A is denoted by A−1. The identity matrix is denoted I, vectors or matrices of ones are denoted by 1, and matrices or vectors of zeros are denoted by 0 (when multiplied with or added to other matrices, the matrices I, 1, and 0 are assumed to be conformable; in case of doubt their size is specified). When provided with a square matrix, the diag(⋅) operator returns a vector containing the diagonal elements of the matrix; when provided with a vector, the diag(⋅) operator returns a diagonal matrix whose diagonal elements are the elements of the vector. When provided with a square matrix, the trace(⋅) operator gives the sum of its diagonal elements. The L2 norm of a vector x, denoted ‖x‖2, is defined as ‖x‖2 = √(∑n xn²); the L1 norm of a vector x, denoted ‖x‖1, is defined as ‖x‖1 = ∑n |xn|. A vector x is normalized by dividing it by its L2 norm (and so a normalized vector has an L2 norm equal to 1). The Frobenius norm of a matrix X, denoted ‖X‖F, is defined as ‖X‖F = √(∑i,j xi,j²). The Frobenius inner product of two rectangular matrices A and B of the same dimensions, denoted ⟨A, B⟩F, is defined as ⟨A, B⟩F = trace(A⊤B). The concatenation of an I by J matrix X and an I by 1 vector y is the I by J + 1 matrix, denoted [X, y], obtained by the juxtaposition of y on the right side of matrix X. The orthogonal complement of the space linearly spanned by the columns of a rectangular matrix M is denoted M⊥. Two rectangular matrices A and B of the same dimensions are said to be orthogonal if and only if ⟨A, B⟩F = 0.

With x a real scalar and γ a non-negative real number, the scalar soft-thresholding function, denoted s(x, γ), is defined as

s(x, γ) = sign(x) × max(|x| − γ, 0), (1)

and the vector soft-thresholding function, denoted S(x, γ), is defined as the element-wise application of s:

S(x, γ) = [s(x1, γ), …, s(xN, γ)]⊤. (2)

This function shrinks all the components of x toward 0 and sets the smallest components (in absolute value) to 0.
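
For concreteness, here is a minimal NumPy sketch of these two operators (the function names are ours, not part of the csvd package):

import numpy as np

def soft_threshold_scalar(x, gamma):
    """Scalar soft-thresholding s(x, gamma) = sign(x) * max(|x| - gamma, 0)."""
    return np.sign(x) * max(abs(x) - gamma, 0.0)

def soft_threshold(x, gamma):
    """Vector soft-thresholding S(x, gamma): apply s(., gamma) elementwise."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

# Shrinks every component toward 0 and zeroes out the smallest ones.
print(soft_threshold([3.0, -1.5, 0.2, -0.05], gamma=0.5))
# [ 2.5 -1.   0.  -0. ]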

The projection of a vector x onto a set 𝒞 is denoted by proj(x, 𝒞). The L2-ball of radius ρ, denoted B2(ρ), is defined as

B2(ρ) = {x : ‖x‖2 ≤ ρ}, (3)

and the L1-ball of radius ρ, denoted B1(ρ), is defined as

B1(ρ) = {x : ‖x‖1 ≤ ρ}. (4)

Singular values are denoted δ, eigenvalues are denoted λ = δ².

2 Unconstrained singular value decomposition

The SVD of an I × J data matrix X of rank L ≤ min(I, J) gives the solution of the following problem: find a least-squares optimal, rank R (with R ≤ L) approximation of X, denoted X̂[R]. Specifically, the SVD solves the following optimization problem [1, 2, 15]:

X̂[R] = argmin over X̂ ∈ ℳR of ‖X − X̂‖F², (5)

where ℳR is the set of all real I × J matrices of rank R.

Recall that the SVD decomposes X as

X = PΔQ⊤, (6)

where P⊤P = Q⊤Q = I and Δ = diag(δ) with δ1 ≥ δ2 ≥ ⋯ ≥ δL > 0, and L is the rank of X. The matrix P (respectively Q) contains the left (respectively right) singular vectors of X, and the diagonal matrix Δ contains the singular values of X. If pℓ (respectively qℓ) denotes the ℓ-th column of P (respectively Q), and δℓ the ℓ-th element of δ, then, for any R ≤ L, the optimal matrix X̂[R] is obtained as:

X̂[R] = ∑ℓ=1,…,R δℓ pℓ qℓ⊤, (7)

with pℓ⊤pℓ = qℓ⊤qℓ = 1, and pℓ⊤pℓ′ = qℓ⊤qℓ′ = 0, ∀ ℓ ≠ ℓ′.

A classic, albeit non-optimal and potentially numerically unstable, algorithm (described in Algorithm 1) to obtain the unconstrained singular value decomposition of X is based on the “power iteration method.” This algorithm—originally developed for the eigen-decomposition of a square matrix—provides the first singular triplet, which comprises the first singular value and the first left and right singular vectors. In order to ensure orthogonality between successive singular vectors, the first rank one approximation of X, computed as

X̂[1] = δ1 p1 q1⊤, (8)

is subtracted from X. This procedure—called deflation (see Appendix A)—gives a new matrix

X(2) = X − X̂[1] = X − δ1 p1 q1⊤, (9)

which is orthogonal to X̂[1]. The power iteration method is then applied to the deflated matrix X(2), giving a second rank one approximation denoted X̂[2]. The deflation is then applied to X(2) to give a new residual matrix X(3) orthogonal to X̂[2], and so on, until nothing is left to subtract because X has then been completely decomposed. This way, the optimization problem from Eq 5 can be re-expressed as:

{P, Δ, Q} = argmin ‖X − PΔQ⊤‖F², subject to P⊤P = Q⊤Q = I, with Δ diagonal. (10)

Algorithm 1: The power iteration method for the unconstrained SVD. The algorithm alternates between multiplying the data matrix by the current left and right vectors and normalizing the result. After convergence, the data matrix is deflated and the process is iterated.

Data: X, ε, R

Result: SVD of X

Define X(1) = X;

for ℓ = 1, …, R do

p(0) and q(0) are randomly initialized;

δ(0) = 0;

δ(1) = p(0)⊤ X(ℓ) q(0);

s = 0;

while |δ(s+1) − δ(s)| ≥ ε do

  p(s+1) ← normalize(X(ℓ) q(s));

  q(s+1) ← normalize(X(ℓ)⊤ p(s+1));

  δ(s+1) = p(s+1)⊤ X(ℓ) q(s+1);

  s ← s + 1;

end

X(ℓ+1) = X(ℓ) − δ(s) p(s) q(s)⊤;

end
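
A compact NumPy transcription of Algorithm 1 is given below as an illustrative sketch (the function name power_svd and the convergence handling are ours; in practice one would simply call numpy.linalg.svd):

import numpy as np

def power_svd(X, R, eps=1e-10, max_iter=1000, rng=None):
    """Rank-R SVD by power iteration with deflation (sketch of Algorithm 1)."""
    rng = np.random.default_rng(rng)
    Xl = X.astype(float).copy()             # X^(1)
    I, J = Xl.shape
    P, Q, deltas = [], [], []
    for _ in range(R):
        q = rng.standard_normal(J)
        q /= np.linalg.norm(q)
        delta_old = 0.0
        for _ in range(max_iter):
            p = Xl @ q
            p /= np.linalg.norm(p)          # normalize the left vector
            q = Xl.T @ p
            q /= np.linalg.norm(q)          # normalize the right vector
            delta = p @ Xl @ q
            if abs(delta - delta_old) < eps:
                break
            delta_old = delta
        P.append(p); Q.append(q); deltas.append(delta)
        Xl = Xl - delta * np.outer(p, q)    # deflation: X^(l+1) = X^(l) - delta p q^T
    return np.column_stack(P), np.array(deltas), np.column_stack(Q)

# Example: the singular values match those returned by numpy's SVD.
X = np.random.default_rng(0).standard_normal((20, 12))
P, d, Q = power_svd(X, R=3)
print(np.round(d, 6), np.round(np.linalg.svd(X, compute_uv=False)[:3], 6))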

As an alternative to the deflation approach used in Algorithm 1, the deflation step can be eliminated and the orthogonality constraint integrated directly into the power iteration algorithm by replacing the normalization steps with the projection of the result of the current iteration onto the intersection of the L2 ball and the space orthogonal to the previously found left or right singular vectors (see Algorithm 2). This projection onto the intersection of these two sets can be implemented in a number of ways [16]; we chose here to use the projection onto convex sets algorithm (POCS; see, e.g., [17, Page 101])—an iterative algorithm that is easy to implement and that generalizes to the projection onto the intersection of more than two convex sets (see Algorithm 3). Recall that a vector x is projected onto the L2 ball by normalizing it (when it lies outside the ball), and that a vector x is projected onto V⊥ by multiplying it by I − VV⊤ (where V has orthonormal columns). In POCS, these two projection steps are iterated until convergence.

Algorithm 2 An alternative algorithm of the power iteration for the unconstrained SVD: The deflation step is replaced by a projection onto the space orthogonal to the space spanned by the already computed lower rank version of the data matrix. Note that 0⊥ corresponds to the whole space (i.e., ℝI or ℝJ, depending on the side).

Data: X, ε, R

Result: SVD of X

Define P = 0;

Define Q = 0;

for ℓ = 1, …, R do

p(0) and q(0) are randomly initialized;

δ(0) ← 0;

δ(1) ← p(0)⊤ X q(0);

s ← 0;

while |δ(s+1) − δ(s)| ≥ ε do

  p(s+1) ← proj(X q(s), B2(1) ∩ P⊥);

  q(s+1) ← proj(X⊤ p(s+1), B2(1) ∩ Q⊥);

  δ(s+1) ← p(s+1)⊤ X q(s+1);

  s ← s + 1;

end

δℓ ← δ(s+1);

P ← [P, p(s+1)];

Q ← [Q, q(s+1)];

end

Algorithm 3 Projection onto the intersection of K convex sets (POCS).

Data: x, the convex sets 𝒞1, …, 𝒞K, ε

Result: Projection of x onto 𝒞1 ∩ ⋯ ∩ 𝒞K

Define x(0) = x;

s ← 0;

while ‖x(s+1) − x(s)‖ ≥ ε do

x(s+1) ← proj(x(s), 𝒞1);

for k = 2, …, K do

  x(s+1) ← proj(x(s+1), 𝒞k);

end

s ← s + 1;

end
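
A minimal NumPy sketch of POCS for the two convex sets used in Algorithm 2 (the unit L2 ball and the orthogonal complement of the span of a matrix V with orthonormal columns); the function names are ours:

import numpy as np

def proj_l2_ball(x, radius=1.0):
    """Projection onto the L2 ball: rescale x only if it lies outside the ball."""
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def proj_orth_complement(x, V):
    """Projection onto span(V)^perp: multiply by (I - V V^T), V orthonormal."""
    return x - V @ (V.T @ x)

def pocs(x, projections, eps=1e-10, max_iter=1000):
    """Alternate the projections until the iterate stops moving (sketch of Algorithm 3)."""
    x = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        x_new = x
        for proj in projections:
            x_new = proj(x_new)
        if np.linalg.norm(x_new - x) < eps:
            return x_new
        x = x_new
    return x

# Example: project a vector onto B_2(1) intersected with span(V)^perp.
rng = np.random.default_rng(1)
V, _ = np.linalg.qr(rng.standard_normal((10, 2)))   # orthonormal basis of a 2-D subspace
x = 3.0 * rng.standard_normal(10)
y = pocs(x, [lambda z: proj_orth_complement(z, V), proj_l2_ball])
print(np.linalg.norm(y) <= 1 + 1e-9, np.allclose(V.T @ y, 0, atol=1e-8))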

Algorithm 1 is obviously faster than Algorithm 2 as implemented with Algorithm 3, because the orthogonality constraint in Algorithm 1 is enforced with a single operation (the deflation), whereas Algorithm 2 always requires several operations. However, the main benefit of Algorithm 2 is that it can easily be extended to include additional constraints, as illustrated below.

3 Constrained singular value decomposition

3.1 Previous work

Algorithm 4: The Algorithm of Witten et al. [11]: The penalized matrix decomposition (PMD) approach.

Data: X, ε, R

Result: PMD of X

Define X(1) = X;

for ℓ = 1, …, R do

p(0) and q(0) are randomly initialized;

δ(0) = 0;

δ(1) = p(0)⊤ X(ℓ) q(0);

s = 0;

while |δ(s+1) − δ(s)| ≥ ε do

  p(s+1) ← normalize(S(X(ℓ) q(s), λ1,ℓ)), with λ1,ℓ such that ‖p(s+1)‖1 = c1,ℓ;

  q(s+1) ← normalize(S(X(ℓ)⊤ p(s+1), λ2,ℓ)), with λ2,ℓ such that ‖q(s+1)‖1 = c2,ℓ;

  δ(s+1) = p(s+1)⊤ X(ℓ) q(s+1);

  s ← s + 1;

end

X(ℓ+1) = X(ℓ) − δ(s) p(s) q(s)⊤;

end

Recently, several authors have proposed sparse variants of the SVD (see, e.g., [4, 11, 15, 18, 19] for reviews), or, more specifically, of PCA [14, 20]. For most of these sparse variants, the sparsification is obtained by adding new constraints to Eq 10. For example, the penalized matrix decomposition (PMD) approach of Witten et al. [11] solves the following optimization problem for the first pair of left and right singular vectors:

argmax p⊤Xq, subject to ‖p‖2² ≤ 1, ‖q‖2² ≤ 1, C1(p) ≤ c1, and C2(q) ≤ c2, (11)

where C1 and C2 are convex penalty functions from ℝI (respectively ℝJ) to ℝ+ (such as, e.g., the LASSO or the fused LASSO constraints) and where c1 and c2 are positive constants. The PMD procedure is described in Algorithm 4. In PMD, the next pseudo-singular triplet is estimated by solving the same optimization problem with X replaced by a deflated matrix. In contrast to the unconstrained problem of Eq 10, however, the added constraints create a nonlinear optimization problem, and this makes the deflated matrices non-orthogonal to the previous rank one optimal matrix [21]. This lack of orthogonality makes the interpretation of the components somewhat difficult because the conclusions about one component involve all correlated components and because the same information is explained (to different degrees) by all correlated components. In the specific case of PCA, Witten et al. proposed, alternatively, to impose an orthogonality constraint on the left singular vectors, without the sparsity constraint, and to leave the sparsity constraint active only on the right singular vectors (i.e., the loadings). However, this procedure does not solve the problem of having both constraints simultaneously active on the left and right singular vectors (see Equation 3.17 and the subsequent algorithm in [11] for more details).

In order to eliminate the problems created by the non-orthogonality of the singular vectors, we present below a new optimal sparsification method, called the constrained singular value decomposition (CSVD) that implements orthogonality constraints on successive sparsified singular vectors.

3.2 Current work: The constrained SVD (CSVD)

The constrained SVD still decomposes the matrix X into singular values and singular vectors but, in addition, imposes on the singular vectors constraints that induce sparsity of the weights. Such sparsity-inducing constraints are common in fields where the data comprise large numbers of variables [22] (e.g., tens of thousands, as in genomics [23], to millions, as in neuroimaging [24, 25]). Although the theory of sparsity-inducing constraints is well documented, we are interested in a general formulation that can also accommodate several types of sparsification, as well as more sophisticated constraints.

Specifically, we consider the following optimization problem:

argmin ‖X − PΔQ⊤‖F², subject to P⊤P = Q⊤Q = I, C1(pℓ) ≤ c1,ℓ, and C2(qℓ) ≤ c2,ℓ for ℓ = 1, …, R, (12)

where C1 and C2 are convex penalty functions from ℝI (respectively ℝJ) to ℝ+ (which could be, e.g., the LASSO, group-LASSO, or fused LASSO constraints) and where the c1,ℓ and c2,ℓ are positive constants: smaller values of c1,ℓ and c2,ℓ lead to sparser solutions. Note, however (see, e.g., [11, 26], and Appendix B), that only some ranges of values of c1,ℓ and c2,ℓ will lead to solutions.

An equivalent, but more convenient, form of the problem described in Eq 12 can be derived by considering two orthogonal matrices P and Q, and a diagonal matrix Δ = diag{δ}, such that

‖X − PΔQ⊤‖F² = ‖X‖F² − 2 ∑ℓ=1,…,R δℓ pℓ⊤ X qℓ + ∑ℓ=1,…,R δℓ². (13)

The term ‖X‖F² is constant and does not depend on pℓ or qℓ, and so, for a given positive δℓ, the solutions of the minimization of Eq 13 are the same as the solutions of the maximization of pℓ⊤ X qℓ. In addition, the maximum is reached when δℓ = pℓ⊤ X qℓ (see, e.g., [11]). Consequently, minimizing Eq 13 is equivalent to maximizing each term of the sum ∑ℓ δℓ pℓ⊤ X qℓ. Therefore, Eq 12 is equivalent to the following 1-dimensional maximization problem for ℓ ≥ 1, given the previous vectors pℓ′ and qℓ′ for all 0 ≤ ℓ′ < ℓ:

argmax pℓ⊤ X qℓ, subject to pℓ⊤pℓ = qℓ⊤qℓ = 1, pℓ⊤pℓ′ = qℓ⊤qℓ′ = 0 for all 0 ≤ ℓ′ < ℓ, C1(pℓ) ≤ c1,ℓ, and C2(qℓ) ≤ c2,ℓ, (14)

where we set (for convenience) p0 and q0 to 0 (note that for the first pseudo-singular vectors the orthogonality constraint is not active, because all vectors are orthogonal to 0).
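
These two facts are easy to verify numerically with NumPy: at the optimum, δℓ equals pℓ⊤ X qℓ, and the Frobenius loss of the rank-R truncation equals ‖X‖F² minus the sum of the R largest squared singular values:

import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((8, 6))
P, d, Qt = np.linalg.svd(X, full_matrices=False)
R = 3

# delta_l = p_l' X q_l for each singular triplet
print(np.allclose(d, [P[:, l] @ X @ Qt[l, :] for l in range(len(d))]))

# ||X - X_R||_F^2 = ||X||_F^2 - sum of the R largest squared singular values
X_R = P[:, :R] @ np.diag(d[:R]) @ Qt[:R, :]
lhs = np.linalg.norm(X - X_R, "fro") ** 2
rhs = np.linalg.norm(X, "fro") ** 2 - np.sum(d[:R] ** 2)
print(np.isclose(lhs, rhs))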

To facilitate the resolution of this optimization problem, the unit-norm (equality) constraint on the L2 norm of the singular vectors (i.e., pℓ⊤pℓ = 1 and qℓ⊤qℓ = 1) needs to be relaxed and replaced by an equivalent inequality constraint. Specifically, the optimization problem in Eq 14 is reframed as

argmax pℓ⊤ X qℓ, subject to pℓ⊤pℓ ≤ 1, qℓ⊤qℓ ≤ 1, pℓ⊤pℓ′ = qℓ⊤qℓ′ = 0 for all 0 ≤ ℓ′ < ℓ, C1(pℓ) ≤ c1,ℓ, and C2(qℓ) ≤ c2,ℓ. (15)

Eq 15 defines a biconcave maximization problem with convex constraints. This problem can be solved by block relaxation [27]: an iterative algorithm that consists of a series of two-part iterations in which (Part 1) the expression in Eq 15 is maximized for pℓ with qℓ fixed, and is then (Part 2) maximized for qℓ with pℓ fixed. Part 1 of the iteration can be re-expressed as the following optimization problem:

argmin ‖X qℓ − pℓ‖2², subject to pℓ⊤pℓ ≤ 1, pℓ⊤pℓ′ = 0 for all 0 ≤ ℓ′ < ℓ, and C1(pℓ) ≤ c1,ℓ. (16)

Eq 16 shows that finding the optimal value for pℓ (i.e., Part 1 of the alternating procedure) is equivalent to finding the projection of the vector X qℓ onto the subset of ℝI defined by the intersection of all the convex sets implied by the constraints. During Part 2 of the alternating procedure, the vector pℓ is fixed, and Part 2 can therefore be expressed as the following minimization problem:

argmin ‖X⊤ pℓ − qℓ‖2², subject to qℓ⊤qℓ ≤ 1, qℓ⊤qℓ′ = 0 for all 0 ≤ ℓ′ < ℓ, and C2(qℓ) ≤ c2,ℓ. (17)

Eqs 16 and 17 replace the L1 and L2 constraints of the optimization problem expressed in Eq 11 by projections onto the intersection of the convex sets (POCS) defined by these constraints. Because the intersection of several convex sets is also a convex set [28], the block relaxation algorithm for Eq 15 essentially consists of applying, until convergence, the sequence of these two projections onto their respective convex sets. This strategy can obviously be extended to incorporate multiple additional constraints as long as these constraints define convex sets. As shown in Appendix D, the CSVD algorithm is guaranteed to converge to a stable point because it is a member of the more general class of block successive upper-bound minimization (BSUM) algorithms.

In the specific case of the projection onto the intersection of the L1 and L2 balls, POCS can be replaced by a fast and exact algorithm called PL1L2 ([26]; see Appendix C for details). This algorithm (see Algorithm 5) differs from the more general Algorithm 4 only by the specification of the projection method onto the L1 ball, which is implemented as a simple and fast algorithm based on the soft-thresholding operator.

Note that Algorithm 2—presented in Section 2 for the unconstrained SVD and using POCS for the projection onto an intersection of convex sets—can easily be generalized to incorporate a new sparsity constraint by simply applying POCS to the intersection of 3 convex sets (instead of just 2): the L2-ball of radius 1, an L1-ball, and the subspace orthogonal to the space spanned by the previously found left or right pseudo-singular vectors.

Algorithm 5: General algorithm for the Constrained Singular Value Decomposition. The projection onto the L1-ball can be replaced by another projection onto a convex set, making it possible to adapt this algorithm to other purposes.

Data: X, ε, R, c1,ℓ and c2,ℓ for ℓ in 1, …, R

Result: CSVD of X

Define P = 0;

Define Q = 0;

for ℓ = 1, …, R do

p(0) and q(0) are randomly initialized;

δ(0) ← 0;

δ(1) ← p(0)⊤ X q(0);

s ← 0;

while |δ(s+1) − δ(s)| ≥ ε do

  p(s+1) ← proj(X q(s), B2(1) ∩ B1(c1,ℓ) ∩ P⊥);

  q(s+1) ← proj(X⊤ p(s+1), B2(1) ∩ B1(c2,ℓ) ∩ Q⊥);

  δ(s+1) ← p(s+1)⊤ X q(s+1);

  s ← s + 1;

end

δℓ ← δ(s+1);

P ← [P, p(s+1)];

Q ← [Q, q(s+1)];

end
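
To make Algorithm 5 concrete, here is a NumPy sketch of one CSVD step for the case where the convex sets are the unit L2 ball, the L1 ball of radius c, and the orthogonal complement of the span of the previously retained vectors. The constraint projection uses POCS with a bisection-based L1-ball projection (the exact PL1L2 projection of Appendix C could be used instead), and all names are ours, not the csvd package's:

import numpy as np

def proj_l1_ball(x, c, tol=1e-9):
    """Euclidean projection onto {y : ||y||_1 <= c} by bisection on the
    soft-thresholding level (phi(lam) = ||S(x, lam)||_1 is decreasing in lam)."""
    if np.abs(x).sum() <= c:
        return x
    lo, hi = 0.0, np.abs(x).max()
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        if np.maximum(np.abs(x) - lam, 0.0).sum() > c:
            lo = lam
        else:
            hi = lam
    return np.sign(x) * np.maximum(np.abs(x) - hi, 0.0)

def proj_l2_ball(x):
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

def proj_orth(x, V):
    return x if V is None else x - V @ (V.T @ x)

def proj_constraints(x, c, V, sweeps=200, tol=1e-9):
    """POCS onto B_2(1) ∩ B_1(c) ∩ span(V)^perp (c must lie in [1, sqrt(len(x))])."""
    for _ in range(sweeps):
        x_new = proj_l2_ball(proj_l1_ball(proj_orth(x, V), c))
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

def csvd_step(X, c1, c2, P=None, Q=None, eps=1e-9, max_iter=500, rng=None):
    """One constrained pseudo-singular triplet (sketch of one loop of Algorithm 5).
    P and Q hold the previously computed left/right vectors as columns (or None)."""
    rng = np.random.default_rng(rng)
    q = rng.standard_normal(X.shape[1]); q /= np.linalg.norm(q)
    delta_old = 0.0
    for _ in range(max_iter):
        p = proj_constraints(X @ q, c1, P)
        q = proj_constraints(X.T @ p, c2, Q)
        delta = p @ X @ q
        if abs(delta - delta_old) < eps:
            break
        delta_old = delta
    return p, delta, q

Calling csvd_step repeatedly and appending each new p and q as columns of P and Q reproduces the outer loop of Algorithm 5.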

In the following sections, we illustrate—using simulated and real data—the effect and importance of the orthogonality constraint and show how this constraint improves the interpretability of the analysis.

4 Empirical comparative evaluation of the CSVD

In this section, we empirically evaluate and illustrate the constrained singular value decomposition (CSVD) and compare its performance to that of the plain SVD and of the closely related sparsification method of Witten et al. [11], the PMD method. To do so, we used: 1) some simulated datasets, 2) one simulated dataset mimicking a real dataset, and 3) one real dataset (the characteristics of these datasets are listed in Table 1).

Table 1. Characteristics of the various datasets used to assess the performance of the CSVD and related methods.

https://doi.org/10.1371/journal.pone.0211463.t001

With the first (simulated) dataset, we evaluated how the SVD, PMD, and the CSVD recover the ground truth for a relatively large dataset that has more variables than observations and contains a mixture of signal and Gaussian noise.

Datasets two and three were each chosen to illustrate a particular aspect of the data. The second dataset investigates the “small N, large P” configuration (many more variables than observations) and comprises six face images of 240 × 230 = 55,200 pixels each—with each pixel measuring light intensity on a scale going from 0 to 255. The third dataset illustrates the effects of sparsification on a dataset corresponding to a traditional psychometric problem. This simulated dataset was created to match the pattern of loadings of a real dataset collected from 2,100 participants who—as part of a larger project on memory—answered an online version of the “object-spatial imagery questionnaire” (OSIQ, [29])—a psychometric instrument measuring mental imagery for objects and places. Using a 5-point rating scale, participants rated their agreement with 30 items that should span a 2-dimensional space corresponding to the spatial and object imagery psychometric factors. This dataset is used to compare sparsification with the traditional psychometric approach, which relies on (Varimax) rotation to recover a two-dimensional factor structure.

For Datasets two and three, we applied three degrees of sparsity (low, medium, and high). As detailed in Appendix B, only some values of c1 and c2 will lead to solutions (specifically, c1 has to be chosen between 1 and √I, and c2 between 1 and √J). Table 2 lists the values chosen for c1 and c2 and their interpretation for PMD and the CSVD. Also, for technical reasons, the values of c1 and c2 corresponding to the maximum sparsity for P and Q are set, respectively, to 1 + ε1 and 1 + ε2 (instead of 1), with ε1 and ε2 being two small positive real numbers.

Table 2. The different values of the sparsity parameters for the CSVD and PMD.

https://doi.org/10.1371/journal.pone.0211463.t002

4.1 Simulated data

With these simulated data, we evaluate the ability of the CSVD to recover known singular triplets, their sparsity structure, and the orthogonality of the estimated left and right singular vectors. These simulated data were created by adding a matrix of Gaussian noise to a 150 by 600 matrix of rank 5 built from its SVD.

Specifically, the data matrix X was created as

X = PM ΔM QM⊤ + E, (18)

where

  • PM ΔM QM⊤ is the rank 5 ground-truth matrix, with:
    • ΔM a 5 × 5 diagonal matrix of singular values equal to ΔM = diag(δ) = diag([15, 14, 13, 12, 11]),
    • PM an I = 150 by 5 orthogonal matrix (i.e., 750 entries) with PM⊤PM = I,
    • QM a J = 600 by 5 orthogonal matrix (i.e., 3,000 entries) with QM⊤QM = I,
  • E an I = 150 by J = 600 matrix containing I × J = 90,000 independent realizations of a Gaussian variable with mean equal to 0 and standard deviation equal to 0.01.

Matrices PM and QM were designed to be both sparse and orthogonal. Specifically, matrix PM was generated with the block model of Eq 19, in which 0 denotes a 25 × 1 vector of 0s, and in which all the pℓ and p′ℓ are 25 × 1 vectors with norm equal to 1/√2 (so that each column of PM has unit norm) and such that pℓ⊤pℓ′ = 0 when ℓ ≠ ℓ′. A similar model (Eq 20) was used to generate QM, with 0 denoting a 120 × 1 vector of 0s, and with all the qℓ and q′ℓ being 120 × 1 vectors with norm equal to 1/√2 and such that qℓ⊤qℓ′ = 0 when ℓ ≠ ℓ′.

With the structure described in Eqs 19 and 20, the low-rank matrix to recover from X is composed of 2 parts: 1) a part common to all 5 components and with no sparsity (i.e., the part corresponding to the pℓ and qℓ vectors), and 2) a part specific to each component (i.e., the p′ℓ and q′ℓ vectors), the corresponding part of the other 4 components being sparse (i.e., filled with zeros). A sketch of this construction is given below.
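
Here is a NumPy sketch of one way to build a left matrix with this structure, shown for PM only (QM is built analogously with 120-element blocks); the block layout paraphrases the description above rather than reproducing Eq 19 verbatim, and all names are ours:

import numpy as np

rng = np.random.default_rng(3)
block, n_comp = 25, 5                       # 6 blocks of 25 rows = 150 rows

# Common (non-sparse) part: 5 mutually orthogonal 25 x 1 vectors of norm 1/sqrt(2).
common, _ = np.linalg.qr(rng.standard_normal((block, n_comp)))
common /= np.sqrt(2.0)

# Specific parts: one 25 x 1 vector of norm 1/sqrt(2) per component,
# placed in its own block of rows (zero everywhere else).
P_M = np.zeros((block * (n_comp + 1), n_comp))
P_M[:block, :] = common
for l in range(n_comp):
    v = rng.standard_normal(block)
    P_M[block * (l + 1): block * (l + 2), l] = v / (np.sqrt(2.0) * np.linalg.norm(v))

print(np.allclose(P_M.T @ P_M, np.eye(n_comp)))   # orthonormal columns
print((np.abs(P_M) > 1e-12).sum(axis=0))          # 50 non-zero entries per column

The simulated X of Eq 18 is then obtained as P_M diag([15, 14, 13, 12, 11]) Q_M⊤ plus Gaussian noise with standard deviation 0.01.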

Figs 1 and 2 show heatmaps of the true left-singular vectors (PM) and right-singular vectors (QM).

Fig 1. Simulated data.

Heatmap of the true left singular vectors PM (shown horizontally: each row corresponds to one singular vector).

https://doi.org/10.1371/journal.pone.0211463.g001

Fig 2. Simulated data.

Heatmap of the true right singular vectors QM (shown horizontally: each row corresponds to one singular vector).

https://doi.org/10.1371/journal.pone.0211463.g002

We analyzed X with the CSVD and PMD. For both methods, the L1 constraint was set to c1 = 5 for the left-singular vectors and c2 = 11 for the right-singular vectors, based on the sparsity of the ground truth.

We asked each method to return 7 vectors in order to evaluate whether the methods could recover the ground truth (i.e., the 5-dimensional subspace) and also how they would behave after this 5-dimensional subspace had been recovered. Fig 3 shows boxplots of the distribution of the squared difference between the estimated singular vectors and the ground truth for PMD and the CSVD: both methods correctly recover the singular vectors. Fig 4 shows the correlations between the first 7 estimated singular vectors and the ground truth: although the first 5 singular vectors are correctly estimated and roughly orthogonal to the previously computed singular vectors, the last 2 vectors estimated by PMD are correlated with some of the previously computed vectors. This demonstrates the failure of the standard deflation technique to impose the orthogonality constraint when a non-linear optimization method is used. In contrast to the PMD approach, the orthogonality constraint of the CSVD prevented this problem.

Fig 3. Simulated data.

Boxplots of the squared difference between the estimated singular vectors and the ground truth.

https://doi.org/10.1371/journal.pone.0211463.g003

Fig 4. Simulated data.

Heatmap of the correlations between the estimated left (P) and right (Q) singular vectors with the ground truth for the CSVD and PMD. Each cell of the heatmap represents the correlation between one of the 7 estimated (left or right) vectors with the 5 true vectors. Each heatmap contains 5 rows (the ground truth) and 7 columns (the estimated vectors). On the left, are the results obtained with the CSVD and on the right, the results obtained with PMD.

https://doi.org/10.1371/journal.pone.0211463.g004

Fig 5 displays the computational times of the CSVD and PMD as a function of the number of computed components: the CSVD is faster than PMD for the estimation of the first component, but this advantage diminishes when the number of computed components increases. This pattern indicates that the orthogonality constraint increases the computational time.

Fig 5. Simulated data.

Computational time of PMD and the CSVD when estimating one sparse singular triplet (left) and two sparse singular triplets (right).

https://doi.org/10.1371/journal.pone.0211463.g005

Table 3 contains the estimated singular values for the CSVD and PMD as well as the ground-truth singular values. The pseudo-singular values estimated by the two methods are comparable and behave similarly with respect to the ground truth and to the regular SVD: the first pseudo-singular values are slightly smaller than their ground-truth counterparts but, at some point—which varies depending on the imposed degree of sparsity—the number of non-zero estimated pseudo-singular values exceeds the true rank.

Table 3. Estimated singular values and ground truth.

In the “ground truth” column, a value of 0 indicates that the corresponding real singular value does not exist (i.e., because the underlying matrix has rank 5).

https://doi.org/10.1371/journal.pone.0211463.t003

Additionally, broader settings were considered for the comparison of the SVD, the CSVD, and PMD on simulated data. Specifically, we considered a case with a low signal-to-noise ratio and a case where the noise is structured: PMD and the CSVD performed equally poorly on noisy data but were unaffected by structured noise. These additional results are reported in S1 Table.

To sum up: 1) the CSVD and PMD produce highly similar estimates of the first singular vectors; 2) the CSVD and PMD both recover the true sparsity structure of the ground-truth data; 3) for singular vectors of an order higher than the rank of the matrix, PMD produces singular vectors correlated with the previous ones; and 4) the CSVD is computationally more efficient than PMD but this advantage shrinks as the number of computed components increases.

4.2 The face data

The dataset consists of six gray-scale digitized face images (three men and three women) of 240 × 230 = 55,200 pixels each (with values ranging from 0 to 255); these images were extracted from a larger face database (see [30, 31]) and are available as the dataset sixFaces from the R-package data4PCCAR (obtained from the Github repository HerveAbdi/data4PCCAR).

Each image was unfolded (i.e., “vectorized”) into a 240 × 230 = 55,200 element vector, which was re-scaled to norm one. A plain SVD was then performed on the 6 × 55,200 matrix (see Fig 6) obtained from the concatenation of the 6 face vectors. The SVD extracted 6 components, with the first one extracting a very large proportion of the total variance (i.e., λ1 = 5.616, τ1 = 94%; see also Fig 7, left panel). This very large first eigenvalue indicates that these face images are highly correlated (see Fig 8)—an interpretation confirmed by the very similar values of the coordinates of the six faces on the first component. But this large first eigenvalue also reflects, in part, the fact that the data were not centered: because all entries of the matrix are positive, the first left (respectively right) singular vector is, respectively, the 6-element vector whose entries are weighted means of the pixels of each face (respectively, the 55,200-element vector whose entries are weighted means, across faces, of each pixel), and so all elements of the first pair of singular vectors have the same sign (see Table 4 and the picture of the first “eigen-face” [32] in the left of the top row of Fig 9). The second component differentiates females and males (see the map of the faces for Components 1 versus 2 in Fig 9, and the picture of the second “eigen-face” in the right of the top row of Fig 9).

Fig 6. Face data.

The data matrix of the face dataset: 6 faces by 55,200 pixels. The female faces are denoted F1, F2, and F3, and the male faces M1, M2, and M3.

https://doi.org/10.1371/journal.pone.0211463.g006

Fig 7. Face data.

Eigenvalues and pseudo-eigenvalues per dimension. Left column: eigenvalues obtained from regular SVD; middle column: pseudo-eigenvalues (i.e., variance of the factor scores) for the CSVD; right column: pseudo-eigenvalues (i.e., variance of the factor scores) for the PMD. For PMD and the CSVD, each line represents a level of sparsity: none (No), low (L), medium (M), and high (H).

https://doi.org/10.1371/journal.pone.0211463.g007

Fig 8. Face data.

Cosine matrix for the 6 faces of the face dataset. The female faces are denoted F1, F2, and F3, the male faces M1, M2, and M3.

https://doi.org/10.1371/journal.pone.0211463.g008

Fig 9. Face data.

First two sparse left singular vectors (P) for the CSVD (left) and PMD (right) for three levels of sparsity: low (L), medium (M), and high (H). The results for the CSVD are reported on the left, and the results for PMD are reported on the right.

https://doi.org/10.1371/journal.pone.0211463.g009

Table 4. Face example.

Left singular vectors (i.e., face loadings) and associated eigenvalues.

https://doi.org/10.1371/journal.pone.0211463.t004

We applied the CSVD and PMD to the face dataset using three different sparsity levels (low, medium, and high). Fig 9 shows the plot of the first two components for these three levels of sparsity. Both the CSVD and PMD tend to isolate the female faces on the first dimension and the male faces on the second dimension. The corresponding first two eigen-faces (see Figs 10 and 11) show that both the CSVD and PMD tend to extract characteristic features of the female faces (first eigen-face) or of the male faces (second eigen-face). In contrast, the first and second eigen-faces for the plain SVD (plotted in Fig 12) show, respectively, a weighted average face (i.e., a linear combination of the faces with positive coefficients) and a mixture of male (with positive coefficients) and female (with negative coefficients) faces. This interpretation of the plain SVD is confirmed by Fig 13, where all faces load almost identically on Dimension 1 and where Dimension 2 separates men from women.

Fig 10. Face data.

The pseudo-eigenfaces for Dimension 1 on the left column and Dimension 2 on the right column. For this graph, only the CSVD was applied, with three different levels of sparsity: low on the top row (L), medium on the middle row (M), and high on the bottom row (H).

https://doi.org/10.1371/journal.pone.0211463.g010

Fig 11. Face data.

The pseudo-eigenfaces for Dimension 1 on the left column and Dimension 2 on the right column. For this graph, only PMD was applied, with three different levels of sparsity: low on the top row (L), medium on the middle row (M), and high on the bottom row (H).

https://doi.org/10.1371/journal.pone.0211463.g011

Fig 12. Face data.

The six eigenfaces obtained from the plain SVD.

https://doi.org/10.1371/journal.pone.0211463.g012

Fig 13. Face data.

The two first left singular vectors of the plain SVD of the non-centered data.

https://doi.org/10.1371/journal.pone.0211463.g013

Overall, the CSVD and PMD behave similarly, and both show (compared to the plain SVD) that introducing sparsity can make the results easier to interpret, because groups of individuals (men or women) can be identified and linked to a small subset of variables (i.e., here, pixels). However, the CSVD and PMD differ in the number of components suggested by their scree plots, as indicated by Fig 7. The differences between the loadings estimated by the two methods are also seen in Fig 14, which depicts the cross-products between the 6 right singular vectors for three different sparsity levels and for both methods. This figure shows that the risk of re-injecting variability that was already captured by previous components increases with the sparsity parameter and with the number of required components.

Fig 14. Face data.

Cross-product matrix of the 6 right pseudo-singular vectors for three levels of sparsity: low (L), medium (M), and high (H). The results for the CSVD are reported on the left, and the results for PMD are reported on the right.

https://doi.org/10.1371/journal.pone.0211463.g014

4.3 Psychometric example: The mental imagery questionnaire

These simulated data were created to match the loading structure of an original dataset obtained from 2,100 participants who—as part of a larger project on memory—answered an online version of the “object-spatial imagery questionnaire” (OSIQ, [29])—a psychometric instrument measuring mental imagery for objects and places. Using a 5-point rating scale, participants rated their agreement for 30 items (e.g., “I am a good Tetris player”) that should span a 2-dimensional space corresponding to the hypothesized spatial and object imagery psychometric factors.

The simulated data were obtained from the original dataset by first performing a (centered and unscaled) PCA on the original data and keeping only the loadings and the eigenvalues. Random pseudo-observations were generated by randomly sampling (with a uniform probability distribution) points in the factor space and then building back the corresponding data matrix from these random factor scores and the loadings. The final simulated data matrix was then obtained by scaling this new data matrix so that it contains only integer values whose distribution matches, as closely as possible, the original data matrix. This way, the simulated data matrix contains random values whose means, variances, and loadings roughly match those of the original data matrix (a sketch of this scheme is given below). The R code used to create the simulated data can be found in the R-package data4PCCAR (available from the Github repository HerveAbdi/data4PCCAR); the simulated data matrix can also be found in the same R-package.
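
A rough NumPy sketch of this generation scheme, assuming the loadings (a 30 × K matrix), the spread of each retained component, and the column means of the original data have been kept; the coercion to integer ratings is simplified to rounding and clipping, and every name here is illustrative (the actual generation code lives in data4PCCAR):

import numpy as np

def simulate_ratings(loadings, comp_sd, n_obs, col_means, rng=None):
    """Draw uniform factor scores, rebuild a data matrix from the kept loadings,
    then coerce it to integer ratings on the 1-5 scale."""
    rng = np.random.default_rng(rng)
    K = loadings.shape[1]
    # Uniform factor scores with roughly the spread of each retained component.
    F = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n_obs, K)) * comp_sd
    X = F @ loadings.T + col_means            # back to the variable space
    return np.clip(np.rint(X), 1, 5).astype(int)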

The 2,100 (participants) by 30 (items) data matrix was pre-processed by centering and normalizing each variable and was then analyzed by PCA (i.e., an SVD of the pre-processed matrix). Fig 15 plots the loadings of the 30 items for the first two components of the PCA. In this figure, each item is labeled by its number in the questionnaire (see [29] for details and the list of questions), and its a priori category (i.e., “object” vs. “spatial”) is indicated by the first letter (o vs. s) and by color (blue for “object” vs. gold for “spatial”). The scree plot (see Fig 16, left) and the plot of the loadings for the first two dimensions (Fig 15) support a two-factor model (with the plane created by Dimensions 1 and 2 explaining 44% of the total variance). The pattern of the loadings, however, reveals that some items load, as predicted, on only one factor (most object items and some spatial items) but that roughly half of the spatial items (i.e., s02, s03, s05, s06, s11, s20, s23, s24) and at least one object item (i.e., o15) load on both Dimensions 1 and 2. These items are ambiguous because they can reflect either only one of the hypothesized factors or a combination of both factors. To simplify the interpretation, a standard psychometric approach would keep only the unambiguous items, re-run the analysis with these items, and “prettify” the solution with an orthogonal rotation such as Varimax [33].

Fig 15. OSIQ data.

Loadings of the first two principal components.

https://doi.org/10.1371/journal.pone.0211463.g015

Fig 16. OSIQ data.

Scree plots for SVD, the CSVD and PMD for different values of the sparsity parameter.

https://doi.org/10.1371/journal.pone.0211463.g016

To evaluate the effects of sparsification, we used three levels of sparsity (in addition to the “no sparsity” condition corresponding to the plain SVD): Low (L), Medium (M), and High (H). As expected, and illustrated by the scree plots (see Fig 16), sparsification reduced the amount of variance (i.e., the pseudo-eigenvalues) explained by the sparsified components.

Fig 17 plots the item loadings for Dimensions 1 and 2 for both the CSVD (left column) and PMD (right column) as a function of the level of sparsity (L/M/H). For the first two levels of sparsity (L and M), the CSVD and PMD give similar results, possibly because the factor structure of the items on the first dimension is strong enough to be recovered without the orthogonality constraints. For the highest level of sparsity, the CSVD and PMD single out the same item (o28) on the first dimension (an unsurprising result because the maximized criteria are equivalent) but single out different items (s29 vs. s18) on the second dimension.

Fig 17. OSIQ data.

Loadings of Dimensions 1 and 2 with an increasing degree of sparsity for both the CSVD (left column) and PMD (right column).

https://doi.org/10.1371/journal.pone.0211463.g017

Fig 18 plots the correlations between the loadings estimated with the CSVD or PMD for all 30 dimensions and shows again that the components extracted by PMD are correlated with other components.

Fig 18. OSIQ data.

Plot of the cross-product between the loadings obtained for up to 30 dimensions. Left: CSVD. Right: PMD. Top to bottom: the three different levels of sparsity.

https://doi.org/10.1371/journal.pone.0211463.g018

Visual inspection of the plain PCA suggests that the items are roughly clustered into three groups (pure object, pure spatial, and mixed spatial/object). To better characterize these three groups of items, we ran an additional analysis in which we set the sparsity parameters to values (specifically, c1 ≈ 15.11 and c2 ≈ 2.50 for, respectively, the left and right singular vectors) that generate three “pure” dimensions for the item loadings (see Fig 19). With this analysis, the first two dimensions isolate the pure items and the third dimension extracts the mixed items. To confirm this interpretation, we ran a plain PCA on the pure items (see Fig 20, left) followed by a Varimax rotation of two dimensions (see Fig 20, right). The Varimax-rotated space recovered a solution equivalent to the CSVD with, however, the caveat that Varimax required a priori knowledge of the dimensionality of the space whereas the CSVD did not require this information.

Fig 19. OSIQ data.

Loadings for Dimensions 1, 2, and 3 with sparsity parameters set to c1 ≈ 15.11 and c2 ≈ 2.50. The sparsity parameters were empirically determined visually to create “pure” dimensions.

https://doi.org/10.1371/journal.pone.0211463.g019

Fig 20. OSIQ data.

PCA with the reduced set of 14 items from the OSIQ. Left: Dimensions 1 and 2 for plain PCA; Right: Dimensions 1 and 2 after a two-dimensional Varimax rotation.

https://doi.org/10.1371/journal.pone.0211463.g020

Discussion

The constrained singular value decomposition is a computationally efficient new method that sparsifies the SVD while preserving the orthogonality of the singular vectors. To do so, the CSVD expresses each constraint as a projection onto a convex set and integrates multiple constraints as the projection onto the intersection of the convex sets expressing these constraints (POCS). As shown in Appendix D, the CSVD algorithm is guaranteed to converge to a stable point because it is a member of the more general class of block successive upper-bound minimization (BSUM) algorithms. The CSVD can easily be extended to incorporate additional constraints (e.g., group LASSO, metric constraints on the rows or columns, spatial constraints) as long as these constraints can be expressed as projections onto convex sets.

To evaluate the relevance of the orthogonality constraints, we compared, on three examples, the plain SVD, the penalized matrix decomposition [11], and the CSVD. We found that, as could be expected, without the orthogonality constraint, higher-order singular vectors shared information with the earlier singular vectors—a problem likely to hinder the interpretation of these later components. The example using face images showed that the CSVD could extract, from the images, characteristic features defining clusters of observations (e.g., men vs. women). The psychometric example illustrated how the CSVD can be used in lieu of a rotation (e.g., Varimax) to identify psychometrically “pure” components without having to choose a priori the dimensionality of the space.

Of course, some questions remain open. For example, the choice of the sparsification constants is left to the user, but this choice could be guided by cross-validation schemes such as the one suggested by Witten et al. (see Algorithm 5 in [11]). In practice, this choice is likely to involve a trade-off between the interpretability of the pseudo-singular vectors and the number of non-zero loadings of the first few singular vectors. Along the same lines, and just as for the plain SVD, the number of (sparse) singular vectors to examine remains, in part, a subjective decision. Finally, the problem of the reliability of the sparsification, although a topic of great interest for sparse prediction methods [34], remains open for the sparse SVD and sparse PCA and should be a topic for future research. Future directions should also include the integration of the CSVD into other methods that are traditionally based on the regular SVD—such as canonical correlation analysis or partial least squares correlation—or on the generalized SVD—such as correspondence analysis (see, e.g., [15] for previous relevant work along these lines).

The R package csvd implementing the constrained singular value decomposition is available for download from https://github.com/vguillemot/csvd.

A The deflation operation generates orthogonal singular vectors

In this section, we show that repeatedly using the deflation operation will generate a set of left (respectively right) orthogonal singular vectors ordered by their singular value.

Theorem 1 (Deflation). The first singular triplet of the (k + 1)th deflated matrix is the (k + 1)th singular triplet of the original matrix.

Proof. Assume that the first k ≥ 1 singular values and the corresponding left and right singular vectors of a given matrix X have been estimated and stored in the matrices Δk, Pk, and Qk.

Let

X(k+1) = X − Pk Δk Qk⊤ (21)

be the (k + 1)th deflated matrix, with:

Δk = diag(δ1, …, δk), Pk = [p1, …, pk], and Qk = [q1, …, qk]. (22)

Additionally, let δ, p, and q be (respectively) the first singular value and the first left and right singular vectors of X(k+1).

To prove Theorem 1, we show that δ, p and q are the (k + 1)th singular triplet of X.

First, because δ, p, and q are a singular triplet of X(k+1), we have:

X(k+1)⊤ p = δ q (23)

and

X(k+1) q = δ p. (24)

Second, to prove that q is orthogonal to each column of Qk, we consider the quantity Qk⊤ q. By multiplying both sides of Eq 23 by Qk⊤, developing, and simplifying, we obtain

δ Qk⊤ q = Qk⊤ X(k+1)⊤ p = (Qk⊤ X⊤ − Δk Pk⊤) p = (Δk Pk⊤ − Δk Pk⊤) p = 0. (25)

Therefore, when δ is not null, q is orthogonal to each column of Qk. A similar argument shows that p is orthogonal to each column of Pk.

Third, because p (respectively, q) is orthogonal to each column of Pk (respectively Qk), we have

X q = X(k+1) q + Pk Δk Qk⊤ q = X(k+1) q and X⊤ p = X(k+1)⊤ p + Qk Δk Pk⊤ p = X(k+1)⊤ p, (26)

which, combined with Eqs (23) and (24), implies that

X q = δ p and X⊤ p = δ q, (27)

which, in turn, shows that δ, p, and q are a singular triplet of X. This proof also shows (mutatis mutandis) that any singular triplet of X(k+1) is a singular triplet of X.

Finally, from the definition of X(k+1), and because X(k+1) is orthogonal to Pk Δk Qk⊤, the rank of X is equal to the rank of X(k+1) plus the rank of Pk Δk Qk⊤ (i.e., k), which, in turn, implies that the singular values of X(k+1) are the remaining singular values of X and that the first singular value of X(k+1) is the (k + 1)th singular value of X.
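
Theorem 1 can be checked numerically in a few lines of NumPy: deflating X by its first k singular triplets leaves a matrix whose non-zero singular values are exactly the remaining singular values of X:

import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((12, 9))
P, d, Qt = np.linalg.svd(X, full_matrices=False)

k = 3
X_deflated = X - P[:, :k] @ np.diag(d[:k]) @ Qt[:k, :]
d_deflated = np.linalg.svd(X_deflated, compute_uv=False)

# The singular values of the deflated matrix are d[k], d[k+1], ... (plus zeros).
print(np.allclose(d_deflated[: len(d) - k], d[k:]))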

B Only some values of the constraints lead to solutions

As stated by Witten et al. ([11], page 519 ff.), the constraint parameters c1 and c2 lead to solutions only when they are in the range:

1 ≤ c1 ≤ √I and 1 ≤ c2 ≤ √J. (28)

Fig 1 in [11] describes the geometric intuition behind this range in ℝ².

In this appendix, we provide a proof of Statement 28. First, we prove Lemma 1 (that directly implies Statement 28).

Lemma 1. Let x ∈ ℝN; then

‖x‖2 ≤ ‖x‖1 ≤ √N ‖x‖2. (29)

Proof. Assume that x belongs to ℝN and is different from 0. The left side of the inequality is a consequence of Hölder’s inequality, which states that, for 1 ≤ p, q ≤ +∞ such that 1/p + 1/q = 1, and for two vectors x and y of ℝN,

∑n |xn yn| ≤ ‖x‖p ‖y‖q. (30)

With p = 1 (and therefore q = +∞), this version of Hölder’s inequality becomes

‖x‖2² = ∑n |xn| |xn| ≤ ‖x‖1 ‖x‖∞. (31)

Since ‖x‖∞ ≤ ‖x‖2, we have

‖x‖2² ≤ ‖x‖1 ‖x‖2, and therefore ‖x‖2 ≤ ‖x‖1. (32)

The right-hand side of Eq 29 can be seen as a consequence of the Cauchy–Schwarz inequality, which, in our case, can be formulated as follows:

(∑n an bn)² ≤ (∑n an²)(∑n bn²). (33)

If we set a to [|x1|, …, |xN|] and b to 1, we obtain

(∑n |xn|)² ≤ (∑n xn²) × N, (34)

which is equivalent to

‖x‖1 ≤ √N ‖x‖2. (35)

Putting together Eqs 32 and 35 gives:

‖x‖2 ≤ ‖x‖1 ≤ √N ‖x‖2. (36)

Lemma 1 implies that the constraints on the L2 and L1 norms of the left and right pseudo-singular vectors can both be active at the same time only for an intermediate range of the sparsity parameter: (i) when ρ ≤ 1, the L1-ball of radius ρ (i.e., B1(ρ)) is entirely included in the L2-ball of radius 1 (i.e., B2(1)), and so fulfilling the sparsity constraint implies that the L2 constraint is also fulfilled; and (ii) when ρ ≥ √N, B2(1) is entirely included in B1(ρ), and so fulfilling the normalization constraint implies that the sparsity constraint is also fulfilled. Both bounds are tight: for x = [1, 0, …, 0]⊤ we have ‖x‖1 = ‖x‖2 = 1, whereas for x = [1, …, 1]⊤/√N we have ‖x‖2 = 1 and ‖x‖1 = √N. Requiring both constraints to be active on both the rows and the columns of the CSVD therefore gives the following range for the values of c1 and c2:

1 ≤ c1 ≤ √I and 1 ≤ c2 ≤ √J, (37)

which proves the assertion.

C A fast and exact algorithm for the projection onto the intersection of an L1 and L2 ball

In this section we describe a fast and exact algorithm for the projection onto the intersection of the L1-ball of radius c (i.e., B1(c)) and the L2-ball of radius 1 (i.e., B2(1)). This projection is defined by the following equation:

proj(x, B1(c) ∩ B2(1)) = argmin over y ∈ B1(c) ∩ B2(1) of ‖x − y‖2², (38)

where x ∈ ℝN, N is the number of variables of the dimension of interest (i.e., I or J), and c is the sparsity parameter (i.e., c1 or c2), with 1 ≤ c ≤ √N.

In [11], the solution of Eq 38 is computed using a binary search algorithm (BiSe). In the main part of our article, we propose to use the more general POCS algorithm. BiSe and POCS are iterative algorithms that give an approximate solution to Eq 38. In the case of the projection onto the intersection of the L1 and L2 balls, the general POCS algorithm can be replaced by a fast and exact algorithm (see [26, 35]), which we call PL1L2 and detail in this appendix.

Projection onto the L1-ball

The proposed approach implements an efficient algorithm for projecting a vector onto the L1-ball [35].

Let x̃ be the vector containing the absolute values of the components of x, with its elements sorted in decreasing order. Additionally, we define the function φ(λ) = ‖S(x, λ)‖1. This function is continuous, piecewise linear, and decreasing from φ(0) = ‖x‖1 to φ(x̃1) = 0. Therefore, if ‖x‖1 ≥ c, since φ is continuous, there is a positive number λ such that φ(λ) = c. From this, we can deduce the algorithm for the projection onto the L1-ball of radius c, which narrows down to 4 steps.

Algorithm 6: Fast projection onto the L1-ball.

Data: x, c

Result: proj(x, B1(c))

1. Take the absolute values of the components of x and sort them in decreasing order into a new vector x̃ (with the convention x̃N+1 = 0);

2. Find i such that φ(x̃i) ≤ c ≤ φ(x̃i+1);

3. Find δ such that φ(x̃i − δ) = c. Since φ(x̃i − δ) = φ(x̃i) + iδ, then δ = [c − φ(x̃i)] / i;

4. Compute S(x, λ) with λ = x̃i − δ;

At the end of the algorithm, we obtain S(x, λ), which is the projection of x onto B1(c). A similar algorithm was proposed in [36], [37], and [38].
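
Here is a NumPy version of this projection in the spirit of Algorithm 6: it uses the sorted absolute values to locate the interval containing λ and then solves the piecewise-linear equation φ(λ) = c on that interval (the function name is ours):

import numpy as np

def proj_l1_ball_exact(x, c):
    """Exact Euclidean projection of x onto {y : ||y||_1 <= c} (c > 0)."""
    x = np.asarray(x, dtype=float)
    if np.abs(x).sum() <= c:
        return x                                   # already inside the ball
    a = np.sort(np.abs(x))[::-1]                   # absolute values, decreasing
    csum = np.cumsum(a)
    # With i components surviving, phi(lam) = csum[i-1] - i*lam;
    # keep the largest i for which the corresponding lam stays below a[i-1].
    i = np.arange(1, len(a) + 1)
    lam_candidates = (csum - c) / i
    n_keep = np.max(np.nonzero(lam_candidates < a)[0]) + 1
    lam = (csum[n_keep - 1] - c) / n_keep
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Quick check: the result lies on the boundary of the L1 ball of radius 2.
x = np.array([3.0, -2.0, 0.5, 0.1])
y = proj_l1_ball_exact(x, c=2.0)
print(np.abs(y).sum())                             # 2.0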

Projection onto the intersection of the L1 and L2-balls

In order to solve the optimization problem from Eq 38, we extend Algorithm 6 to the function

ψ(λ) = ‖S(x, λ)‖1 / ‖S(x, λ)‖2. (39)

We have the following Lemma:

Lemma 2. Let x be a vector of ℝN with n ≤ N non-zero elements. Then

‖x‖1 ≤ √n ‖x‖2. (40)

Proof. The proof of this Lemma is very similar to the proof given in Appendix B. Recall that, as a consequence of the Cauchy–Schwarz inequality:

(∑j aj bj)² ≤ (∑j aj²)(∑j bj²). (41)

With a = [|x1|, …, |xN|] and b the vector defined by

bj = 1 if xj ≠ 0, and bj = 0 otherwise, (42)

the previous inequality becomes

(∑j |xj|)² ≤ (∑j xj²) × n, (43)

which is equivalent to

‖x‖1 ≤ √n ‖x‖2. (44)

Proposition 1. For 0 ≤ λ < x̃1, the function

ψ(λ) = ‖S(x, λ)‖1 / ‖S(x, λ)‖2 (45)

verifies the 3 following properties:

  1. ψ is continuous and decreasing.
  2. There exist an integer i and a positive real number δ, smaller than x̃i − x̃i+1, such that ψ(x̃i − δ) = c.
  3. δ is the solution of a second degree polynomial equation.

Proof. (i). The numerator and the denominator of ψ are continuous because they are compositions of continuous functions. Moreover, for any λ strictly smaller than x̃1, ‖S(x, λ)‖2 > 0. Therefore, ψ is continuous because it is the ratio of a continuous function and a non-zero continuous function.

Assuming that λ < x̃1 and that λ is not equal to one of the x̃k, there exists k ∈ {1, …, N} such that x̃k+1 < λ < x̃k. For this specific λ, we have:

‖S(x, λ)‖1 = ∑j=1,…,k (x̃j − λ) (46)

and

‖S(x, λ)‖2² = ∑j=1,…,k (x̃j − λ)². (47)

Together, Eqs 46 and 47 imply that the derivative of ψ has the form:

ψ′(λ) = [ψ(λ)² − k] / ‖S(x, λ)‖2. (48)

Moreover, because the number of non-zero elements of the vector S(x, λ) is equal to k, Lemma 2 implies that ‖S(x, λ)‖1 ≤ √k ‖S(x, λ)‖2, and therefore ψ(λ)² ≤ k. As a consequence, ψ′(λ) ≤ 0, which, in turn, implies that ψ, being a continuous function with a non-positive derivative, is a decreasing function.

(ii). Let Nmax be the number of elements of x̃ equal to x̃1 (the maximum of x̃), and let λ tend to x̃1 from below. Then

lim λ→x̃1 ψ(λ) = √Nmax. (49)

Thus, ψ is decreasing from ψ(0) = ‖x‖1 / ‖x‖2, which is at most √N (Lemma 2), to √Nmax. This implies that, for √Nmax ≤ c ≤ ‖x‖1 / ‖x‖2, there is an integer i ∈ {1, …, N} such that ψ(x̃i) ≤ c ≤ ψ(x̃i+1) (with the convention x̃N+1 = 0). Finally, because ψ is continuous, there is a real number δ in [0, x̃i − x̃i+1] such that ψ(x̃i − δ) = c.

(iii). Using the notations U = ∑j=1,…,i (x̃j − x̃i) and V = ∑j=1,…,i (x̃j − x̃i)², with i (and δ) defined as previously stated in (ii), we have, for λ = x̃i − δ:

‖S(x, λ)‖1 = U + iδ (50)

and

‖S(x, λ)‖2² = V + 2Uδ + iδ². (51)

Moreover, since

ψ(x̃i − δ) = c, (52)

the following equality holds:

‖S(x, λ)‖1² = c² ‖S(x, λ)‖2². (53)

Incorporating Eqs 50 and 51 into Eq 53 gives:

i(i − c²) δ² + 2U(i − c²) δ + (U² − c² V) = 0. (54)

The goal is now to find the positive root of this second degree polynomial equation. The discriminant Δ is equal to 4c²(i − c²)(iV − U²). It remains to show that Δ is positive.

First, the number of non-zero elements of S(x, x̃i − δ) is equal to i, and Lemma 2 yields c = ψ(x̃i − δ) ≤ √i. Second, because c > √Nmax, the i largest entries of x̃ cannot all be equal, so the inequality of Lemma 2 is strict and c² < i, which implies that i − c² > 0. Third, applying the Cauchy–Schwarz inequality to the i values x̃j − x̃i (which are not all equal, because x̃1 − x̃i > 0 while x̃i − x̃i = 0) gives iV − U² > 0, which ensures that Δ is positive.

To conclude, the sign of (U² − c²V) / [i(i − c²)] corresponds to the sign of the product of the 2 roots. As this term is negative (because U/√V = ψ(x̃i) ≤ c), the 2 roots have opposite signs. The single positive solution of Eq 54 is:

δ = [−U(i − c²) + c √((i − c²)(iV − U²))] / [i(i − c²)]. (55)

Using the fact that i − c² > 0, the previous equation can be simplified as

δ = [c √((iV − U²)/(i − c²)) − U] / i. (56)

We deduce from this a four step algorithm, called PL1L2, for the projection onto the intersection of the L1-ball of radius c and the L2-ball of radius 1.

Algorithm 7: PL1L2: an algorithm for a fast and exact projection onto .

Data: x, c

Result: proj(x, B1(c) ∩ B2(1))

1. Take the absolute values of the components of x and sort them in decreasing order to get x̃;

2. Find i such that ψ(x̃i) ≤ c ≤ ψ(x̃i+1);

3. Let δ = [c √((iV − U²)/(i − c²)) − U] / i, with U = ∑j=1,…,i (x̃j − x̃i) and V = ∑j=1,…,i (x̃j − x̃i)²;

4. Compute S(x, λ), with λ = x̃i − δ, and normalize it to obtain the projection;
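
A NumPy sketch of this projection built directly on ψ: it first checks whether projecting onto the L2 ball alone already satisfies the L1 constraint and, otherwise, finds λ such that ψ(λ) = c. For simplicity, λ is located here by bisection rather than by the closed-form root of Eq 54 (steps 2 and 3 of Algorithm 7); names are ours, and 1 < c < √N is assumed (see Appendix B):

import numpy as np

def soft(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def proj_l1_l2(x, c, tol=1e-10):
    """Projection of x onto {y : ||y||_1 <= c, ||y||_2 <= 1}, 1 < c < sqrt(len(x))."""
    x = np.asarray(x, dtype=float)
    y = x / max(np.linalg.norm(x), 1.0)        # projection onto the L2 ball alone
    if np.abs(y).sum() <= c:
        return y                               # the L1 constraint is not active
    # Otherwise solve psi(lam) = ||S(x,lam)||_1 / ||S(x,lam)||_2 = c;
    # psi is continuous and decreasing on [0, max|x|), so bisection applies.
    lo, hi = 0.0, np.abs(x).max()
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        s = soft(x, lam)
        if np.abs(s).sum() > c * np.linalg.norm(s):
            lo = lam
        else:
            hi = lam
    s = soft(x, hi)
    return s / np.linalg.norm(s)

# Example: the result has unit L2 norm and an L1 norm equal to c.
rng = np.random.default_rng(5)
x = rng.standard_normal(50)
y = proj_l1_l2(x, c=3.0)
print(round(np.linalg.norm(y), 6), round(np.abs(y).sum(), 6))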

D Convergence of the CSVD algorithm

In this appendix we prove the convergence of the CSVD. To do so, we show that the CSVD is an instance of the block successive upper-bound minimization (BSUM) algorithm (introduced in [27]) and, as such, converges to a stationary point.

D.1 Definitions and notations

Define f as the negative of the objective function from Eq 15:

f(p, q) = −p⊤ X q. (57)

The functions u1 and u2 are two “approximations” of f, defined as

u1(p̃ | p, q) = −p̃⊤ X q (58)

and

u2(q̃ | p, q) = −p⊤ X q̃. (59)

These two functions depend on the fixed given vectors p and q and vary according to p̃ and q̃.

In the BSUM framework, f is minimized by iteratively minimizing u1 over a convex set 𝒞1 and u2 over a convex set 𝒞2 (here, the intersections of the convex sets defined by the constraints of Eq 15).

Definition 1 (BSUM algorithm [27]). The BSUM algorithm (in the present setting) is defined as:

  1. Minimize u1 over 𝒞1 with q fixed, and update p with the solution;
  2. Minimize u2 over 𝒞2 with p fixed, and update q with the solution,

and iterate until convergence.

Recall the following definitions.

Definition 2 (Directional derivative). Let g be a function with gradient at x denoted ∇x g. The directional derivative of g at x in a direction d is

g′(x; d) = d⊤ ∇x g. (60)

Definition 3 (Regularity). Let f be a differentiable function defined over ℝI × ℝJ. Assume that

f′((p, q); (dp, 0)) ≥ 0 (61)

and

f′((p, q); (0, dq)) ≥ 0, (62)

with dp ∈ ℝI and dq ∈ ℝJ. If this implies that

f′((p, q); (dp, dq)) ≥ 0, (63)

then f is regular.

D.2 Equivalence of the BSUM algorithm and the CSVD

Algorithm 5 is equivalent to the BSUM algorithm because:

  1. Minimizing f is the same as maximizing the objective function in Eq 15.
  2. Minimizing u1 over 𝒞1 is equivalent to the left projection step.
  3. Similarly, minimizing u2 over 𝒞2 is equivalent to the right projection step.

D.3 Convergence of the BSUM algorithm

In order to converge to a stationary point, the BSUM algorithm needs to meet a few key assumptions that are specified in the following theorem (adapted from Theorem 2 in [27]).

Theorem 2 (Convergence). The BSUM algorithm converges to a stationary point under the following conditions:

  1. f is regular,
  2. u1, u2 and f coincide (condition (B1) in [27]): u1(p | p, q) = f(p, q) and u2(q | p, q) = f(p, q); (64)
  3. u1, u2 are upper bounds of f (condition (B2) in [27]): u1(p̃ | p, q) ≥ f(p̃, q) for all p̃, and u2(q̃ | p, q) ≥ f(p, q̃) for all q̃; (65)
  4. the directional derivatives of u1, u2 and f coincide (condition (B3) in [27]): u1′(p̃ | p, q; d) evaluated at p̃ = p equals f′((p, q); (d, 0)) (66) and u2′(q̃ | p, q; d) evaluated at q̃ = q equals f′((p, q); (0, d)); (67)
  5. u1 and u2 are continuous functions (condition (B4) in [27]).

D.3.1 Regularity.

We show in this section that f is regular. The gradient of f with respect to its arguments p and q is

∇f(p, q) = (∇p f, ∇q f) = (−X q, −X⊤ p). (68)

Thus, the directional derivative of f in the direction d = (dp, dq), with dp ∈ ℝI and dq ∈ ℝJ, is equal to

f′((p, q); d) = −dp⊤ X q − p⊤ X dq. (69)

Hence, if −dp⊤ X q ≥ 0 and −p⊤ X dq ≥ 0, then the directional derivative of f in the direction d is also non-negative, which proves that f is regular.

D.3.2 (B1), (B2), and (B3) in Razaviyayn et al., 2013.

Because u1 and u2 are equal to the function f with either p or q fixed, u1 and u2 coincide with f, which proves (B1). Necessarily, so do their directional derivatives, which proves (B3). Finally because they coincide, u1 and u2 are upper bounds of f [in the sense of the condition (B2) in [27]].

D.3.3 Continuity: (B4) in Razaviyayn et al., 2013.

Being compositions of linear operations, the functions u1 and u2 are both continuous.

D.4 Conclusion

It follows from these properties that the CSVD method described in Algorithm 5, which alternates applying the projected power method with respect to p and with respect to q, is a particular instance of the block successive upper-bound minimization (BSUM) algorithm [27]. Therefore, any limit point of the CSVD method is a stationary point, and so the CSVD converges to a stationary solution of Eq 15.

Supporting information

S1 Table. Simulated data.

Additional results on broader settings.

https://doi.org/10.1371/journal.pone.0211463.s001

(PDF)

References

  1. 1. Abdi H. Singular Value Decomposition (SVD) and Generalized Singular Value Decomposition (GSVD). In: Salkind NJ, editor. Encyclopedia of Measurement and Statistic. Thousand Oaks (CA): Sage; 2007. p. 907–912.
  2. 2. Greenacre M. Correspondence analysis. New-York: Academic Press; 1984.
  3. 3. Lebart L, Morineau A, Warwick KM. Multivariate Descriptive Statistical Analysis: Correspondence Analysis and Related Techniques for Large Matrices. New-York: Wiley; 1984.
  4. 4. Jolliffe IT, Cadima J. Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences. 2016;374 (2065).
  5. 5. Hotelling H. Relations between two sets of variates. Biometrika. 1936 dec;28(3-4):321–377.
  6. 6. Greenacre M. Correspondence analysis in practice. Boca Raton: Chapman and Hall; 2010.
  7. 7. Abdi H, Williams LJ. Partial least squares methods: Partial least squares correlation and partial least square regression. In: Reisfeld B, Mayeno A, editors. Methods in Molecular Biology: Computational Toxicology. New-York: Springer Verlag; 2013. p. 549–579.
  8. 8. Abdi H, Williams LJ. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistic. 2010;2:97–106.
  9. 9. Efron B, Hastie T. Computer Age Statistical Inference. Cambridge: Cambridge University Press; 2016.
  10. 10. Trendafilov NT. From simple structure to sparse components: a review. Computational Statistics. 2014 jun;29(3-4):431–454. Available from: http://link.springer.com/10.1007/s00180-013-0434-5.
  11. 11. Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics. 2009;10(3):515–534. pmid:19377034
  12. 12. Lu Z, Zhang Y. An augmented Lagrangian approach for sparse principal component analysis. Mathematical Programming. 2012 oct;135(1-2):149–193. Available from: http://link.springer.com/10.1007/s10107-011-0452-4.
  13. 13. Genicot M, Huang W, Trendafilov NT. Weakly Correlated Sparse Components with Nearly Orthonormal Loadings. In: GSI: International Conference on Geometric Science of Information. Palaiseau, France: Springer, Cham; 2015. p. 484–490. Available from: http://link.springer.com/10.1007/978-3-319-25040-3_52.
  14. 14. Benidis K, Sun Y, Babu P, Palomar DP. Orthogonal Sparse PCA and Covariance Estimation via Procrustes Reformulation. IEEE Transactions on Signal Processing. 2016 dec;64(23):6211–6226. Available from: http://ieeexplore.ieee.org/document/7558183/.
  15. 15. Allen GI, Grosenick L, Taylor J. A Generalized Least-Square Matrix Decomposition. Journal of the American Statistical Association. 2014 jan;109(505):145–159.
  16. 16. Combettes PL. The foundations of set theoretic estimation. Proceedings of the IEEE. 1993;81(2):182–208.
  17. 17. Bauschke HH, Combettes PL. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. 2nd ed. New-York, NY: Springer Verlag; 2017.
  18. 18. Zou H, Hastie T, Tibshirani R. Sparse principal component analysis. Journal of computational and graphical statistics. 2006;15(2):265–286.
  19. 19. Hastie T, Tibshirani R, Wainwright M. Statistical Learning with Sparsity: The Lasso and Generalizations. Boca Raton: CRC Press; 2015.
  20. 20. Mattei PA, Bouveyron C, Latouche P. Globally Sparse Probabilistic PCA. In: Gretton A, Robert CC, editors. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics. vol. 51 of Proceedings of Machine Learning Research. Cadiz, Spain: PMLR; 2016. p. 976–984. Available from: http://proceedings.mlr.press/v51/mattei16.html.
  21. 21. Mackey L. Deflation Methods for Sparse PCA. In: Advances in Neural Information Processing Systems; 2009. p. 1017–1024.
  22. 22. Jenatton R, Audibert JY, Bach F. Structured Variable Selection with Sparsity-Inducing Norms. The Journal of Machine Learning Research. 2011;12:2777–2824.
  23. 23. Chun H, Keleş S. Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society Series B, Statistical methodology. 2010 jan;72(1):3–25. pmid:20107611
  24. 24. Le Floch E, Guillemot V, Frouin V, Pinel P, Lalanne C, Trinchera L, et al. Significant correlation between a set of genetic polymorphisms and a functional brain network revealed by feature selection and sparse Partial Least Squares. NeuroImage. 2012 oct;63(1):11–24. pmid:22781162
  25. 25. Silver M, Janousova E, Hua X, Thompson PM, Montana G, Alzheimer’s Disease Neuroimaging Initiative TADN. Identification of gene pathways implicated in Alzheimer’s disease using longitudinal imaging phenotypes with sparse regression. NeuroImage. 2012 nov;63(3):1681–94. pmid:22982105
  26. 26. Gloaguen A, Guillemot V, Tenenhaus A. An efficient algorithm to satisfy ℓ1 and ℓ2 constraints. In: 49èmes Journées de statistique. Avignon, France; 2017. p. 1–6. Available from: http://jds2017.sfds.asso.fr/program/Soumissions/subm306.pdf.
  27. 27. Razaviyayn M, Hong M, Luo ZQ. A Unified Convergence Analysis of Block Successive Minimization Methods for Nonsmooth Optimization. SIAM Journal on Optimization. 2013 Jun;23(2):1126–1153.
  28. 28. Boyd SP, Vandenberghe L. Convex optimization. 1st ed. Cambridge: Cambridge University Press; 2004.
  29. 29. Blajenkova O, Kozhevnikov M, Motes MA. Object-spatial imagery: a new self-report imagery questionnaire. Applied Cognitive Psychology. 2006 mar;20(2):239–263.
  30. 30. Valentin D, Abdi H, Edelman B. From rotation to disfiguration: Testing a dual-strategy model for recognition of faces across view angles. Perception. 1999;28:817–824. pmid:10664774
  31. 31. Abdi H, Beaton D. Principal Component and Correspondence Analyses Using R. New York: Springer Verlag; 2019.
  32. 32. Turk MA, Pentland AP. Face recognition using eigenfaces. In: Proceedings. 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Comput. Sco. Press; 1991. p. 586–591. Available from: http://ieeexplore.ieee.org/document/139758/.
  33. 33. Abdi H. Factor rotations in factor analysis. In: Salkind NJ, editor. Encyclopedia for Research Methods for the Social Sciences. Thousand Oaks (CA): Sage; 2003. p. 792–795.
  34. 34. Meinshausen N, Bühlmann P. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2010 jul;72(4):417–473.
  35. 35. Van Den Berg E, Schmidt M, Friedlander MP, Murphy K. Group Sparsity via Linear-Time Projection; 2008. Available from: http://www.optimization-online.org/DB_FILE/2008/07/2056.pdf.
  36. 36. Candes EJ, Romberg JK. Signal recovery from random projections. In: Bouman CA, Miller EL, editors. Computational Imaging III, Proceedings of Electronic Imaging 2005. vol. 5674. International Society for Optics and Photonics; 2005. p. 76. Available from: http://proceedings.spiedigitallibrary.org/proceeding.aspx?doi=10.1117/12.600722.
  37. 37. Daubechies I, Fornasier M, Loris I. Accelerated Projected Gradient Method for Linear Inverse Problems with Sparsity Constraints. Journal of Fourier Analysis and Applications. 2008 dec;14(5-6):764–792. Available from: http://link.springer.com/10.1007/s00041-008-9039-8.
  38. 38. Duchi J, Shalev-Shwartz S, Singer Y, Chandra T. Efficient projections onto the l1-ball for learning in high dimensions. In: Proceedings of the 25th international conference on Machine learning—ICML’08. New York, New York, USA: ACM Press; 2008. p. 272–279. Available from: http://portal.acm.org/citation.cfm?doid=1390156.1390191.