A doubly sparse approach for group variable selection

An erratum to this article was published on 25 July 2017 and is available at https://doi.org/10.1007/s10463-017-0612-2.

Abstract

We propose a new penalty, called the doubly sparse (DS) penalty, for variable selection in high-dimensional linear regression models when the covariates are naturally grouped. An advantage of the DS penalty over other penalties is that it provides a clear way of controlling between-group and within-group sparsity separately. We prove that there exists a unique global minimizer of the DS-penalized sum of squared residuals and show how the DS penalty selects groups, and variables within the selected groups, even when the number of groups exceeds the sample size. We also introduce an efficient optimization algorithm. Results from simulation studies and a real data analysis show that the DS penalty outperforms other existing penalties in finite samples.
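
The exact functional form of the DS penalty and the paper's optimization algorithm are given in the full text, which this excerpt omits. As a rough illustration only, the sketch below fits a doubly sparse criterion using a hypothetical MCP-flavored stand-in for the group-level penalty (slope \(\lambda \) at zero, decaying to the within-group level \(\gamma \) once the average within-group \(\ell _1\) norm exceeds \(a(\lambda -\gamma )\), mirroring the boundary behavior used in the Appendix) and a CCCP/DC iteration in the spirit of Yuille and Rangarajan (2003) and An and Tao (1997), with weighted-lasso subproblems solved by coordinate descent (Friedman et al. 2007). All function names and the stand-in penalty are illustrative, not the paper's.

```python
import numpy as np

def ds_slope(u, lam, gam, a=3.7):
    """Illustrative group-level penalty slope at average within-group
    l1 norm u = ||beta_k||_1 / p_k: equals lam at u = 0 and decays
    linearly to gam for u >= a*(lam - gam).  Requires lam > gam."""
    return gam + (lam - gam) * max(0.0, 1.0 - u / (a * (lam - gam)))

def weighted_lasso_cd(X, y, w, beta, n_sweeps=200, tol=1e-8):
    """Coordinate descent for (2n)^{-1}||y - X beta||^2 + sum_j w_j |beta_j|,
    assuming each column is standardized so that X_j^T X_j / n = 1."""
    n = X.shape[0]
    r = y - X @ beta                       # current residual
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(X.shape[1]):
            old = beta[j]
            z = old + X[:, j] @ r / n      # univariate least-squares target
            beta[j] = np.sign(z) * max(abs(z) - w[j], 0.0)  # soft threshold
            if beta[j] != old:
                r += X[:, j] * (old - beta[j])
                max_change = max(max_change, abs(beta[j] - old))
        if max_change < tol:
            break
    return beta

def ds_fit(X, y, groups, lam, gam, a=3.7, n_cccp=20):
    """CCCP/DC iteration: linearize the concave group-level part of the
    penalty at the current iterate, then solve the resulting weighted
    lasso.  groups[j] is the group label of column j."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_cccp):
        w = np.empty_like(beta)
        for k in np.unique(groups):
            idx = np.flatnonzero(groups == k)
            u = np.abs(beta[idx]).sum() / idx.size   # average group l1 norm
            w[idx] = ds_slope(u, lam, gam, a)        # per-coefficient weight
        beta = weighted_lasso_cd(X, y, w, beta)
    return beta

# Synthetic usage: 3 groups of 5 covariates, first group active.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 15))
X /= np.sqrt((X ** 2).mean(axis=0))        # enforce X_j^T X_j / n = 1
y = X[:, :5] @ np.ones(5) + rng.standard_normal(100)
beta_hat = ds_fit(X, y, np.repeat(np.arange(3), 5), lam=0.5, gam=0.1)
```

Each CCCP pass linearizes the concave part of the group penalty at the current iterate, so every subproblem is a convex weighted lasso; with beta initialized at zero, the first pass is an ordinary lasso at level \(\lambda \).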



References

  • An, L. T. H., Tao, P. D. (1997). Solving a class of linearly constrained indefinite quadratic problems by DC algorithms. Journal of Global Optimization, 11, 253–285.

  • Bertsekas, D. P. (1999). Nonlinear Programming (2nd ed.). Belmont: Athena Scientific.

  • Bickel, P. J., Ritov, Y. A., Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37, 1705–1732.

  • Breheny, P. (2015). The group exponential lasso for bi-level variable selection. Biometrics, 71, 731–740.

  • Breheny, P., Huang, J. (2009). Penalized methods for bi-level variable selection. Statistics and Its Interface, 2, 369–380.

  • Breiman, L., Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80, 580–598.

  • Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32, 407–499.

  • Fan, J., Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.

  • Fan, J., Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 32, 928–961.

  • Friedman, J., Hastie, T., Hofling, H., Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1, 302–332.

  • Huang, J., Zhang, T. (2010). The benefit of group sparsity. The Annals of Statistics, 38, 1978–2004.

  • Huang, J., Horowitz, J. L., Ma, S. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics, 36, 587–613.

  • Huang, J., Ma, S., Xie, H., Zhang, C.-H. (2009). A group bridge approach for variable selection. Biometrika, 96, 339–355.

  • Huang, J., Breheny, P., Ma, S. (2012). A selective review of group selection in high-dimensional models. Statistical Science, 27, 481–499.

  • Jiang, D., Huang, J. (2015). Concave 1-norm group selection. Biostatistics, 16, 252–267.

  • Kim, Y., Kwon, S. (2012). Global optimality of nonconvex penalized estimators. Biometrika, 99, 315–325.

  • Kim, Y., Choi, H., Oh, H. (2008). Smoothly clipped absolute deviation on high dimensions. Journal of the American Statistical Association, 103, 1656–1673.

  • Kwon, S., Lee, S., Kim, Y. (2015). Moderately clipped LASSO. Computational Statistics and Data Analysis, 92, 53–67.

  • Lin, Y., Zhang, H. H. (2006). Component selection and smoothing in smoothing spline analysis of variance models. The Annals of Statistics, 34, 2272–2297.

  • Meinshausen, N., Yu, B. (2009). Lasso-type recovery of sparse representation for high-dimensional data. The Annals of Statistics, 37, 246–270.

  • Rosset, S., Zhu, J. (2007). Piecewise linear regularized solution paths. The Annals of Statistics, 35, 1012–1030.

  • Sardy, S., Tseng, P. (2004). Amlet, ramlet, and gamlet: Automatic nonlinear fitting of additive models, robust and generalized, with wavelets. Journal of Computational and Graphical Statistics, 13, 283–309.

  • Scheetz, T. E., Kim, K. Y. A., Swiderski, R. E., Philp, A. R., Braun, T. A., Knudtson, K. L., Dorrance, A. M., DiBona, G. F., Huang, J., Casavant, T. L., Sheffield, V. C., Stone, E. M. (2006). Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences, 103, 14429–14434.

  • Simon, N., Friedman, J., Hastie, T., Tibshirani, R. (2013). A sparse-group lasso. Journal of Computational and Graphical Statistics, 22, 231–245.

  • Sriperumbudur, B. K., Lanckriet, G. R. (2009). On the convergence of the concave-convex procedure. Advances in Neural Information Processing Systems, 9, 1759–1767.

  • Tibshirani, R. J. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B, 58, 267–288.

  • Wang, H., Li, R., Tsai, C. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94, 553–568.

  • Wang, H., Li, B., Leng, C. (2009). Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society Series B, 71, 671–683.

  • Wang, L., Kim, Y., Li, R. (2013). Calibrating non-convex penalized regression in ultra-high dimension. The Annals of Statistics, 41, 2505–2536.

  • Wei, F., Huang, J. (2010). Consistent group selection in high-dimensional linear regression. Bernoulli, 16, 1369–1384.

  • Ye, F., Zhang, C.-H. (2010). Rate minimaxity of the Lasso and Dantzig selector for the \(\ell _q\) loss in \(\ell _r\) balls. Journal of Machine Learning Research, 11, 3519–3540.

  • Yuan, M., Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B, 68, 49–67.

  • Yuille, A., Rangarajan, A. (2003). The concave-convex procedure. Neural Computation, 15, 915–936.

  • Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38, 894–942.

  • Zhang, C.-H., Zhang, T. (2012). A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science, 27, 576–593.

  • Zhao, P., Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research, 7, 2541–2563.

  • Zhou, N., Zhu, J. (2010). Group variable selection via a hierarchical lasso and its oracle property. Statistics and Its Interface, 3, 557–574.

  • Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.


Acknowledgments

We are grateful to the anonymous referees, the associate editor, and the editor for their helpful comments. The research of Sunghoon Kwon was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (No. 2014R1A1A1002995). The research of Woncheol Jang was supported by the Basic Science Research Program through the National Research Foundation (NRF) of Korea grant funded by the Ministry of Education (No. 2013R1A1A2010065) and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2014R1A4A1007895). The research of Yongdai Kim was supported by the Basic Science Research Program through the National Research Foundation (NRF) of Korea grant funded by the Korea government (MSIP) (No. 2014R1A4A1007895).

Author information

Corresponding author

Correspondence to Yongdai Kim.


Appendix

Without loss of generality, we assume that the covariates are standardized so that \(\mathbf{X}_{kj}^\mathrm{T}\mathbf{X}_{kj}/n=1\) for all \(k\le K\) and \(j\le p_k\). Further, we use \({\varvec{\hat{{\beta }}}}^o\) instead of \({\varvec{\hat{{\beta }}}}^o(\gamma )\) for simplicity.
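
As a reading aid for the proofs (restated here; the definitions are in the main text, which this excerpt omits, and the scaling below is inferred from the identities used in the proof of Lemma 3), \(Q_{\lambda ,\gamma }\) denotes the DS-penalized criterion and \(D\) the gradient of its least squares part:

$$\begin{aligned} Q_{\lambda ,\gamma }(\varvec{{\beta }})=\frac{1}{2n}\Vert \mathbf{y}-\mathbf{X}\varvec{{\beta }}\Vert _2^2+\sum _{k=1}^{K}J_{\lambda ,\gamma }^{(k)}(\Vert \varvec{{\beta }}_k\Vert _1),\qquad D(\varvec{{\beta }})=-\mathbf{X}^\mathrm{T}(\mathbf{y}-\mathbf{X}\varvec{{\beta }})/n, \end{aligned}$$

with \(D_k(\varvec{{\beta }})\) the subvector of \(D(\varvec{{\beta }})\) corresponding to group \(k\).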

Proof of Lemma 1

From the first order optimality conditions (Bertsekas 1999), the necessary conditions follow directly. \(\square \)
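
Spelled out in the notation above (a one-line elaboration, not in the original), the condition \(\mathbf{0}\in \partial Q_{\lambda ,\gamma }({\varvec{\hat{{\beta }}}})\) reads blockwise as

$$\begin{aligned} \mathbf{0}\in D_k({\varvec{\hat{{\beta }}}})+\partial \big (J_{\lambda ,\gamma }^{(k)}\circ \Vert \cdot \Vert _1\big )({\varvec{\hat{{\beta }}}}_k),\quad k=1,\ldots ,K, \end{aligned}$$

where \(\partial \) denotes the (Clarke) subdifferential; the conditions in Lemma 1 are the specialization of this inclusion to each group regime.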

Proof of Lemma 2

It suffices to show that there exists a \(\delta >0\) such that \(Q_{\lambda ,\gamma }(\varvec{{\beta }}) \ge Q_{\lambda ,\gamma }({\varvec{\hat{{\beta }}}})\) for all \(\varvec{{\beta }} \in B({\varvec{\hat{{\beta }}}},\delta ),\) where \(B({\varvec{\hat{{\beta }}}},\delta )=\{\varvec{{\beta }}:\Vert \varvec{{\beta }}-{\varvec{\hat{{\beta }}}}\Vert _1\le \delta \}.\) From the convexity of the sum of squared residuals, \(Q_{\lambda ,\gamma }(\varvec{{\beta }})-Q_{\lambda ,\gamma }({\varvec{\hat{{\beta }}}}) \ge \sum _{k=1}^K \chi _k\), where

$$\begin{aligned} \chi _k = D_k({\varvec{\hat{{\beta }}}})^\mathrm{T}(\varvec{{\beta }}_k-{\varvec{\hat{{\beta }}}}_k) + J_{\lambda ,\gamma }^{(k)}(\Vert \varvec{{\beta }}_k\Vert ) - J_{\lambda ,\gamma }^{(k)}(\Vert {\varvec{\hat{{\beta }}}}_k\Vert ). \end{aligned}$$

First, consider cases where \(k\in \mathcal{L}({\varvec{\hat{{\beta }}}})\). Let \(\delta _k=\Vert {\varvec{\hat{{\beta }}}}_k\Vert _1-ap_k(\lambda -\gamma )\); then \(\Vert \varvec{{\beta }}_k\Vert _1/p_k >a(\lambda -\gamma )\) for all \(\varvec{{\beta }}_k \in B({\varvec{\hat{{\beta }}}}_k,\delta _k)\). Hence, the first and second conditions in Lemma 1 imply

$$\begin{aligned} \chi _k \ge -\gamma (\Vert \varvec{{\beta }}_k\Vert _1-\Vert {\varvec{\hat{{\beta }}}}_k\Vert _1) + \gamma (\Vert \varvec{{\beta }}_k\Vert _1-\Vert {\varvec{\hat{{\beta }}}}_k\Vert _1)= 0. \end{aligned}$$

Next, consider cases where \(k\in \mathcal{N}({\varvec{\hat{{\beta }}}})\). Let \(\delta _k = \min \{\omega _k,a(\lambda -\gamma )\}\), where \(\omega _k = 2a(\lambda -\Vert D_k({\varvec{\hat{{\beta }}}})\Vert _\infty )\). Then \(\Vert \varvec{{\beta }}_k\Vert _1 < \omega _k\) for all \(\varvec{{\beta }}_k \in B({\varvec{\hat{{\beta }}}}_k,\delta _k)\), which implies

$$\begin{aligned} \chi _k \ge \big (-\Vert D_k({\varvec{\hat{{\beta }}}})\Vert _\infty -\Vert \varvec{{\beta }}_k\Vert _1/2a +\lambda \big )\Vert \varvec{{\beta }}_k\Vert _1 \ge 0. \end{aligned}$$

Hence \(Q_{\lambda ,\gamma }(\varvec{{\beta }}) \ge Q_{\lambda ,\gamma }({\varvec{\hat{{\beta }}}})\) for all \(\varvec{{\beta }} \in B({\varvec{\hat{{\beta }}}},\min _{k \le K} \delta _k),\) which completes the proof. \(\square \)

Proof of Lemma 3

Assume that there exists another local minimizer \(\varvec{\tilde{{\beta }}} \in \varvec{{\Omega }}_{\lambda ,\gamma }\) with \(\varvec{\tilde{{\beta }}} \ne {\varvec{\hat{{\beta }}}}\). Let \(\varvec{{\beta }}^h=\varvec{\tilde{{\beta }}}+h({\varvec{\hat{{\beta }}}}-\varvec{\tilde{{\beta }}})=h{\varvec{\hat{{\beta }}}} +(1-h)\varvec{\tilde{{\beta }}}\) for \(0<h<1\); then we have

$$\begin{aligned} \Vert \mathbf{y}-\mathbf{X}\varvec{{\beta }}^h\Vert _2^2 {-} \Vert \mathbf{y}-\mathbf{X}\varvec{\tilde{{\beta }}}\Vert _2^2 = -2n h D({\varvec{\hat{{\beta }}}})^\mathrm{T}(\varvec{\tilde{{\beta }}}-{\varvec{\hat{{\beta }}}})+(h^2{-}2h)(\varvec{\tilde{{\beta }}}{-}{\varvec{\hat{{\beta }}}})^\mathrm{T}\mathbf{X}^\mathrm{T}\mathbf{X}(\varvec{\tilde{{\beta }}}-{\varvec{\hat{{\beta }}}}), \end{aligned}$$

by using the equality,

$$\begin{aligned} \Vert \mathbf{y}-\mathbf{X}\varvec{{\beta }}\Vert _2^2 - \Vert \mathbf{y}-\mathbf{X}{\varvec{\hat{{\beta }}}}\Vert _2^2 = 2nD({\varvec{\hat{{\beta }}}})^\mathrm{T}(\varvec{{\beta }} -{\varvec{\hat{{\beta }}}}) + (\varvec{{\beta }} -{\varvec{\hat{{\beta }}}})^\mathrm{T} \mathbf{X}^\mathrm{T}\mathbf{X}(\varvec{{\beta }} -{\varvec{\hat{{\beta }}}}), \end{aligned}$$

for any \(\varvec{{\beta }}\in \mathbb {R}^p.\) Hence, it follows that

$$\begin{aligned} Q_{\lambda ,\gamma }(\varvec{{\beta }}^h)-Q_{\lambda ,\gamma }(\varvec{\tilde{{\beta }}}) \le h\sum _{k=1}^K\chi _k(h)+ h^2(\varvec{\tilde{{\beta }}}-{\varvec{\hat{{\beta }}}})^\mathrm{T}(\mathbf{X}^\mathrm{T}\mathbf{X}/n)(\varvec{\tilde{{\beta }}}-{\varvec{\hat{{\beta }}}})/2, \end{aligned}$$

where

$$\begin{aligned} \chi _k(h)= -D_k({\varvec{\hat{{\beta }}}})^\mathrm{T}(\varvec{\tilde{{\beta }}}_k-{\varvec{\hat{{\beta }}}}_k) -\rho _{\min } \Vert \varvec{\tilde{{\beta }}}_k-{\varvec{\hat{{\beta }}}}_k\Vert _2^2 + \big \{J_{\lambda ,\gamma }^{(k)}(\Vert \varvec{{\beta }}_k^h\Vert _1)-J_{\lambda ,\gamma }^{(k)}(\Vert \varvec{\tilde{{\beta }}}_k\Vert _1)\big \}/h. \end{aligned}$$

First, consider cases where \(k\in \mathcal{L}({\varvec{\hat{{\beta }}}})\). If \(k \in \mathcal{N}(\varvec{\tilde{{\beta }}})\) then,

$$\begin{aligned} \chi _k(h)= & {} D_k({\varvec{\hat{{\beta }}}})^\mathrm{T}{\varvec{\hat{{\beta }}}}_k-\rho _{\min }\Vert {\varvec{\hat{{\beta }}}}_k\Vert _2^2+J_{\lambda ,\gamma }^{(k)}(h\Vert {\varvec{\hat{{\beta }}}}_k\Vert _1)/h\\\le & {} -\sum _{\hat{\beta }_{kj}\ne 0}\gamma \mathrm{sign}(\hat{\beta }_{kj})\hat{\beta }_{kj}-\rho _{\min }\Vert {\varvec{\hat{{\beta }}}}_k\Vert _2^2+\lambda \Vert {\varvec{\hat{{\beta }}}}_k\Vert _1\\\le & {} \Vert {\varvec{\hat{{\beta }}}}_k\Vert _1(-\rho _{\min }\Vert {\varvec{\hat{{\beta }}}}_k\Vert _1/p_k+\lambda -\gamma ) < 0, \end{aligned}$$

from the condition \(\Vert {\varvec{\hat{{\beta }}}}_k\Vert _1/p_k > (\lambda -\gamma )/\rho _{\min }\). If \(k \in \mathcal{L}(\varvec{\tilde{{\beta }}})\) then,

$$\begin{aligned} \chi _k(h)< & {} -D_k({\varvec{\hat{{\beta }}}})^\mathrm{T}(\varvec{\tilde{{\beta }}}_k-{\varvec{\hat{{\beta }}}}_k) + \big \{J_{\lambda ,\gamma }^{(k)}(\Vert \varvec{{\beta }}_k^h\Vert _1) -J_{\lambda ,\gamma }^{(k)}(\Vert \varvec{\tilde{{\beta }}}_k\Vert _1)\big \}/h\\\le & {} \gamma (\Vert \varvec{\tilde{{\beta }}}_k\Vert _1-\Vert {\varvec{\hat{{\beta }}}}_k\Vert _1)+\gamma (\Vert {\varvec{\hat{{\beta }}}}_k\Vert _1-\Vert \varvec{\tilde{{\beta }}}_k\Vert _1)=0, \end{aligned}$$

unless \(\varvec{\tilde{{\beta }}}_k = {\varvec{\hat{{\beta }}}}_k\). If \(k \in \mathcal{S}(\varvec{\tilde{{\beta }}})\), we have

$$\begin{aligned} \sup _{k \in \mathcal{S}(\tilde{\varvec{\beta }})}\big \{\rho _{\min }\Vert \varvec{\tilde{{\beta }}}_k\Vert _1/p_k+\nabla J_{\lambda ,\gamma }(\Vert \varvec{\tilde{{\beta }}}_k\Vert _1)\big \} \le \max \big \{\lambda ,\rho _{\min } a(\lambda -\gamma )+\gamma \big \}, \end{aligned}$$

which implies

$$\begin{aligned} \chi _k(h)\le & {} \gamma (\Vert \varvec{\tilde{{\beta }}}_k\Vert _1-\Vert {\varvec{\hat{{\beta }}}}_k\Vert _1) -\rho _{\min } \Vert \varvec{\tilde{{\beta }}}_k-{\varvec{\hat{{\beta }}}}_k\Vert _2^2 + \nabla J_{\lambda ,\gamma }(\Vert \varvec{\tilde{{\beta }}}_k\Vert _1)(\Vert {\varvec{\hat{{\beta }}}}_k\Vert _1-\Vert \varvec{\tilde{{\beta }}}_k\Vert _1)\\\le & {} (\Vert {\varvec{\hat{{\beta }}}}_k\Vert _1-\Vert \varvec{\tilde{{\beta }}}_k\Vert _1)\big \{-\gamma -\rho _{\min }(\Vert {\varvec{\hat{{\beta }}}}_k\Vert _1-\Vert \varvec{\tilde{{\beta }}}_k\Vert _1)/p_k +\nabla J_{\lambda ,\gamma }(\Vert \varvec{\tilde{{\beta }}}_k\Vert _1)\big \}\\\le & {} (\Vert {\varvec{\hat{{\beta }}}}_k\Vert _1-\Vert \varvec{\tilde{{\beta }}}_k\Vert _1)\big \{-\rho _{\min }\Vert {\varvec{\hat{{\beta }}}}_k\Vert _1/p_k +\max \big \{\lambda -\gamma ,\rho _{\min } a(\lambda -\gamma )\big \}\big \} < 0, \end{aligned}$$

unless \(\Vert \varvec{\tilde{{\beta }}}_k\Vert _1 = \Vert {\varvec{\hat{{\beta }}}}_k\Vert _1\). Second, consider cases where \(k\in \mathcal{N}({\varvec{\hat{{\beta }}}})\). It is easy to see that

$$\begin{aligned} \chi _k(h)\le & {} \Vert D_k({\varvec{\hat{{\beta }}}})\Vert _\infty \Vert \varvec{\tilde{{\beta }}}_k\Vert _1-\rho _{\min }\Vert \varvec{\tilde{{\beta }}}_k\Vert _2^2 +\big \{J_{\lambda ,\gamma }^{(k)}(\Vert \varvec{{\beta }}_k^h\Vert _1)-J_{\lambda ,\gamma }^{(k)}(\Vert \varvec{\tilde{{\beta }}}_k\Vert _1)\big \}/h\\\le & {} \Vert \varvec{\tilde{{\beta }}}_k\Vert _1\big \{\Vert D_k({\varvec{\hat{{\beta }}}})\Vert _\infty -\rho _{\min }\Vert \varvec{\tilde{{\beta }}}_k\Vert _1/p_k-\nabla J_{\lambda ,\gamma }(\Vert \varvec{\tilde{{\beta }}}_k\Vert _1)\big \}< 0, \end{aligned}$$

unless \(\Vert \varvec{\tilde{{\beta }}}_k\Vert _1 =0\), since

$$\begin{aligned} \inf _{k \in \mathcal{A}(\tilde{\varvec{\beta }})}\big \{\rho _{\min }\Vert \varvec{\tilde{{\beta }}}_k\Vert _1/p_k+\nabla J_{\lambda ,\gamma }(\Vert \varvec{\tilde{{\beta }}}_k\Vert _1) \big \} \ge \min \big \{\lambda ,a\rho _{\min }(\lambda -\gamma )+\gamma \big \}. \end{aligned}$$

Hence, we finally have \(\sum _{k=1}^K\chi _k(h) < 0\), unless \(\Vert \varvec{\tilde{{\beta }}}_k\Vert _1=\Vert {\varvec{\hat{{\beta }}}}_k\Vert _1\) for all \(k \le K\). This implies that there exists a \(\delta >0\) sufficiently small such that \(Q_{\lambda ,\gamma }(\varvec{{\beta }}^h)-Q_{\lambda ,\gamma }(\varvec{\tilde{{\beta }}})<0\) for all \(h\in (0,\delta )\) unless \(\Vert \varvec{\tilde{{\beta }}}_k\Vert _1=\Vert {\varvec{\hat{{\beta }}}}_k\Vert _1\) for all \(k\le K\). Hence, \({\varvec{\hat{{\beta }}}}\) is the unique local minimizer. \(\square \)

Proof of Theorem 1

Let \(A^o= \{(k,j):\hat{\beta }_{kj}^o\ne 0\}\). From Lemma 2, it suffices to show that \(\mathbf{P}(E_1 \cap E_2 \cap E_3) \ge 1-\mathbf{P}_1 - \mathbf{P}_2 - \mathbf{P}_3\), where

$$\begin{aligned}&E_1=\big \{|A^o\cup A_*| \le (\alpha _0+1)|A_*|\big \},\\&E_2=\big \{{\min }_{k\in \mathcal{A}({\varvec{\beta }}^*)}\Vert {\varvec{\hat{{\beta }}}}_k^o\Vert _1/p_k > a(\lambda -\gamma )\big \},\\&E_3= \big \{{\max }_{k\in \mathcal{N}({\varvec{\beta }}^*)}\Vert D_k({\varvec{\hat{{\beta }}}})\Vert _\infty <\lambda \big \}. \end{aligned}$$

First, consider the event \(E_1\). From Corollary 2 of Zhang and Zhang (2012), we have \(F \subset E_1\) provided that \(\phi _{\max }(\alpha _0|A_*|)/\alpha _0 \le \eta _{\min }/36\), where \(F =\big \{\max _{k \in \mathcal{A}({\varvec{\beta }}^*)}\Vert \mathbf{X}_k^\mathrm{T}\varvec{{\varepsilon }}/n\Vert _\infty \le \gamma /2\big \}\) and

$$\begin{aligned} \eta _{\min } = \inf _{{\varvec{\upsilon }}\in \mathbb {R}^{|G_*|}:\Vert {\varvec{\upsilon }}_{A_*^c}\Vert _1\le 3\Vert {\varvec{\upsilon }}_{A_*}\Vert _1} \big \{(|A_*|/n)\Vert \mathbf{X}_{G_*}^\mathrm{T}\mathbf{X}_{G_*}\varvec{{\upsilon }}\Vert _{\infty }/\Vert \varvec{{\upsilon }}\Vert _1\big \} \end{aligned}$$

is the cone invertible factor in Ye and Zhang (2010). On the other hand, inequality (7) of Zhang and Zhang (2012) proves \(\eta _{\min } \ge \delta _{\min }^2/16\), where

$$\begin{aligned} \delta _{\min }= \inf _{{\varvec{\upsilon }}\in \mathbb {R}^{|G_*|}:\Vert {\varvec{\upsilon }}_{A_*^c}\Vert _1\le 3\Vert {\varvec{\upsilon }}_{A_*}\Vert _1} \big \{ (1/\sqrt{n})\Vert \mathbf{X}_{G_*}\varvec{{\upsilon }}\Vert _2/\Vert \varvec{{\upsilon }}_{A_*}\Vert _2 \big \} \end{aligned}$$

is the restricted eigenvalue in Bickel et al. (2009) that satisfies \(\delta _{\min } \ge \sqrt{\kappa _{\min }}(1-3\sqrt{\phi _{\max }(\alpha _0|A_*|)/\alpha _0 \kappa _{\min }})\). Hence, (C2) implies that \(F \subset E_1\) and

$$\begin{aligned} \mathbf{P}(E_1^c)\le \mathbf{P}(F^c)\le & {} \sum _{k \in \mathcal{A}({\varvec{\beta }}^*)}\sum _{j=1}^{p_k} \mathbf{P}\big (|\mathbf{X}_{kj}^\mathrm{T}\varvec{{\varepsilon }}/n| > \gamma /2\big )\\\le & {} c_0|G_*|\exp (-d_0n\gamma ^2/4) =\mathbf{P}_1. \end{aligned}$$
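
Here and in the two bounds below, (C1) (stated in the main text, not reproduced in this excerpt) enters only through a sub-Gaussian tail inequality of the form

$$\begin{aligned} \mathbf{P}\big (|\mathbf{a}^\mathrm{T}\varvec{{\varepsilon }}|\ge t\big )\le c_0\exp \big (-d_0t^2/\Vert \mathbf{a}\Vert _2^2\big ),\quad \mathbf{a}\in \mathbb {R}^n,\ t>0: \end{aligned}$$

taking \(\mathbf{a}=\mathbf{X}_{kj}/n\), so that \(\Vert \mathbf{a}\Vert _2^2=1/n\), and \(t=\gamma /2\) gives the exponent \(-d_0n\gamma ^2/4\) above, and the choices \(\mathbf{a}=\mathbf{v}_{kj}\) and \(\mathbf{a}=\mathbf{w}_{kj}\) below give \(\mathbf{P}_2\) and \(\mathbf{P}_3\).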

Second, consider the event \(E_2\). From the first order optimality conditions (Rosset and Zhu 2007), \({\varvec{\hat{{\beta }}}}^o\) satisfies

$$\begin{aligned} \begin{array}{lll} \mathbf{X}_{kj}^\mathrm{T}(\mathbf{y}-\mathbf{X}{\varvec{\hat{{\beta }}}}^o)/n = \gamma \mathrm{sign}(\hat{\beta }_{kj}^o),&{}\quad \hat{\beta }_{kj}^o \ne 0,\\ |\mathbf{X}_{kj}^\mathrm{T}(\mathbf{y}-\mathbf{X}{\varvec{\hat{{\beta }}}}^o)/n| \le \gamma ,&{}\quad \hat{\beta }_{kj}^o =0, \end{array} \end{aligned}$$
(17)

for all \((k,j) \in G_*\). Let \(S=A^o \cup A_*\) and \({\varvec{\hat{{\beta }}}}_S^o\) be the vector that consists of elements \(\hat{\beta }_{kj}^o\) for \((k,j) \in S\). On the event \(E_1\), (C2) implies

$$\begin{aligned} {\varvec{\hat{{\beta }}}}_{S}^o - \varvec{{\beta }}_{S}^* =\varvec{{\Sigma }}_{S}^{-1}\{ -\mathbf{X}_{S}^\mathrm{T}(\mathbf{y}-\mathbf{X}_{S}{\varvec{\hat{{\beta }}}}_{S}^o)/n+ \mathbf{X}_{S}^\mathrm{T}\varvec{{\varepsilon }}/n\}, \end{aligned}$$
(18)

where \(\varvec{{\Sigma }}_{S}=\mathbf{X}_{S}^\mathrm{T}\mathbf{X}_{S}/n\). Let \(\mathbf{u}_{kj}\) be a vector of length \(|S|\le (\alpha _0+1)|A_*|\) whose only nonzero entry, in the position corresponding to \(\beta _{kj}^*\), equals 1. Then, from (18), we can write

$$\begin{aligned} \hat{\beta }_{kj}^o - \beta _{kj}^*=\mathbf{u}_{kj}^\mathrm{T}({\varvec{\hat{{\beta }}}}_{S}^o - \varvec{{\beta }}_{S}^*)=\eta _{kj} + \mathbf{v}_{kj}^\mathrm{T}\varvec{{\varepsilon }}, \end{aligned}$$

where \(\eta _{kj}=-\mathbf{u}_{kj}^\mathrm{T}\varvec{{\Sigma }}_{S}^{-1}\mathbf{X}_{S}^\mathrm{T}(\mathbf{y}-\mathbf{X}_{S}{\varvec{\hat{{\beta }}}}_{S}^o)/n\) and \(\mathbf{v}_{kj}=\mathbf{X}_{S}\varvec{{\Sigma }}_{S}^{-1}\mathbf{u}_{kj}/n.\) Note that

$$\begin{aligned} |\eta _{kj}| \le \Vert \mathbf{u}_{kj}\Vert _2\Vert \mathbf{X}_{S}^\mathrm{T}(\mathbf{y}-\mathbf{X}_{S}{\varvec{\hat{{\beta }}}}_{S}^o)/n\Vert _2/\kappa _{\min } \le \gamma \sqrt{|S|}/\kappa _{\min } \end{aligned}$$

and

$$\begin{aligned} \Vert \mathbf{v}_{kj}\Vert _2^2= \mathbf{u}_{kj}^\mathrm{T}\varvec{{\Sigma }}_{S}^{-1}\mathbf{X}_{S}^\mathrm{T} \mathbf{X}_{S}\varvec{{\Sigma }}_{S}^{-1}\mathbf{u}_{kj}/n^2 \le 1/(n\kappa _{\min }). \end{aligned}$$

From (C1), it is easy to see that

$$\begin{aligned} \mathbf{P}_{E_1}\big (|\hat{\beta }_{kj}^o-\beta _{kj}^*| \ge \Vert \varvec{{\beta }}_k^*\Vert _1/p_k-a(\lambda -\gamma )\big )\le & {} \mathbf{P}\big (|\eta _{kj}|+|\mathbf{v}_{kj}^\mathrm{T}\varvec{{\varepsilon }}|\ge m_*-a(\lambda -\gamma )\big )\\\le & {} \mathbf{P}\big ( |\mathbf{v}_{kj}^\mathrm{T}\varvec{{\varepsilon }} |\ge m_*-a(\lambda -\gamma )-\gamma \sqrt{|S|}/\kappa _{\min }\big )\\\le & {} c_0 \exp \big (-d_0\kappa _{\min } n\xi _{\lambda ,\gamma }^{*2}\big ), \end{aligned}$$

where \(\mathbf{P}_{E_1}(A)=\mathbf{P}( E_1\cap A)\). Hence, by using the triangle inequality \(\Vert {\varvec{\hat{{\beta }}}}_k^o\Vert _1 \ge \Vert \varvec{{\beta }}_k^*\Vert _1 -\Vert {\varvec{\hat{{\beta }}}}_k^o-\varvec{{\beta }}_k^*\Vert _1 \), we have

$$\begin{aligned} \mathbf{P}_{E_1}\big (E_2^c\big )\le & {} \sum _{k\in \mathcal{A}({\varvec{\beta }}^*)}\mathbf{P}_{E_1}\big (\Vert {\varvec{\hat{{\beta }}}}_k^o-\varvec{{\beta }}_k^*\Vert _1/p_k\ge \Vert \varvec{{\beta }}_k^*\Vert _1/p_k-a(\lambda -\gamma )\big ) \nonumber \\\le & {} \sum _{(k,j) \in S}\mathbf{P}\big (|\hat{\beta }_{kj}^o-\beta _{kj}^*|\ge \Vert \varvec{{\beta }}_k^*\Vert _1/p_k-a(\lambda -\gamma )\big )\nonumber \\\le & {} c_0(\alpha _0+1)|A_*| \exp \big (-d_0\kappa _{\min } n\xi _{\lambda ,\gamma }^{*2}\big )=\mathbf{P}_2. \end{aligned}$$
(19)

Third, consider the event \(E_3\). From (18), we can write

$$\begin{aligned} \mathbf{X}_{kj}^\mathrm{T}(\mathbf{y}-\mathbf{X}_{S} {\varvec{\hat{{\beta }}}}_{S}^o)/n =\mathbf{X}_{kj}^\mathrm{T}(\mathbf{X}_{S}\varvec{{\beta }}_{S}^*-\mathbf{X}_{S}{\varvec{\hat{{\beta }}}}_{S}^o+\varvec{{\varepsilon }})/n = \zeta _{kj}+\mathbf{w}_{kj}^\mathrm{T}\varvec{{\varepsilon }}, \end{aligned}$$

for all \((k,j) \in S\), where \(\zeta _{kj}=\mathbf{X}_{kj}^\mathrm{T}\mathbf{X}_{S}\varvec{{\Sigma }}_{S}^{-1}\mathbf{X}_{S}^\mathrm{T}(\mathbf{y}-\mathbf{X}_{S}{\varvec{\hat{{\beta }}}}_{S}^o)/n^2\), \(\mathbf{w}_{kj}=(\mathbf{I}-\varvec{{\Pi }}_{S})\mathbf{X}_{kj}/n\) and \(\varvec{{\Pi }}_{S}=\mathbf{X}_{S}(\mathbf{X}_{S}^\mathrm{T}\mathbf{X}_{S})^{-1}\mathbf{X}_{S}^\mathrm{T}.\) Note that from (17),

$$\begin{aligned} |\zeta _{kj}|= & {} |\mathbf{X}_{kj}^\mathrm{T}\mathbf{X}_{S}\varvec{{\Sigma }}_{S}^{-1}\mathbf{X}_{S}^\mathrm{T}(\mathbf{y}-\mathbf{X}_{S}{\varvec{\hat{{\beta }}}}_{S}^o)/n^2|\\\le & {} \Vert \varvec{{\Sigma }}_{S}^{-1/2}\mathbf{X}_{S}^\mathrm{T} \mathbf{X}_{kj}/n\Vert _2\Vert \varvec{{\Sigma }}_{S}^{-1/2}\mathbf{X}_{S}^\mathrm{T}(\mathbf{y}-\mathbf{X}_{S}{\varvec{\hat{{\beta }}}}_{S}^o)/n\Vert _2\\\le & {} \Vert \varvec{{\Pi }}_{S} \mathbf{X}_{kj}/\sqrt{n}\Vert _2 \Vert \varvec{{\Sigma }}_{S}^{-1/2}\mathbf{X}_{S}^\mathrm{T}(\mathbf{y}-\mathbf{X}_{S}{\varvec{\hat{{\beta }}}}_{S}^o)/n\Vert _2\\\le & {} \gamma \sqrt{|S|/\kappa _{\min }} \end{aligned}$$

and \(\Vert \mathbf{w}_{kj}\Vert _2^2 =\mathbf{X}_{kj}^\mathrm{T}(\mathbf{I}-\varvec{{\Pi }}_{S})\mathbf{X}_{kj}/n^2 \le \Vert \mathbf{X}_{kj}\Vert _2^2/n^2 = 1/n\). Hence, from (C1),

$$\begin{aligned} \mathbf{P}_{E_1}\big (E_3^c\big )\le & {} \sum _{k\in \mathcal{N}({\varvec{\beta }}^*)}\sum _{j=1}^{p_k} \mathbf{P}\big (|\mathbf{X}_{kj}^\mathrm{T}(\mathbf{y}-\mathbf{X}{\varvec{\hat{{\beta }}}}^o)/n|\ge \lambda \big ) \nonumber \\\le & {} \sum _{k\in \mathcal{N}({\varvec{\beta }}^*)}\sum _{j=1}^{p_k} \mathbf{P}\big (|\mathbf{w}_{kj}^\mathrm{T}\varvec{{\varepsilon }}| \ge \lambda - \gamma \sqrt{|S|}/\sqrt{\kappa _{\min }}\big )\nonumber \\\le & {} c_0(p-|G_*|)\exp \big (-d_0n\zeta _{\lambda ,\gamma }^{*2}\big ) =\mathbf{P}_3. \end{aligned}$$
(20)

Hence, using \(\mathbf{P}(E_1 \cap E_2 \cap E_3) \ge 1-\mathbf{P}(E_1^c) - \mathbf{P}(E_1 \cap E_2^c) - \mathbf{P}(E_1 \cap E_3^c)\), we complete the proof. \(\square \)

Proof of Theorem 2

Suppose that there is another local minimizer \(\varvec{\tilde{{\beta }}}\in \varvec{{\Omega }}_{\lambda ,\gamma }((\alpha _0+1)|A_*|)\) such that \({\varvec{\hat{{\beta }}}}^o\ne \varvec{\tilde{{\beta }}}\). Let \(S=\{(k,j):\tilde{\beta }_{kj}\ne 0\} \cup A^o \cup A_*\). By replacing \(\mathbf{X}\) with \(\mathbf{X}_S\) in the proof of Lemma 3, we can see that if \({\varvec{\hat{{\beta }}}}^o\) satisfies the conditions in Lemma 3, then \({\varvec{\hat{{\beta }}}}^o=\varvec{\tilde{{\beta }}}\). Since \(|S|\le 2(\alpha _0+1)|A_*|\), we have \(\lambda _{\min }(\mathbf{X}_S^\mathrm{T}\mathbf{X}_S/n) \ge \kappa _{\min }\) from (C2). Hence it suffices to show that

$$\begin{aligned}&\mathbf{P}_{E_1}\big (\min _{k\in \mathcal{A}({\varvec{\beta }}^*)}\Vert {\varvec{\hat{{\beta }}}}_k^o\Vert _1/p_k \le \max \{a,1/\kappa _{\min }\}(\lambda -\gamma ) \big ) \le \mathbf{P}_2\\&\mathbf{P}_{E_1}\big ({\max }_{k\in \mathcal{N}({\varvec{\beta }}^*)}\Vert D_k({\varvec{\hat{{\beta }}}}^o)\Vert _{\infty }<\min \big \{\lambda , a\kappa _{\min }(\lambda -\gamma )+\gamma \big \}\big ) \le \mathbf{P}_3, \end{aligned}$$

which follows as in the proofs of (19) and (20) in the proof of Theorem 1. \(\square \)

About this article

Cite this article

Kwon, S., Ahn, J., Jang, W. et al. A doubly sparse approach for group variable selection. Ann Inst Stat Math 69, 997–1025 (2017). https://doi.org/10.1007/s10463-016-0571-z
