Learning Sparse FRAME Models for Natural Image Patterns

Xie, Jianwen; Hu, Wenze; Zhu, Song-Chun; Wu, Ying Nian

doi:10.1007/s11263-014-0757-x

Published: 02 October 2014

International Journal of Computer Vision, Volume 114, pages 91–112 (2015)

Abstract

It is well known that natural images admit sparse representations by redundant dictionaries of basis functions such as Gabor-like wavelets. However, it is still an open question as to what the next layer of representational units above the layer of wavelets should be. We address this fundamental question by proposing a sparse FRAME (Filters, Random field, And Maximum Entropy) model for representing natural image patterns. Our sparse FRAME model is an inhomogeneous generalization of the original FRAME model. It is a non-stationary Markov random field model that reproduces the observed statistical properties of filter responses at a subset of selected locations, scales and orientations. Each sparse FRAME model is intended to represent an object pattern and can be considered a deformable template. The sparse FRAME model can be written as a shared sparse coding model, which motivates us to propose a two-stage algorithm for learning the model. The first stage selects the subset of wavelets from the dictionary by a shared matching pursuit algorithm. The second stage then estimates the parameters of the model given the selected wavelets. Our experiments show that the sparse FRAME models are capable of representing a wide variety of object patterns in natural images and that the learned models are useful for object classification.

References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.
Adler, A., Elad, M., & Hel-Or, Y. (2013). Probabilistic Subspace Clustering via Sparse Representations. IEEE Signal Processing Letters, 20, 63–66.
Aharon, M., Elad, M., & Bruckstein, A. M. (2006). The K-SVD: An algorithm for designing of overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54, 4311–4322.
Bengio, Y., Courville, A. C., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on PAMI, 35, 1798–1828.
Bruckstein, A. M., Donoho, D. L., & Elad, M. (2009). From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51, 34–81.
Chen, S. S., Donoho, D. L., & Saunders, M. A. (2001). Atomic decomposition by basis pursuit. SIAM Review, 43, 129–159.
Chen, J., & Huo, X. (2005). Sparse representations for multiple measurements vectors (mmv) in an overcomplete dictionary. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. 257–260.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 886–893.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.
Duane, S., Kennedy, A. D., Pendleton, B. J., & Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters, 195, 216–222.
Elad, M. (2010). Sparse and redundant representations: From theory to applications in signal and image processing. Berlin: Springer.
Elad, M., & Aharon, M. (2006). Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15, 3736–3745.
Elad, M., Milanfar, P., & Rubinstein, R. (2007). Analysis versus synthesis in signal priors. Inverse Problems, 23(3), 947.
Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., & Lin, C. J. (2008). LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 1871–1874.
Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In: Proceedings of the Computer Vision and Pattern Recognition Workshops.
Felzenszwalb, P., Girshick, R., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on PAMI, 32, 1627–1645.
Ferrari, V., Jurie, F., & Schmid, C. (2010). From images to shape models for object detection. International Journal of Computer Vision, 87, 284–303.
Fidler, S., Boben, M., & Leonardis, A. (2008). Similarity-based cross-layered hierarchical representation for object categorization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82, 249–266.
Gelman, A., & Meng, X. L. (1998). Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Statistical Science, 13, 163–185.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on PAMI, 6, 721–741.
Geman, S., Potter, D. F., & Chi, Z. (2002). Composition systems. Quarterly of Applied Mathematics, 60, 707–736.
Gong, B., Shi, Y., Sha, F. & Grauman, K. (2012). Geodesic flow kernel for unsupervised domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Gopalan, R., Li, R., & Chellappa, R. (2011). Domain adaptation for object recognition: an unsupervised approach. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV).
Griffin, G., Holub, A., & Perona, P. (2007). Caltech-256 object category dataset. Caltech: Technical report.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14, 1771–1800.
Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
Hoffman, J., Rodner, E., Donahue, J., Saenko, K., & Darrell, T. (2013). Efficient learning of domain-invariant image representations. In: Proceedings of the International Conference of Learning Representations.
Hong, Y., Si, Z., Hu, W., Zhu, S. C., & Wu, Y. N. (2013). Unsupervised learning of compositional sparse code for natural image representation. Quarterly of Applied Mathematics, 72, 373–406.
Jhou, I., Liu, D., Lee, D. T. & Chang, S. (2012). Robust visual domain adaptation with low-rank reconstruction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In: Proceedings of the 26th Annual International Conference on Machine Learning.
Liu, C., Zhu, S.-C., & Shum, H.-Y. (2001). Learning inhomogeneous Gibbs model of faces by minimax entropy. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 281–287.
Lounici, K., Tsybakov, A. B., Pontil, M., & van de Geer, S. A. (2009). Taking advantage of sparsity in multi-task learning. In: Proceedings of the 22nd Conference on Learning Theory.
Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60, 91–110.
Mallat, S., & Zhang, Z. (1993). Matching pursuit in a time-frequency dictionary. IEEE Transactions on Signal Processing, 41, 3397–3415.
Marszalek, M., & Schmid, C. (2007). Accurate object localization with shape masks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Nam, S., Davies, M. E., Elad, M., & Gribonval, R. (2013). The cosparse analysis model and algorithms. Applied and Computational Harmonic Analysis, 34, 30–56.
Neal, R. (2001). Annealed importance sampling. Statistics and Computing, 11, 125–139.
Neal, R. (2011). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo.
Obozinski, G., Wainwright, M. J., & Jordan, M. I. (2011). Support union recovery in high-dimensional multivariate regression. Annals of Statistics, 39, 1–47.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Pati, Y. C., Rezaiifar, R., & Krishnaprasad, P. S. (1993). Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In: Proceedings of the 27th Asilomar Conference on Signals, Systems and Computers, pp. 40–44.
Pietra, S. D., Pietra, V. D., & Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on PAMI, 19, 380–393.
Ranzato, M., & Hinton, G. E. (2010). Modeling pixel means and covariances using factorized third-order Boltzmann machines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2, 1019–1025.
Roth, S., & Black, M. (2009). Fields of experts. International Journal of Computer Vision, 82, 205–229.
Rubinstein, R., Zibulevsky, M., & Elad, M. (2010). Double sparsity: Learning sparse dictionaries for sparse signal approximation. IEEE Transactions on Signal Processing, 58, 1553–1564.
Saenko, K., Kulis, B., Fritz, M. & Darrell, T. (2010). Adapting visual category models to new domains. In: Proceedings of the European Conference on Computer Vision (ECCV).
Shekhar, S., Patel, V. M., Nguyen, H. V., & Chellappa, R. (2013). Generalized domain adaptive dictionaries. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Si, Z., & Zhu, S. C. (2012). Learning hybrid image template (HIT) by information projection. IEEE Transactions on PAMI, 34, 1354–1367.
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (pp. 194–281). Cambridge: MIT Press.
Teh, Y. W., Welling, M., Osindero, S., & Hinton, G. E. (2003). Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4, 1235–1260.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, B, 58, 267–288.
Tropp, J., Gilbert, A., & Strauss, M. (2006). Algorithms for simultaneous sparse approximation. Part I: Greedy pursuit. Signal Processing, 86, 572–588.
Tuytelaars, T., Lampert, C. H., Blaschko, M. B., & Buntine, W. (2009). Unsupervised object discovery: A comparison. International Journal of Computer Vision, 88(2), 284–302.
Vapnik, V. N. (2000). The nature of statistical learning theory. Berlin: Springer.
Welling, M., Hinton, G. E., & Osindero, S. (2003). Learning sparse topographic representations with products of student-t distributions. In: Proceedings of Advances in Neural Information Processing Systems (NIPS).
Wu, Y. N., Si, Z., Gong, H., & Zhu, S. C. (2010). Learning active basis model for object detection and recognition. International Journal of Computer Vision, 90, 198–235.
Xie, J., Hu, W., Zhu, S. C., & Wu, Y. N. (2014). Learning Inhomogeneous FRAME models for object patterns. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Yang, M., Zhang, L., Feng, X., & Zhang, D. (2011). Fisher discrimination dictionary learning for sparse representation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 543–550.
Younes, L. (1999). On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastics and Stochastic Reports, 65, 177–228.
Zeiler, M., Taylor, G., & Fergus, R. (2011). Adaptive deconvolutional networks for mid and high level feature learning. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV).
Zhu, L., Lin, C., Huang, H., Chen, Y., & Yuille, A. (2008). Unsupervised structure learning: hierarchical recursive composition, suspicious coincidence and competitive exclusion. In: Proceedings of the European Conference on Computer Vision (ECCV).
Zhu, S. C., & Mumford, D. B. (2006). A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision, 2, 259–362.
Zhu, S. C., Wu, Y. N., & Mumford, D. B. (1998). Minimax entropy principle and its application to texture modeling. Neural Computation, 9, 1627–1660.

Acknowledgments

The work is supported by NSF DMS 1310391, NSF IIS 1423305, ONR MURI N00014-10-1-0933, DARPA MSEE FA8650-11-1-7149. We thank the three reviewers for their insightful comments and valuable suggestions that have helped us improve the presentation and the content of this paper. We are grateful to one reviewer for sharing the insights on the analysis prior models. Thanks also go to an editor of the special issue for helpful suggestions. We thank Adrian Barbu for discussions.

Author information

Authors and Affiliations

Department of Statistics, UCLA, Los Angeles, CA, USA
Jianwen Xie, Wenze Hu, Song-Chun Zhu & Ying Nian Wu

Corresponding author

Correspondence to Ying Nian Wu.

Additional information

Communicated by Julien Mairal, Francis Bach, and Michael Elad.

Appendices

Appendix: Simulation by Hamiltonian Monte Carlo

To approximate $\mathrm{E}_{p(\mathbf{I};\lambda ^{(t)})}[|\langle \mathbf{I}, B_{x, s, \alpha }\rangle |]$ in Eq. (9), we need to draw a synthesized sample set $\{\tilde{\mathbf{I}}_m\}$ from $p(\mathbf{I};\lambda ^{(t)})$ by HMC (Duane et al. 1987). We can write $p(\mathbf{I}; \lambda )$ as $p(\mathbf{I}) \propto \exp (-U(\mathbf{I}))$, where $\mathbf{I}\in R^{|\mathcal{D}|}$ and

$$\begin{aligned} U(\mathbf{I})=-\sum _{x, s, \alpha } \lambda _{x, s, \alpha } \big | \langle \mathbf{I}, B_{x, s, \alpha } \rangle \big |+ \frac{1}{2} |\mathbf{I}|^2 \end{aligned}$$

(33)

(assuming $\sigma ^2 = 1$). In a physics context, $\mathbf{I}$ can be regarded as a position vector and $U(\mathbf{I})$ as the potential energy function. To allow Hamiltonian dynamics to operate, we need to introduce an auxiliary momentum vector ${\varvec{\phi }}\in R^{|\mathcal{D}|}$ and the corresponding kinetic energy function $K({\varvec{\phi }})=|{\varvec{\phi }}|^2/(2m)$, where $m$ represents the mass. This defines a fictitious physical system described by the canonical coordinates $(\mathbf{I},{\varvec{\phi }})$, whose total energy is $H(\mathbf{I},{\varvec{\phi }})=U(\mathbf{I})+K({\varvec{\phi }})$. Instead of sampling from $p(\mathbf{I})$ directly, HMC samples from the joint canonical distribution $p(\mathbf{I},{\varvec{\phi }}) \propto \exp (-H(\mathbf{I},{\varvec{\phi }}))$, under which $\mathbf{I}\sim p(\mathbf{I})$ marginally and ${\varvec{\phi }}$ follows a Gaussian distribution and is independent of $\mathbf{I}$. At each iteration, HMC draws a random sample from the marginal Gaussian distribution of ${\varvec{\phi }}$ and then evolves the state $(\mathbf{I},{\varvec{\phi }})$ according to the Hamiltonian dynamics, which conserve the total energy.

In a practical implementation, the leapfrog algorithm is used to discretize the continuous Hamiltonian dynamics as follows, with $\epsilon $ being the step size:

$$\begin{aligned}&{\varvec{\phi }}^{(t+\epsilon /2)}={\varvec{\phi }}^{(t)}-\big (\epsilon /2\big )\frac{\partial U}{\partial \mathbf{I}}\big (\mathbf{I}^{(t)}\big ), \end{aligned}$$

(34)

$$\begin{aligned}&\mathbf{I}^{(t+\epsilon )}= \mathbf{I}^{(t)} + \epsilon \frac{{\varvec{\phi }}^{(t+\epsilon /2)}}{m}, \end{aligned}$$

(35)

$$\begin{aligned}&{\varvec{\phi }}^{(t+\epsilon )}={\varvec{\phi }}^{(t+\epsilon /2)}-(\epsilon /2)\frac{\partial U}{\partial \mathbf{I}}\big (\mathbf{I}^{(t+\epsilon )}\big ), \end{aligned}$$

(36)

that is, a half-step update of ${\varvec{\phi }}$ is performed first and then it is used to compute $\mathbf{I}^{(t + \epsilon )}$ and ${\varvec{\phi }}^{(t + \epsilon )}$.
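The three updates (34) to (36) can be sketched as a single function. This is a minimal illustration, assuming the gradient of the potential is supplied as a callable; the names `leapfrog_step` and `grad_potential` are hypothetical and do not come from the paper.

```python
import numpy as np

def leapfrog_step(position, momentum, grad_potential, step_size, mass=1.0):
    """One leapfrog update of (I, phi) following Eqs. (34)-(36).

    grad_potential is an assumed callable returning dU/dI at a given position.
    """
    # Eq. (34): half-step update of the momentum.
    momentum_half = momentum - 0.5 * step_size * grad_potential(position)
    # Eq. (35): full-step update of the position using the half-step momentum.
    new_position = position + step_size * momentum_half / mass
    # Eq. (36): second half-step update of the momentum at the new position.
    new_momentum = momentum_half - 0.5 * step_size * grad_potential(new_position)
    return new_position, new_momentum
```

A useful sanity check on such a step is time reversibility: applying the step, negating the momentum, and applying the step again should return to the starting position.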

A key step in the leapfrog algorithm is the computation of the derivative of the potential energy function

$$\begin{aligned} \frac{\partial U}{ \partial \mathbf{I}}=-\sum _{x, s, \alpha } \lambda _{x, s, \alpha } \text{ sign }\big ( \langle \mathbf{I}, B_{x, s, \alpha } \rangle \big )B_{x, s, \alpha }+ \mathbf{I}, \end{aligned}$$

(37)

where the map of responses $r_{x, s, \alpha } = \langle \mathbf{I}, B_{x, s, \alpha } \rangle $ is computed by bottom-up convolution of the filter corresponding to $(s, \alpha )$ with $\mathbf{I}$ for each $(s, \alpha )$. Then the derivative is computed by top-down linear superposition of the basis functions: $-\sum _{x, s, \alpha } \lambda _{x, s, \alpha } \text{ sign }( r_{x, s, \alpha } )B_{x, s, \alpha } + \mathbf{I}$, which can again be computed by convolution. Both bottom-up and top-down convolutions can be carried out efficiently by GPUs.
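As a small illustration of (33) and (37), the sketch below evaluates the potential and its derivative. It assumes, for simplicity, that the selected basis functions are stored as the columns of a dense matrix `basis` of shape $|\mathcal{D}| \times n$ and applied by matrix products, rather than by the bottom-up and top-down convolutions described above; the function names are hypothetical.

```python
import numpy as np

def potential_energy(image, basis, lam):
    """U(I) = -sum_i lam_i |<I, B_i>| + |I|^2 / 2, assuming sigma^2 = 1 (Eq. 33)."""
    responses = basis.T @ image  # bottom-up step: r_i = <I, B_i>
    return -np.sum(lam * np.abs(responses)) + 0.5 * image @ image

def potential_gradient(image, basis, lam):
    """dU/dI = -sum_i lam_i sign(r_i) B_i + I (Eq. 37)."""
    responses = basis.T @ image
    # Top-down step: linear superposition of basis functions.
    # np.sign returns 0 at r_i = 0, which picks one subgradient of |r_i|.
    return -basis @ (lam * np.sign(responses)) + image
```

Comparing `potential_gradient` with a central finite difference of `potential_energy` is a quick way to confirm that the two functions agree away from zero responses.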

The discretization of the leapfrog algorithm cannot keep $H(\mathbf{I}, {\varvec{\phi }})$ exactly constant, so a Metropolis acceptance/rejection step is used to correct the discretization error. Starting with the current state, $(\mathbf{I},{\varvec{\phi }})$, the new state $(\mathbf{I}^ \star ,{\varvec{\phi }}^ \star )$, after $L$ leapfrog steps, is accepted as the next state of the Markov chain with probability $ \min [1, \exp (-H(\mathbf{I}^ \star ,{\varvec{\phi }}^ \star )+H(\mathbf{I},{\varvec{\phi }})) ]. $ If it is not accepted, the next state is the same as the current state.

In summary, a complete description of the HMC sampler for inhomogeneous FRAME is as follows:

(i) Generate the momentum vector ${\varvec{\phi }}$ from its marginal distribution $p({\varvec{\phi }}) \propto \exp (-K({\varvec{\phi }}))$, which is the zero-mean Gaussian distribution with covariance matrix $m I$ ($I$ is the identity matrix).
(ii) Perform $L$ leapfrog steps to reach the new state $(\mathbf{I}^{\star },{\varvec{\phi }}^{\star })$.
(iii) Perform acceptance/rejection of the proposed state $(\mathbf{I}^{\star },{\varvec{\phi }}^{\star })$.

$L$, $\epsilon $, and $m$ are parameters of the algorithm, which need to be tuned to obtain good performance.
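Steps (i) to (iii) can be put together in a short sketch of one HMC transition. It is a generic illustration under stated assumptions, not the authors' implementation: `potential` and `grad_potential` are assumed callables for $U$ and $\partial U/\partial \mathbf{I}$, the state is a floating-point NumPy vector, and the name `hmc_transition` is hypothetical.

```python
import numpy as np

def hmc_transition(image, potential, grad_potential, step_size, n_leapfrog, mass, rng):
    """One HMC transition; returns (next_state, accepted)."""
    # (i) Draw the momentum from N(0, m I).
    momentum = rng.normal(0.0, np.sqrt(mass), size=image.shape)
    current_h = potential(image) + momentum @ momentum / (2.0 * mass)

    # (ii) Run L leapfrog steps from the current state.
    new_image, new_momentum = image.copy(), momentum.copy()
    for _ in range(n_leapfrog):
        new_momentum -= 0.5 * step_size * grad_potential(new_image)
        new_image += step_size * new_momentum / mass
        new_momentum -= 0.5 * step_size * grad_potential(new_image)
    proposed_h = potential(new_image) + new_momentum @ new_momentum / (2.0 * mass)

    # (iii) Metropolis correction: accept with probability min(1, exp(-H* + H)).
    if rng.random() < np.exp(min(0.0, current_h - proposed_h)):
        return new_image, True
    return image, False
```

For example, with `potential = lambda v: 0.5 * v @ v` and `grad_potential = lambda v: v` the chain targets a standard Gaussian, which gives a simple setting in which to try different values of $L$, $\epsilon $, and $m$ before applying the sampler to a FRAME potential.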

Maximum Entropy Justification

The inhomogeneous FRAME model can be justified by the maximum entropy principle. Suppose the true distribution that generates the observed images $\{\mathbf{I}_m\}$ is $f(\mathbf{I})$. Let $\lambda ^{\star }$ solve the population version of the maximum likelihood equation:

$$\begin{aligned} \mathrm{E}_{p(\mathbf{I}; \lambda )}\big [|\langle \mathbf{I}, B_{x, s, \alpha }\rangle |\big ] = \mathrm{E}_{f}\big [|\langle \mathbf{I}, B_{x, s, \alpha }\rangle |\big ], \quad \forall x, s, \alpha . \end{aligned}$$

(38)

Let $\varOmega $ be the set of all the distributions $p(\mathbf{I})$ such that

$$\begin{aligned} \mathrm{E}_{p}\big [|\langle \mathbf{I}, B_{x, s, \alpha }\rangle |\big ] = \mathrm{E}_{f}\big [|\langle \mathbf{I}, B_{x, s, \alpha }\rangle |\big ], \quad \forall x, s, \alpha . \end{aligned}$$

(39)

Then $f \in \varOmega $. Let $\Lambda $ be the set of all the distributions $\{p_{\lambda }, \forall \lambda \}$, where $p_{\lambda }(\mathbf{I}) = p(\mathbf{I}; \lambda )$. Then $q \in \Lambda $ since $q(\mathbf{I}) = p(\mathbf{I}; \lambda = 0)$. Thus $p_{\lambda ^\star }$ is the intersection between $\Lambda $ and $\varOmega $. In Fig. 17, $\Lambda $ and $\varOmega $ are illustrated by blue and green curves respectively, where each point on the curves is a probability distribution. The two curves $\Lambda $ and $\varOmega $ are “orthogonal” in the sense that for any $p_{\lambda } \in \Lambda $ and for any $p \in \varOmega $, it can be easily proved that the Pythagorean property

$$\begin{aligned} \mathrm{KL}\big (p || p_{\lambda }\big ) = \mathrm{KL}\big (p || p_{\lambda ^{\star }}\big ) + \mathrm{KL}\big (p_{\lambda ^{\star }}||p_{\lambda }\big ) \end{aligned}$$

(40)

holds (Pietra et al. 1997), where $\mathrm{KL}(p||q)$ is the Kullback-Leibler divergence from $p$ to $q$. This Pythagorean property leads to the following dual properties of $p_{\lambda ^{\star }}$:

(1) Maximum likelihood: Among all $p_{\lambda } \in \Lambda $, $p_{\lambda ^{\star }}$ achieves the minimum of $\mathrm{KL}(f||p_{\lambda })$.
(2) Maximum entropy or minimum divergence: Among all $p \in {\varOmega }$, $p_{\lambda ^{\star }}$ achieves the minimum of $\mathrm{KL}(p||q)$. Thus $p_{\lambda ^{\star }}$ can be considered the minimal modification of the reference distribution $q$ to match the statistical properties of the true distribution $f$.

The above justification is also true for the sparse FRAME model.
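The Pythagorean property (40) is stated above without proof. A short derivation sketch follows, under the assumption (consistent with (33) and (41), but not restated in this appendix) that every member of $\Lambda $ has the exponential-family form $p_{\lambda }(\mathbf{I}) = \frac{1}{Z(\lambda )} \exp \big (\sum _{x, s, \alpha } \lambda _{x, s, \alpha } |\langle \mathbf{I}, B_{x, s, \alpha }\rangle |\big ) q(\mathbf{I})$. For any $p \in \varOmega $ and any $p_{\lambda } \in \Lambda $,

$$\begin{aligned} \mathrm{KL}\big (p || p_{\lambda }\big ) - \mathrm{KL}\big (p || p_{\lambda ^{\star }}\big )&= \mathrm{E}_{p}\left[ \log \frac{p_{\lambda ^{\star }}(\mathbf{I})}{p_{\lambda }(\mathbf{I})}\right] \\&= \sum _{x, s, \alpha } \big (\lambda ^{\star }_{x, s, \alpha } - \lambda _{x, s, \alpha }\big ) \mathrm{E}_{p}\big [|\langle \mathbf{I}, B_{x, s, \alpha }\rangle |\big ] - \log Z(\lambda ^{\star }) + \log Z(\lambda ) \\&= \sum _{x, s, \alpha } \big (\lambda ^{\star }_{x, s, \alpha } - \lambda _{x, s, \alpha }\big ) \mathrm{E}_{p_{\lambda ^{\star }}}\big [|\langle \mathbf{I}, B_{x, s, \alpha }\rangle |\big ] - \log Z(\lambda ^{\star }) + \log Z(\lambda ) \\&= \mathrm{KL}\big (p_{\lambda ^{\star }}||p_{\lambda }\big ). \end{aligned}$$

The third line uses the fact that both $p$ and $p_{\lambda ^{\star }}$ belong to $\varOmega $, so by (38) and (39) they share the same expected absolute responses. Setting $p = f$ gives property (1), and setting $\lambda = 0$ (so that $p_{\lambda } = q$) gives property (2).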

For sparsification, in principle, we can select $B_{x_i, s_i, \alpha _i}$ sequentially using a procedure like projection pursuit (Friedman 1987) or filter pursuit (Zhu et al. 1998). Suppose we have selected $k$ basis functions $(B_{x_i, s_i, \alpha _i}, i = 1, \ldots , k)$, and let $p_k$ be the fitted model with the corresponding $\lambda = (\lambda _i, i = 1, \ldots , k)$ estimated by MLE. Suppose we are to select the next basis function $B_{x_{k+1}, s_{k+1}, \alpha _{k+1}}$. Let $p_{k+1}$ be the fitted model. Then we want to minimize $\mathrm{KL}(f||p_{k+1}) = \mathrm{KL}(f||p_{k}) - \mathrm{KL}(p_{k+1}||p_k)$, that is, we want to maximize $\mathrm{KL}(p_{k+1}||p_k)$, which serves as the pursuit index. The problem with such a procedure is that each time we need to fit $p_k$ which involves MCMC computation, and the pursuit index is also difficult to compute. So we choose to pursue a different approach by exploring the connection between sparse FRAME and the shared sparse coding.

Sparse FRAME and Shared Sparse Coding

From sparse FRAME to shared sparse coding. Let us assume that the reference distribution $q(\mathbf{I})$ in the sparse FRAME model (15) is a Gaussian white noise model so that the pixel intensities follow $\mathrm{N}(0, \sigma ^2)$ independently. For sparse FRAME, it is natural to assume that the number of selected basis functions $n$ is much smaller than the number of pixels in $\mathbf{I}$, i.e., $n \ll |\mathcal{D}|$, where $\mathcal{D}$ is the image domain. For notational convenience, we can reshape $\mathbf{I}$ and $B_i = B_{x_i, s_i, \alpha _i}$, $i = 1, \ldots , n$ into $|\mathcal{D}|$-dimensional vectors, and let $\mathbf{B}= (B_1, \ldots , B_n)$ be the resulting $|\mathcal{D}| \times n$ matrix.

The connection between sparse FRAME and shared sparse coding is most evident if we temporarily assume that the selected basis functions $(B_i, i = 1, \ldots , n)$ are orthogonal (with unit $\ell _2$ norm as assumed before). Extension to non-orthogonal $\mathbf{B}$ is straightforward but requires tedious notation (such as $(\mathbf{B}^{T}\mathbf{B})^{-1}$). For $\mathbf{B}$, we can construct $\bar{n} = |\mathcal{D}| - n$ basis vectors of unit norm $\bar{B}_1, \ldots , \bar{B}_{\bar{n}}$ that are orthogonal to each other and that are also orthogonal to $(B_i, i = 1, \ldots , n)$. Thus each image $\mathbf{I}= \sum _{i=1}^{n} r_i B_i + \sum _{i=1}^{\bar{n}} \bar{r}_i \bar{B}_i$, where $r_i = \langle \mathbf{I}, B_i \rangle $, and $\bar{r}_i = \langle \mathbf{I}, \bar{B}_i\rangle $. So we have the linear additive model $\mathbf{I}= \sum _{i=1}^{n} r_i B_i + \epsilon $, with $\epsilon = \sum _{i=1}^{\bar{n}} \bar{r}_i \bar{B}_i$ being the least squares residual image.
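The construction of the complementary basis $\bar{\mathbf{B}}$ and the resulting decomposition can be checked numerically. The sketch below is an illustration under assumptions: it uses a random orthonormal $\mathbf{B}$ in place of selected wavelets and obtains $\bar{\mathbf{B}}$ from a complete QR factorization, and the helper name `orthogonal_complement` is hypothetical.

```python
import numpy as np

def orthogonal_complement(basis):
    """Return |D| x (|D| - n) orthonormal columns orthogonal to the columns of basis."""
    n = basis.shape[1]
    q_full, _ = np.linalg.qr(basis, mode="complete")
    # The first n columns of q_full span the same subspace as basis;
    # the remaining columns span its orthogonal complement.
    return q_full[:, n:]

rng = np.random.default_rng(0)
dim, n = 12, 3
basis, _ = np.linalg.qr(rng.standard_normal((dim, n)))  # orthonormal stand-in for B
basis_bar = orthogonal_complement(basis)                # plays the role of B-bar

image = rng.standard_normal(dim)
r = basis.T @ image           # r_i = <I, B_i>
r_bar = basis_bar.T @ image   # r-bar_i = <I, B-bar_i>
residual = basis_bar @ r_bar  # least squares residual image
reconstruction = basis @ r + residual
```

Here `reconstruction` agrees with `image` up to rounding, and `basis.T @ residual` is numerically zero, which is the sense in which the residual is the least squares residual.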

Under the Gaussian white noise $q(\mathbf{I})$, $r_i$ and $\bar{r}_i$ are all independent $\mathrm{N}(0, \sigma ^2)$ random variables because of the orthogonality of $(\mathbf{B}, \bar{\mathbf{B}})$. Let $R$ be the column vector whose elements are $r_i$, and $\bar{R}$ be the column vector whose elements are $\bar{r}_i$. Then under the sparse FRAME model (15), only the distribution of $R$ is modified during the change from $q(\mathbf{I})$ to $p(\mathbf{I}; \mathbf{B}, \lambda )$, which changes the distribution of $R$ from Gaussian white noise $q(R)$ to

$$\begin{aligned} p(R; \lambda ) = \frac{1}{Z(\lambda )}\exp \left( \sum _{i=1}^{n} \lambda _i |r_i|\right) q(R), \end{aligned}$$

(41)

while the distribution of the residual coordinates $\bar{R}$ remains Gaussian white noise, and $R$ and $\bar{R}$ remain independent. That is, $p(R, \bar{R}; \lambda ) = p(R; \lambda ) q(\bar{R}) $.
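In this orthogonal special case, (41) factorizes over the components of $R$, because $q(R)$ is a product of $\mathrm{N}(0, \sigma ^2)$ densities; each $r_i$ then has a one-dimensional density proportional to $\exp (\lambda _i |r_i| - r_i^2/(2\sigma ^2))$. The sketch below is an added illustration, not part of the original derivation: it checks numerically that the per-component normalizing constant equals $2\exp (\lambda ^2\sigma ^2/2)\,\Phi (\lambda \sigma )$, where $\Phi $ is the standard normal distribution function. The function names and grid settings are hypothetical.

```python
import math
import numpy as np

def normalizer_closed_form(lam, sigma):
    """Z_i = E_q[exp(lam * |r|)] for r ~ N(0, sigma^2), in closed form."""
    std_normal_cdf = 0.5 * (1.0 + math.erf(lam * sigma / math.sqrt(2.0)))
    return 2.0 * math.exp(0.5 * (lam * sigma) ** 2) * std_normal_cdf

def normalizer_numeric(lam, sigma, half_width=12.0, n_points=240001):
    """Riemann-sum approximation of the same integral on a fine grid."""
    r = np.linspace(-half_width * sigma, half_width * sigma, n_points)
    gaussian = np.exp(-0.5 * (r / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
    integrand = np.exp(lam * np.abs(r)) * gaussian
    return float(np.sum(integrand) * (r[1] - r[0]))
```

With $\lambda = 0$ both functions return 1, recovering the reference distribution $q$. This factorization relies on orthogonality and does not carry over to the non-orthogonal setting.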

Thus the sparse FRAME model implies a linear additive model $\mathbf{I}= \sum _{i=1}^{n} r_i B_i + \epsilon $, where $R \sim p(R; \lambda )$ and $\epsilon $ is a Gaussian white noise in the $\bar{n}$-dimensional residual space, and $\epsilon $ is independent of $R$. If we observe independent training images $\{\mathbf{I}_m, m = 1, \ldots , M\}$ from the model, then $\mathbf{I}_m = \sum _{i=1}^{n} r_{m, i} B_i + \epsilon _m$, i.e., $\{\mathbf{I}_m\}$ share a common set of basis functions $\mathbf{B}= (B_i, i = 1, \ldots , n)$ that provide sparse coding for multiple images simultaneously.

From shared sparse coding to sparse FRAME. Conversely, suppose we are given a shared sparse coding model of the form $\mathbf{I}=\sum _{i=1}^{n} c_i B_i + \epsilon = \mathbf{B}C + \epsilon $, where $C$ is a column vector whose components are $c_i$. Assume $C \sim p(C)$ and $\epsilon \sim \mathrm{N}(0, I \sigma ^2)$, where $I$ is the $|\mathcal{D}|$-dimensional identity matrix, and $\epsilon $ and $C$ are independent. Let $\delta = \mathbf{B}^T \epsilon $, whose components $\delta _i = \langle \epsilon , B_i\rangle \sim \mathrm{N}(0, \sigma ^2)$ are independent. Then we can write $\mathbf{I}= \mathbf{B}R + \bar{\mathbf{B}}\bar{R}$, where $R = C + \delta $, and $\bar{\epsilon } = \bar{\mathbf{B}}\bar{R}$ is the projection of $\epsilon $ onto the space of $\bar{\mathbf{B}}$. Let $\tilde{p}(R)$ be the density of $R = C+ \delta $, which is obtained by convolving $p(C)$ with the Gaussian white noise density. Then $p(\mathbf{I}) = \tilde{p}(R) q(\bar{R}) = q(\mathbf{I}) \tilde{p}(R) /q(R)$ since $q(\mathbf{I}) = q(R)q(\bar{R})$ under the Gaussian white noise model ($d\mathbf{I}= dR d\bar{R}$ under orthogonality so there is no Jacobian term). If we choose to model $\tilde{p}(R)/q(R) = \exp \left( \sum _{i=1}^{n} \lambda _i |r_i|\right) /Z(\lambda )$, we arrive at the sparse FRAME model.

Selection of basis functions. For orthogonal $\mathbf{B}$, as shown above, the probability density is $p(\mathbf{I}; \mathbf{B}, \lambda ) = q(\bar{R}) p(R; \lambda ) = q(\bar{R}) q(R) \exp \left( \sum _{i=1}^{n} \lambda _i |r_i|\right) /Z(\lambda )$. Given a set of training images $\{\mathbf{I}_m, m = 1, \ldots , M\}$, and for a candidate set of basis functions $\mathbf{B}= (B_i, i= 1, \ldots , n)$, we can estimate $\lambda = (\lambda _i, i = 1, \ldots , n)$ by MLE, giving us $\lambda ^{\star }$, and the resulting log-likelihood is

$$\begin{aligned}&\sum _{m=1}^{M} \log p\big (\mathbf{I}_m; \mathbf{B}, \lambda ^{\star }\big ) \nonumber \\&\quad = \sum _{m=1}^{M} \left[ \log q\big (\bar{R}_m\big ) + \log p\big (R_m; \lambda ^{\star }\big )\right] \end{aligned}$$

(42)

$$\begin{aligned}&=- \frac{1}{2\sigma ^2} \sum _{m=1}^{M} ||\mathbf{I}_m - \mathbf{B}R_m||^2 - \frac{M \bar{n}}{2} \log \big (2\pi \sigma ^2\big )\end{aligned}$$

(43)

$$\begin{aligned}&\quad + \sum _{m=1}^{M} \log p\big (R_m; \lambda ^{\star }\big ). \end{aligned}$$

(44)

Suppose we are to choose a $\mathbf{B}$ from a collection of candidates. Ideally we should maximize the sum of (43) and (44). We may interpret (43) to be the negative coding length of the residual image $\epsilon $ by the Gaussian white noise model, and interpret (44) to be the negative coding length of the coefficients $R_m$ by the fitted model $p(R; \lambda ^{\star })$. If $\sigma ^2$ is small, (43) can be more important, while the coding length of $R_m$ for different $\mathbf{B}$ may not differ too much in comparison. So we choose to seek a $\mathbf{B}$ to maximize only (43) or equivalently minimize the overall reconstruction error $\sum _{m=1}^{M} ||\mathbf{I}_m - \mathbf{B}R_m||^2$. This reflects a two-stage strategy in modeling $\{\mathbf{I}_m\}$. First, we find a set of basis functions $\mathbf{B}$ to reconstruct $\{\mathbf{I}_m\}$ as accurately as possible. Then we fit a statistical model for the reconstruction coefficients.
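The first stage, choosing a common $\mathbf{B}$ that keeps the total reconstruction error small, can be illustrated by a greedy simultaneous pursuit. The following is a minimal sketch in the spirit of simultaneous orthogonal matching pursuit (Tropp et al. 2006); it is not the authors' shared matching pursuit algorithm, whose details are not given in this appendix. The function name `shared_pursuit_sketch`, the dense `dictionary` matrix with unit-norm columns, and the least squares refit are all assumptions made for illustration.

```python
import numpy as np

def shared_pursuit_sketch(images, dictionary, n_select):
    """Greedily pick n_select dictionary columns shared by all images.

    images: (|D|, M) array, one training image per column.
    dictionary: (|D|, K) array of candidate basis functions with unit norm.
    Returns the selected column indices and the (n_select, M) coefficients.
    """
    residuals = images.copy()
    selected = []
    coeffs = np.zeros((0, images.shape[1]))
    for _ in range(n_select):
        # Score each candidate by its total squared response over all residuals.
        scores = np.sum((dictionary.T @ residuals) ** 2, axis=1)
        if selected:
            scores[selected] = -np.inf  # never pick the same column twice
        selected.append(int(np.argmax(scores)))
        # Refit all images on the selected columns by least squares.
        basis = dictionary[:, selected]
        coeffs, *_ = np.linalg.lstsq(basis, images, rcond=None)
        residuals = images - basis @ coeffs
    return selected, coeffs
```

The second stage would then fit $\lambda $ for the selected columns, for example with an HMC-based estimate of the model expectations as described in the first appendix.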

Non-orthogonality. Even if $\mathbf{B}$ is not orthogonal, which is the case in our work, the connection between the sparse FRAME and shared sparse coding still holds. The responses are still $R = \mathbf{B}^{T} \mathbf{I}$, but the reconstruction coefficients become $C = (\mathbf{B}^{T}\mathbf{B})^{-1}R$. The projection of $\mathbf{I}$ onto the subspace spanned by $\mathbf{B}$ is $\mathbf{B}C$. We can continue to assume the implicit $\bar{\mathbf{B}}= (\bar{B}_i, i = 1, \ldots , \bar{n})$ to be orthonormal, and that they are orthogonal to the columns of $\mathbf{B}$. We can also continue to let $\bar{R} = \bar{\mathbf{B}}^{T}\mathbf{I}$. In this setting, $R$ and $\bar{R}$ are still independent under the Gaussian white noise model $q(\mathbf{I})$ because $\mathbf{B}$ and $\bar{\mathbf{B}}$ are still orthogonal to each other. Under the sparse FRAME model (15), it is still the case that only the distribution of $R$ is modified during the change from $q(\mathbf{I})$ to $p(\mathbf{I}; \mathbf{B}, \lambda )$, while the distribution of $\bar{R}$ remains white noise and is independent of $R$. The distribution of $R$ implies a distribution of the reconstruction coefficients $C$ because they are linked by a linear transformation. In fact, the distribution of $C$ is:

$$\begin{aligned} p_C(C; \lambda ) = \frac{1}{Z(\lambda )} \exp \big (\langle \lambda , |\mathbf{B}^{T}\mathbf{B}C|\rangle \big ) q_C(C), \end{aligned}$$

(45)

where $q_C(C)$ is the distribution of $C$ under the reference distribution $q(\mathbf{I})$, and for a vector $u$, $|u|$ means the vector obtained by taking the absolute values of $u$ component-wise. Now the distributions of $R$ and $C$ involve the Jacobian terms such that $dR d\bar{R} = |\mathrm{det}(\mathbf{B}^{T}\mathbf{B})|^{1/2} d\mathbf{I}= |\mathrm{det}(\mathbf{B}^{T}\mathbf{B})| dC d\bar{R}$. In fact $p(\mathbf{I}; \mathbf{B}, \lambda ) = p_C(C; \lambda )q_{\bar{R}}(\bar{R}) |\det (\mathbf{B}^{T}\mathbf{B})|^{-1/2}$. By the same logic as in (43) and (44), we still want to find $\mathbf{B}$ to minimize the overall reconstruction error $\sum _{m=1}^{M}\Vert \mathbf{I}_m - \mathbf{B}C_m\Vert ^2$.
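The relation between the responses $R$ and the reconstruction coefficients $C$ for a non-orthogonal $\mathbf{B}$ can be illustrated as follows. This is a small sketch assuming a dense matrix `basis` with linearly independent unit-norm columns; the function name is hypothetical.

```python
import numpy as np

def responses_and_coefficients(image, basis):
    """Return R = B^T I and C = (B^T B)^{-1} R for a full-column-rank basis."""
    responses = basis.T @ image
    gram = basis.T @ basis
    # Solve the normal equations instead of forming the inverse explicitly.
    coeffs = np.linalg.solve(gram, responses)
    return responses, coeffs
```

The product `basis @ coeffs` is then the projection of the image onto the span of the selected basis functions, so it coincides with the least squares fit returned by `np.linalg.lstsq(basis, image, rcond=None)`.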

Under the shared sparse coding model, it is tempting to model the coefficients $C$ of the selected basis functions directly. However, $C$ is still a multi-dimensional vector, and direct modeling of $C$ can be difficult. One may assume that the components of $C$ are statistically independent for simplicity, but this assumption is unlikely to be realistic. So after selecting the basis functions, we choose to model the image intensities by the inhomogeneous FRAME model. Even though this model only matches the marginal distributions of filter responses of the selected basis functions, the model does not assume that the responses are independent.

About this article

Cite this article

Xie, J., Hu, W., Zhu, SC. et al. Learning Sparse FRAME Models for Natural Image Patterns. Int J Comput Vis 114, 91–112 (2015). https://doi.org/10.1007/s11263-014-0757-x

Received: 01 February 2014
Accepted: 13 August 2014
Published: 02 October 2014
Issue Date: September 2015
DOI: https://doi.org/10.1007/s11263-014-0757-x
