Video quality assessment based on structural distortion measurement

https://doi.org/10.1016/S0923-5965(03)00076-6

Abstract

Objective image and video quality measures play important roles in a variety of image and video processing applications, such as compression, communication, printing, analysis, registration, restoration, enhancement and watermarking. Most quality assessment approaches proposed in the literature are error sensitivity-based methods. In this paper, we follow a new philosophy in designing image and video quality metrics, which uses structural distortion as an estimate of perceived visual distortion. A computationally efficient approach is developed for full-reference (FR) video quality assessment. The algorithm is tested on the Video Quality Experts Group (VQEG) Phase I FR-TV test data set.

Introduction

There has been an increasing need recently to develop objective quality measurement techniques that can predict perceived image and video quality automatically. These methods are useful in a variety of image and video processing applications, such as compression, communication, printing, displaying, analysis, registration, restoration, enhancement and watermarking. Generally speaking, these methods can be employed in three ways. First, they can be used to monitor image/video quality for quality control systems. Second, they can be employed to benchmark image/video processing systems and algorithms. Third, they can also be embedded into image/video processing systems to optimize algorithms and parameter settings.

Currently, the most commonly used full-reference (FR) objective image and video distortion/quality metrics are mean squared error (MSE) and peak signal-to-noise ratio (PSNR). MSE and PSNR are widely used because they are simple to calculate, have clear physical meanings, and are mathematically convenient for optimization purposes. However, they have also been widely criticized for correlating poorly with perceived visual quality [8], [10], [19], [22], [23], [25], [26], [28]. In the last three decades, a great deal of effort has been made to develop objective image and video quality assessment methods that incorporate perceptual quality measures by considering human visual system (HVS) characteristics. Some of the resulting models are commercially available. The video quality experts group (VQEG) was formed to develop, validate and standardize new objective measurement methods for video quality. Although the Phase I test [4], [20] for FR television video quality assessment achieved only limited success, VQEG continues its work with a Phase II test for FR quality assessment for television, and with reduced-reference (RR) and no-reference (NR) quality assessment for television and multimedia.
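Both baseline metrics are straightforward to compute. The following is a minimal sketch; the peak value of 255 assumes 8-bit imagery:

```python
import numpy as np

def mse(ref, dist):
    """Mean squared error between a reference and a distorted signal."""
    ref = np.asarray(ref, dtype=np.float64)
    dist = np.asarray(dist, dtype=np.float64)
    return np.mean((ref - dist) ** 2)

def psnr(ref, dist, max_val=255.0):
    """Peak signal-to-noise ratio in dB; max_val is the signal's peak
    (255 for 8-bit images). Returns infinity for identical signals."""
    err = mse(ref, dist)
    if err == 0:
        return np.inf
    return 10.0 * np.log10(max_val ** 2 / err)
```

Because both reduce to pointwise differencing, two distorted images with very different perceptual quality can share the same MSE, which is the root of the criticism above.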

It is worth noting that many of the objective image/video quality assessment approaches proposed in the literature share a common error sensitivity-based philosophy [22], [26], [28], which is motivated by psychophysical and physiological vision research. The basic principle is to regard the distorted signal under evaluation as the sum of a perfect-quality reference signal and an error signal. The task of perceptual image quality assessment is then to evaluate how strongly the error signal is perceived by the HVS, according to the characteristics of human visual error sensitivity.

A general framework following the error sensitivity-based philosophy is shown in Fig. 1 [28]. First, the original and distorted image/video signals are subjected to preprocessing procedures, possibly including alignment, color space transformations, calibration for display devices, point spread function (PSF) filtering that simulates the eye optics, and light adaptation. Next, a contrast sensitivity function (CSF) filtering procedure may be applied, where the CSF models the variation in the sensitivity of the HVS to different spatial and temporal frequencies. The CSF feature may be implemented before the channel decomposition module using linear filters that approximate the frequency responses of the CSF, or it may be treated as a normalization factor between channels after channel decomposition. The channel decomposition process transforms the signals into subbands selective for spatial and temporal frequency as well as orientation. A number of channel decomposition methods that attempt to model the neuron responses in the primary visual cortex have been used [5], [11], [13], [18], [19], [29]. Some quality assessment metrics use much simpler transforms, such as the discrete cosine transform (DCT) [30], [32] and separable wavelet transforms [1], [12], [34], and still achieve comparable results. Channel decompositions tuned to various temporal frequencies have also been reported [2], [35]. The errors calculated in each channel are adjusted according to the “base sensitivity” of that channel (related to the CSF) as a normalization process. They are also adjusted by a spatially varying masking factor, which accounts for the fact that the presence of one image component reduces the visibility of another, spatially and temporally proximate, image component. Both intra- and inter-channel masking effects may be considered.
Finally, the error pooling module combines the error signals in different channels into a single distortion/quality value, where most quality assessment methods take the form of the Minkowski error metric [15]. The overall framework covers MSE as the simplest special case (with identity preprocessing, no CSF filtering, identity transform, constant error adjustment and L2 Minkowski error pooling). A perceptual image quality metric may implement one or more perceptually meaningful components of the system. Reviews on perceptual error sensitivity-based models can be found in [6], [15], [28].
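The pooling step can be sketched as follows. With exponent β=2 applied to raw pixel differences it reduces to the root of the MSE, which illustrates how the framework covers MSE as a special case; the per-channel normalization and masking adjustments are assumed to have been applied already:

```python
import numpy as np

def minkowski_pool(errors, beta=2.0):
    """Pool an array of (already normalized) channel errors into one
    scalar distortion score: E = (mean(|e|**beta))**(1/beta).
    beta = 2 gives the RMSE; larger beta weights the largest errors
    more heavily, approaching the maximum error as beta grows."""
    e = np.abs(np.asarray(errors, dtype=np.float64))
    return np.mean(e ** beta) ** (1.0 / beta)
```

Note that the pooled score depends only on the magnitudes of the individual errors, not on how they are arranged, which is one motivation for the structural approach discussed later.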

The visual error sensitivity-based algorithms attempt to predict perceived errors by simulating the quality-related functional components of the HVS. However, the HVS is an extremely complicated system, and our current understanding of it is limited. Several issues that are critical for justifying the general framework are still under investigation [28].

The “suprathreshold problem” is one issue that is not well understood: most psychophysical subjective experiments are conducted near the threshold of visibility. The measured threshold values are then used to define visual error sensitivity models, such as the CSF and the various masking effect models. However, current psychophysical studies are not sufficient to determine whether such near-threshold models generalize to perceptual distortions significantly larger than the threshold levels, as is the case in the majority of image processing situations. Another open question is whether, when the errors are much larger than the thresholds, the relative errors between different channels can still be normalized using the visibility thresholds. Recent efforts have been made to incorporate suprathreshold psychophysics research into the analysis of image distortions (e.g., [3], [9], [16], [33], [36]). It remains to be seen how much these efforts can improve the performance of current quality assessment algorithms.

The “natural image complexity problem” is another important issue: the CSF and various masking models are established based on psychovisual experiments conducted using one or a few relatively simple patterns, such as bars, sinusoidal gratings and Gabor patches. But all such patterns are much simpler than real world images, which can be thought of as a superposition of a much larger number of simple patterns. Can we generalize the model for the interactions between a few simple patterns to model the interactions between tens or hundreds of patterns? Are these simple-pattern experiments sufficient to build a model that can predict the quality of complex-structured natural images? Although the answers to these questions are currently not known, the recently established Modelfest dataset [31] includes both simple and complicated patterns, and should facilitate future studies.

Motivated by a substantially different philosophy, a simple structural distortion-based method was proposed for still image quality assessment in [22], [23], [25]. In this paper, an improved version of the algorithm is employed for video quality assessment. In Section 2, the general philosophy and a specific implementation of the structural distortion-based method are presented. A new video quality assessment system is introduced in Section 3. The system is tested on the VQEG Phase I FR-TV video dataset. Finally, Section 4 draws conclusions and provides further discussions.

Section snippets

The new philosophy

Natural image signals are highly structured. By “structured signal”, we mean that the samples of the signals have strong dependencies between each other, especially when they are close in space. However, the Minkowski error pooling formula used in the error-sensitivity-based method is in the form of pointwise signal differencing, which is independent of the signal structure. Furthermore, decomposing the signals using linear transformations still cannot remove the strong dependencies between the

Video quality assessment

In [27], a hybrid video quality assessment method was developed, where the proposed quality indexing approach (with C1=C2=0) was combined with blocking and blurring measures as well as a texture classification algorithm. In this paper, we attempt to use a much simpler method, which employs the SSIM index as a single measure for various types of distortions.
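As a sketch of the idea, the SSIM index for a pair of signal patches can be computed from their means, variances and covariance. The version below uses global statistics over a single window rather than a sliding-window implementation, and its default stabilizing constants assume an 8-bit dynamic range (C1 = (0.01·255)², C2 = (0.03·255)²); setting C1 = C2 = 0 recovers the quality index combined with other measures in [27]:

```python
import numpy as np

def ssim_index(x, y, C1=6.5025, C2=58.5225):
    """Single-window SSIM from global statistics (an illustrative
    sketch, not the full sliding-window system). Defaults assume
    dynamic range L = 255 with K1 = 0.01, K2 = 0.03, i.e.
    C1 = (K1*L)**2 and C2 = (K2*L)**2."""
    x = np.asarray(x, dtype=np.float64).ravel()
    y = np.asarray(y, dtype=np.float64).ravel()
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    num = (2.0 * mu_x * mu_y + C1) * (2.0 * cov_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return num / den
```

For identical inputs the index is exactly 1, and it decreases as the luminance, contrast or structural agreement between the two patches degrades, which is what allows a single measure to respond to many distortion types.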

Conclusions and discussions

We designed a new objective video quality assessment system. The key feature of the proposed method is the use of structural distortion, instead of error sensitivity-based measurement, for quality evaluation. Experiments on the VQEG FR-TV Phase I test data set show that it correlates well with perceived video quality.

One of the most attractive features of the proposed method is perhaps its simplicity. Note that no complicated procedures (such as spatial and temporal filtering, linear

Acknowledgements

The authors would like to thank Dr. Eero Simoncelli, Mr. Hamid Sheikh and Dr. Jesus Malo for helpful discussions, and thank Dr. Philip Corriveau and Dr. John Libert for providing the Matlab routines used in VQEG FR-TV Phase I test for the regression analysis of subjective/objective data comparison.

References (36)

  • M.P. Eckert et al.

    Perceptual quality metrics applied to still image compression

    Signal Processing

    (November 1998)
  • Y.K. Lai et al.

    A Haar wavelet approach to compressed image quality measurement

    J. Visual Comm. Image Representat.

    (March 2000)
  • A.B. Watson

The cortex transform: rapid computation of simulated neural images

    Comput. Vision Graphics Image Process.

    (1987)
  • A.P. Bradley

    A wavelet visible difference predictor

    IEEE Trans. Image Process.

    (May 1999)
  • C.J. van den Branden Lambrecht, O. Verscheure, Perceptual quality measure using a spatio-temporal model of the human...
  • D.M. Chandler, S.S. Hemami, Additivity models for suprathreshold distortion in quantized wavelet-coded images, in:...
  • P. Corriveau

Video quality experts group: current results and future directions

    Proc. SPIE Visual Comm. Image Process.

    (June 2000)
  • S. Daly

    The visible difference predictor: an algorithm for the assessment of image fidelity

  • I. Epifanio et al.

    Linear transform for simultaneous diagonalization of covariance and perceptual metric matrix in image coding

    Pattern Recognition

    (August 2003)
  • A.M. Eskicioglu et al.

    Image quality measures and their performance

    IEEE Trans. Comm.

    (December 1995)
  • D.R. Fuhrmann et al.

    Experimental evaluation of psychophysical distortion metrics for JPEG-encoded images

    J. Electron. Imaging

    (October 1995)
  • B. Girod

What's wrong with mean-squared error?

  • D.J. Heeger, P.C. Teo, A model of perceptual image fidelity, in: Proceedings of the IEEE International Conference on...
  • J. Lubin

    The use of psychophysical data and models in the analysis of display system performance

  • J. Malo, R. Navarro, I. Epifanio, F. Ferri, J.M. Artigas, Non-linear Invertible Representation for Joint Statistical...
  • T.N. Pappas et al.

    Perceptual criteria for image quality evaluation

  • J.G. Ramos et al.

Suprathreshold wavelet coefficient quantization in complex stimuli: psychophysical evaluation and analysis

    J. Opt. Soc. Am. A

    (2001)
  • H.R. Sheikh, Z. Wang, A.C. Bovik, L.K. Cormack, Image and video quality assessment research at LIVE,...