ViS3: an algorithm for video quality assessment via analysis of spatial and spatiotemporal slices

4 February 2014
Abstract
Algorithms for video quality assessment (VQA) aim to estimate the qualities of videos in a manner that agrees with human judgments of quality. Modern VQA algorithms often estimate video quality by comparing localized space-time regions or groups of frames from the reference and distorted videos, using comparisons based on visual features, statistics, and/or perceptual models. We present a VQA algorithm that estimates quality via separate estimates of perceived degradation due to (1) spatial distortion and (2) joint spatial and temporal distortion. The first stage of the algorithm estimates perceived quality degradation due to spatial distortion; this stage operates by adaptively applying to groups of spatial video frames the two strategies from the most apparent distortion algorithm with an extension to account for temporal masking. The second stage of the algorithm estimates perceived quality degradation due to joint spatial and temporal distortion; this stage operates by measuring the dissimilarity between the reference and distorted videos represented in terms of two-dimensional spatiotemporal slices. Finally, the estimates obtained from the two stages are combined to yield an overall estimate of perceived quality degradation. Testing on various video-quality databases demonstrates that our algorithm performs well in predicting video quality and is competitive with current state-of-the-art VQA algorithms.

1. Introduction

The ability to quantify the visual quality of an image or video is a crucial step for any system that processes digital media. Algorithms for image quality assessment (IQA) and video quality assessment (VQA) aim to estimate the quality of a distorted image/video in a manner that agrees with the quality judgments reported by human observers. Over the last few decades, numerous IQA algorithms have been developed and shown to perform reasonably well on various image-quality databases. A natural approach to VQA is therefore to apply an existing IQA algorithm to each frame of the video and to pool the per-frame results across time. This approach is intuitive, easily implemented, and computationally efficient. However, such frame-by-frame IQA often fails to correlate with subjective ratings of quality.1,2

1.1. General Approaches to VQA

One reason frame-by-frame IQA performs less well for VQA is that it ignores temporal information, which is important for video quality due to temporal effects such as temporal masking and motion perception.3,4 Many researchers have incorporated temporal information into their VQA algorithms by supplementing frame-by-frame IQA with a model of temporal masking and/or temporal weighting.5–8 For example, in Refs. 6 and 7, motion weighting and temporal derivatives have been used to extend structural similarity (SSIM)9 and visual information fidelity (VIF)10 for VQA.

Modern VQA algorithms often estimate video quality by extracting and comparing visual/quality features from localized space-time regions or groups of video frames. For example, in Refs. 11 and 12, video quality is estimated based on spatial gradients, color information, and the interaction of contrast and motion within spatiotemporal blocks; motion-based temporal pooling is employed to yield the quality estimate. In Ref. 4, video quality is estimated via measures of spatial, temporal, and spatiotemporal quality computed for groups of video frames by using a three-dimensional (3-D) Gabor filter-bank; the spatial and temporal components are combined into an overall estimate of quality. In Ref. 13, spatial edge features and motion characteristics in localized space-time regions are used to estimate quality.

Furthermore, it is known that the subjective assessment of video quality is time-varying,14 and this temporal variation can strongly influence the overall quality ratings.15,16 Models of VQA that consider these effects have been proposed in Refs. 16 to 19. For example, in Ref. 19, Ninassi et al. measured temporal variations of spatial visual distortions via short-term pooling over groups of frames through a mechanism of visual attention; the global video quality score is then estimated via long-term pooling. In Ref. 16, Seshadrinathan et al. proposed a hysteresis temporal pooling model of spatial quality values by studying the relation between time-varying quality scores and the final quality score assigned by human subjects.

1.2. Different Approach for VQA: Analysis of Spatiotemporal Slices

Traditional analyses of temporal variation in VQA tend to formulate methods to compute spatial distortion of a standalone frame,5,7 of local space-time regions,12,13 or of groups of adjacent frames4,19 and then measure the changes of spatial distortion over time. An alternative approach, which is the technique we adopt in this paper, is to use spatiotemporal slices (as illustrated in Fig. 1), which allows one to analyze longer temporal variations.20,21 In the context of general motion analysis, Ngo et al.21 stated that analyzing the visual patterns of spatiotemporal slices could characterize the changes of motion over time and describe the motion trajectories of different moving objects. Inspired by this result, in this paper, we present an algorithm that estimates quality based on the differences between the spatiotemporal slices of the reference and distorted videos.

Fig. 1

A video can be envisaged as a rectangular cuboid in which two of the sides represent the spatial dimensions (x and y), and the third side represents the time dimension (t). If one takes slices of the cuboid from front-to-back, then the extracted slices correspond to normal video frames. Slicing the cuboid vertically and horizontally yields spatiotemporal slice images (STS images). Examples of three different slice types are presented in part (b) of the figure.


As shown in Fig. 1(a), a video can be envisaged as a rectangular cuboid in which two of the sides represent the spatial dimensions (x and y), and the third side represents the time dimension (t). If one takes slices of the cuboid from front-to-back, then the extracted slices correspond to normal video frames. However, it is also possible to take the slices of the cuboid from other directions (e.g., from left-to-right or top-to-bottom) to extract images that contain spatiotemporal information, hereafter called the STS images. As shown in Fig. 1(b), if the cuboid is sliced vertically (left-to-right or right-to-left), then the extracted slices represent time along one dimension and vertical space along the other dimension, hereafter called the vertical STS images. If the cuboid is sliced horizontally (top-to-bottom or bottom-to-top), then the extracted slices represent time along one dimension and horizontal space along the other dimension, hereafter called the horizontal STS images.
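For concreteness, the slicing can be sketched in a few lines of Python/NumPy; the (T, H, W) array layout, the random test data, and the particular slice indices below are illustrative assumptions rather than part of the ViS3 implementation.

```python
import numpy as np

# Illustrative luminance cuboid I(t, y, x): T frames of size H x W (layout assumed).
T, H, W = 60, 144, 176
video = np.random.rand(T, H, W)

# Front-to-back slice: an ordinary video frame (H x W).
frame = video[10]

# Vertical STS image S_x(t, y): fix a column x; result has shape (T, H).
x = 80
vertical_sts = video[:, :, x]       # time along axis 0, vertical space along axis 1

# Horizontal STS image S_y(x, t): fix a row y; result has shape (W, T).
y = 60
horizontal_sts = video[:, y, :].T   # horizontal space along axis 0, time along axis 1

print(frame.shape, vertical_sts.shape, horizontal_sts.shape)  # (144, 176) (60, 144) (176, 60)
```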

Figure 2 shows examples of STS images from some typical videos. At one extreme, if the video contains no changes across time (e.g., no motion, as in a static video), then the STS images will contain only horizontal lines [see Fig. 2(a)] or only vertical lines [see Fig. 2(b)]. In both Figs. 2(a) and 2(b), the perfect temporal relationship in the video content manifests as perfect spatial relationship along the dimension that corresponds to time in the STS images. At the other extreme, if the video is rapidly changing (e.g., each frame contains vastly different content), the STS images will appear as random patterns. In both Figs. 2(c) and 2(d), the randomness of temporal content in the video manifests as spatially random pixels along the dimension that corresponds to time in the STS images. The STS images for normal videos [Figs. 2(e) and 2(f)] are generally well structured due to the joint spatiotemporal relationship of neighboring pixels and the smooth frame-to-frame transition.

Fig. 2

Demonstrative STS images extracted from a static video [(a) and (b)], from a video with a vastly different content for each frame [(c) and (d)], and from a typical normal natural video [(e) and (f)]. The STS images for the atypical videos in (a) to (d) appear similar to textures, whereas the STS images for normal videos are generally smoother and more structured due to the joint spatial and temporal (spatiotemporal) relationship.


The STS images have been effectively used in a model of human visual-motion sensing,22 in energy models of motion perception,23 and in video motion analysis.20,21 Here, we argue that the temporal variation of spatial distortion is exhibited as spatiotemporal dissimilarity in the STS images, and thus, these STS images can also be used to estimate video quality. To illustrate this, Fig. 3 shows sample STS images from a reference video (reference STS image) and from a distorted video (distorted STS image), where some dissimilar regions are clearly visible in the close-ups. As we will demonstrate, by quantifying the spatiotemporal dissimilarity between the reference and distorted STS images, it is possible to estimate video quality.

Fig. 3

Demonstrative STS images extracted from the reference and distorted videos. The close-ups show some dissimilar regions between the STS images.


Figure 4 shows sample STS images from two distorted videos of the LIVE video database24 and the normalized absolute difference images between the reference and distorted STS images. The associated estimates PSNRsts and MADsts are computed by applying peak SNR (PSNR)25 and the most apparent distortion (MAD) algorithm26 to each pair of reference and distorted STS images and averaging the results across all STS images. The higher the PSNRsts value, the better the video quality; the lower the MADsts value, the better the video quality. As seen from Fig. 4, the PSNRsts and MADsts values, obtained by comparing the STS images, show promise for VQA, whereas frame-by-frame MAD fails to predict the qualities of these videos. However, it is important to note that, although PSNR and MAD show promise when applied to the STS images, neither was designed for use with STS images. In particular, PSNR and MAD do not account for the responses of the human visual system (HVS) to temporal changes of spatial distortion. Consequently, PSNRsts and MADsts can yield predictions that correlate poorly with the mean opinion score (MOS)/difference mean opinion score (DMOS). Thus, we propose an alternative method of quantifying degradation of the STS images via a measure of correlation and a model of motion perception.
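To make the PSNRsts construction concrete, the following sketch (assuming 8-bit luminance cuboids in the (T, H, W) layout used earlier) applies PSNR to each pair of vertical STS images and averages the results; MADsts is obtained analogously by substituting the MAD index for PSNR, and the horizontal STS images are handled in the same way.

```python
import numpy as np

def psnr(ref, dst, peak=255.0):
    """Peak SNR in dB between two equally sized images."""
    mse = np.mean((ref.astype(np.float64) - dst.astype(np.float64)) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def psnr_sts_vertical(ref_video, dst_video):
    """Average PSNR over all vertical STS images of two (T, H, W) luminance cuboids."""
    T, H, W = ref_video.shape
    scores = [psnr(ref_video[:, :, x], dst_video[:, :, x]) for x in range(W)]
    return float(np.mean(scores))
```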

Fig. 4

Sample STS images and their absolute difference STS images (relative to the STS images of the reference videos) extracted from videos (a) pa2_25fps.yuv and (b) pa8_25fps.yuv for vertical STS images (upper row) and for horizontal STS images (lower row). The videos are from the LIVE video database.24 The values obtained by applying frame-by-frame most apparent distortion (MAD) on normal (front-to-back) frames are shown for comparison. The PSNRsts and MADsts values, which are computed from the STS images, show promise in estimating video quality. However, neither peak SNR nor MAD account for human visual system responses to temporal changes of spatial distortion, and thus we propose an alternative method of quantifying degradation of the STS images.


1.3. Proposal and Contributions

In this paper, we propose a VQA algorithm that estimates video quality by measuring spatial distortion and spatiotemporal dissimilarity separately. To estimate perceived video quality degradation due to spatial distortion, both the detection-based strategy and the appearance-based strategy of our MAD algorithm are adapted and applied to groups of normal video frames. A simple model of temporal weighting using optical-flow motion estimation is employed to give greater weight to distortions in slow-moving regions.5,18 To estimate spatiotemporal dissimilarity, we extend the models of Watson–Ahumada27 and Adelson–Bergen,23 which have been used to measure the energy of motion in videos, to the STS images and measure the local variance of spatiotemporal neural responses. The spatiotemporal response is measured by filtering the STS image with a one-dimensional (1-D) spatial filter and a 1-D temporal filter.23,27 The overall estimate of perceived video quality degradation is given by a geometric mean of the spatial distortion and spatiotemporal dissimilarity values.

We have named our algorithm ViS3 according to its two main stages: the first stage estimates video quality degradation based on spatial distortion (ViS1), and the second stage estimates video quality degradation based on the dissimilarity between spatiotemporal slice images (ViS2). The final estimate of perceived video quality degradation, ViS3, is a combination of ViS1 and ViS2. The ViS3 algorithm is an improved and extended version of our previous VQA algorithms presented in Refs. 28 and 29. We demonstrate the performance of this algorithm on various video-quality databases and compare it with some recent VQA algorithms. We also analyze the performance of ViS3 on different types of distortion by measuring its performance on each subset of videos.

The major contributions of this paper are as follows. First, we provide a simple yet effective extension of our MAD algorithm for use in VQA. Specifically, we show how to apply MAD’s detection- and appearance-based strategies to groups of video frames and how to modify the combination to take into account temporal masking. This contribution is presented in the first stage of the ViS3 algorithm. Second, we demonstrate that the spatiotemporal dissimilarity exhibited in the STS images can be used to effectively estimate video quality degradation. We specifically provide in the second stage of the ViS3 algorithm a technique to quantify the spatiotemporal dissimilarity by measuring spatiotemporal correlation and by applying an HVS-based model to the STS images. Finally, we demonstrate that a combination of the measurements obtained from these two stages is able to estimate video quality quite accurately.

This paper is organized as follows. In Sec. 2, we provide a brief review of current VQA algorithms. In Sec. 3, we describe details of the ViS3 algorithm. In Sec. 4, we present and compare the results of applying ViS3 to different video databases. General conclusions are presented in Sec. 5.

2. Brief Review of Existing VQA Algorithms

In this section, we provide a brief review of current VQA algorithms. Following the classification specified in Ref. 30, current VQA methods can roughly be divided into four classes: (1) those that employ IQA on a frame-by-frame basis, (2) those that estimate quality based on differences between visual features of the reference and distorted videos, (3) those that estimate quality based on statistical differences between the reference and distorted videos, and (4) those that attempt to model one or more aspects of the HVS.

2.1. Frame-by-Frame IQA

As stated in Sec. 1, the most straightforward technique to estimate video quality is to apply existing IQA algorithms on a frame-by-frame basis. These per-frame quality estimates can then be collapsed across time to predict an overall quality estimate of the video. It is common to find these frame-by-frame IQA algorithms used as a baseline for comparison,24,31 and some authors implement this technique as a part of their VQA algorithms.32,33 However, due to the lack of temporal information, this technique often fails to correlate with the perceived quality measurements obtained from human observers.

2.2. Algorithms Based on Visual Features

An approach commonly used in VQA is to extract spatial and temporal visual features of the videos and then estimate quality based on the changes of these features between the reference and distorted videos.11,12,34–40

One of the earliest approaches to feature-based VQA was proposed by Pessoa et al.34 Their VQA algorithm employs segmentation along with segment-type-specific error measures. Frames of the reference and distorted videos are first segmented into smooth, edge, and texture segments. Various pixel-based and edge-detection-based error measures are then computed between corresponding regions of the reference and distorted videos for both the luminance and chrominance components. The overall estimate of quality is computed via a weighted linear combination of logistic-normalized versions of these error measures, using segment-category-specific weights, collapsed across all segments and all frames.

One of the most popular feature-based VQA algorithms, called the video quality metric (VQM), was developed by Pinson and Wolf.11,12 The VQM algorithm employs quality features that capture spatial, temporal, and color-based differences between the reference and distorted videos. The VQM algorithm consists of four sequential steps. The first step calibrates videos in terms of brightness, contrast, and spatial and temporal shifts. The second step breaks the videos into subregions of space and time, and then extracts a set of quality features for each subregion. The third step compares features extracted from the reference and distorted videos to yield a set of quality indicators. The last step combines these indicators into a video quality index.

Okamoto et al.35 proposed a VQA algorithm that operates based on the distortion of edges in both space and time. Okamoto et al. employ three general features: (1) blurring in edge regions, which is quantified by using the average edge energy difference described in ANSI T1.801.03; (2) blocking artifacts, which are quantified based on the ratio of horizontal and vertical edge distortions to other edge distortions; and (3) the average local motion distortion, which is quantified based on the average difference between block-based motion measures of the reference and distorted frames. The overall video quality is estimated via a weighted average of these three features.

In Ref. 36, Lee and Sim propose a VQA algorithm that operates under the assumption that visual sensitivity is greatest near edges and block boundaries. Accordingly, their algorithm applies both an edge-detection stage and a block-boundary detection stage to frames from the reference video to locate these regions. Separate measures of distortion for the edge regions and block regions are then computed between the reference and distorted frames. These two features are supplemented with a gradient-based distortion measure, and the overall estimate of quality is then obtained via a weighted linear sum of these three features averaged across all frames.

In the context of packet-loss scenarios, Barkowsky et al.37 designed the TetraVQM algorithm by adding a model of temporal distortion awareness to the VQM algorithm. The key idea in TetraVQM is to estimate the temporal visibility of image areas and, therefore, weight the degradations in these areas based on their durations. TetraVQM employs block-based motion estimation to track image objects over time. The resulting motion vectors and motion-prediction errors are then used to estimate the temporal visibility, and this information is used to supplement VQM for estimating the overall quality. In Ref. 39, Engelke et al. demonstrated that significant improvements to VQM and TetraVQM can be realized by augmenting these techniques with information regarding visual saliency.

Various features have also been combined via machine-learning for improved VQA. In Ref. 8, Narwaria et al. proposed the temporal quality variation (TQV) algorithm, a low-complexity VQA algorithm that employs a machine-learning mechanism to determine the impact of the spatial and temporal factors as well as their interactions on the overall video quality. Spatial quality factors are estimated by a singular value decomposition (SVD)-based algorithm,41 and the temporal variation of spatial quality factors is used as a feature to estimate video quality.

2.3. Algorithms Based on Statistical Measurements

Another class of VQA algorithms has been proposed that estimates quality based on differences in statistical features of the reference and distorted videos.5–7

In Ref. 5, Wang et al. proposed the video structural similarity (VSSIM) index. VSSIM computes various SSIM9 indices at three different levels: the local region level, the frame level, and the video sequence level. At the local region level, the SSIM index of each region is computed for the luminance and chrominance components, with greater weight given to the luminance component. These SSIM indices are weighted by local luminance intensity to yield the frame-level SSIM index. Finally, at the sequence level, the frame-level SSIM indices are weighted by global motion to yield an estimate of video quality.

Another extension of SSIM to VQA, called speed SSIM, was also proposed by Wang and Li.6 There, they augmented SSIM9 with an additional stage that employs Stocker and Simoncelli’s statistical model42 of visual speed perception. The speed perception model is used to derive a spatiotemporal importance weight function, which specifies a relative weighting at each spatial location and time instant. The overall estimate of video quality is obtained by using this weight function to compute a weighted average of SSIM over all space and time.

In Ref. 7, Sheikh and Bovik augmented the VIF IQA algorithm10 for use in VQA. VIF estimates quality based on the amount of information that the distorted image provides about the reference image. VIF models images as realizations of a mixture of marginal Gaussian densities of wavelet subbands, and quality is then determined based on the mutual information between the subband coefficients of the reference and distorted images. To account for motion, the resulting video VIF (V-VIF) quantifies the loss in motion information by measuring deviations in the spatiotemporal derivatives of the videos, the latter of which are estimated by using separable bandpass filters in space and time.

Tao and Eskicioglu33 proposed a VQA algorithm that estimates quality based on SVD. Each frame of the reference and distorted videos is divided into 8×8 blocks, and then the SVD is applied to each block. Differences in the SVDs of corresponding blocks of the reference and distorted frames, weighted by the edge strength in each block, are used to generate a frame-level distortion estimate. Luminance and chrominance SVD-based distortions are combined via a weighted sum. These combined frame-level estimates are then averaged across all frames to derive an overall estimate of video quality.

Peng et al. proposed a motion-tuned and attention-guided VQA algorithm based on a space-time statistical texture representation of motion. To construct the space-time texture representation, the reference and distorted videos are filtered via a bank of 3-D Gaussian derivative filters at multiple scales and orientations. Differences in the energies within local regions of the filtered outputs between the reference and distorted videos are then computed along 13 different planes in space-time to define the temporal distortion measure. This temporal distortion measure is then combined with a model of visual saliency and multiscale SSIM43 (averaged across frames) to estimate quality.

2.4. Algorithms Based on Models of Human Vision

Another widely adopted approach to VQA is to estimate video quality via the use of various models of the HVS.4,44–55

One of the earliest VQA algorithms based on a vision model was developed by Lukas and Budrikis.44 Their technique employs a spatiotemporal visual filter that models visual threshold characteristics on uniform backgrounds. To account for nonuniform backgrounds, the model is supplemented with a masking function based on the spatial and temporal activities of the video.

The digital video quality algorithm, developed by Watson et al.,49 also models visual thresholds to estimate video quality. The authors employ the concept of just noticeable differences (JNDs), which are computed via a discrete cosine transform (DCT)-based model of early vision. After sampling, cropping, and color conversion, each 8×8 block of the videos is transformed to DCT coefficients, converted to local contrast, and filtered by a model of the temporal contrast sensitivity function. JNDs are then measured by dividing each DCT coefficient by its respective visual threshold. Contrast masking is estimated based on the differences between successive frames, and the masking-adjusted differences are pooled and mapped to a visual quality estimate.

Other HVS-based approaches to VQA have employed various subband decompositions to model the spatiotemporal response properties of populations of visual neurons, which are assumed to underlie the multichannel nature of the HVS.4,45–47,53,55 These algorithms generally compute simulated neural responses to the reference and distorted videos and then estimate quality based on the extent to which these responses differ.

The moving picture quality metric (MPQM) algorithm, proposed by Basso et al.,45 employs a spatiotemporal multichannel HVS model by using 17 spatial Gabor filters and two temporal filters on the luminance component. After contrast sensitivity and masking adjustments, distortion is measured within each subband and pooled to yield the quality estimate. The color MPQM algorithm, proposed by Lambrecht,46 extends and applies the MPQM algorithm to both luminance and chrominance components with a reduced number of filters for the chrominance components (nine spatial filters and one temporal filter).

The normalization video fidelity metric algorithm, proposed by Lindh and Lambrecht,47 implements a visibility prediction model based on the Teo–Heeger gain-control model.56 Instead of using Gabor filters, the multichannel decomposition is performed by using the steerable pyramid with four scales and four orientations. An excitatory-inhibitory stage and a pooling stage are performed to yield a map of normalized responses. The distortion is measured based on the squared error between normalized response maps generated for the reference and the distorted videos.

Masry et al.53 developed a VQA algorithm that employs a multichannel decomposition and a masking model implemented via a separable wavelet transform. A training step was performed on a set of videos and associated subjective quality scores to obtain the masking parameters. Later, in Ref. 55, Li et al. utilized this algorithm as part of a VQA algorithm that measures and combines detail losses and additive impairments within each frame; optimal parameters were determined by training the algorithm on a subset of the LIVE video database.24

Seshadrinathan and Bovik4 proposed the motion-based video integrity evaluation (MOVIE) algorithm that estimates spatial quality, temporal quality, and spatiotemporal quality via a 3-D subband decomposition. MOVIE decomposes both the reference and distorted videos by using a 3-D Gabor filter-bank with 105 spatiotemporal subbands. The spatial component of MOVIE uses the outputs of the spatiotemporal Gabor filters and a model of contrast masking to capture spatial distortion. The temporal component of MOVIE employs optical-flow motion estimation to determine motion information, which is combined with the outputs of the spatiotemporal Gabor filters to capture temporal distortion. These spatial and temporal components are combined into an overall estimate of video quality.

2.5. Summary

In summary, although previous VQA algorithms have analyzed the effects of spatial and temporal interactions on video quality, none have estimated video quality based on spatiotemporal slices (STS images), which contain important spatiotemporal information on a longer time scale. Earlier related work was performed by Péchard et al.,57 who used spatiotemporal tubes rather than slices for VQA. Their algorithm employs segmentation to create spatiotemporal tubes, which are coherent in terms of motion and spatial activity. Similar to our STS images, the spatiotemporal tubes permit analysis of spatiotemporal information on a long time scale, and Péchard et al. demonstrated the superiority of their approach compared to other VQA algorithms on videos containing H.264 artifacts.

In the following section, we describe our HVS-based VQA algorithm, ViS3, which employs measures of both motion-weighted spatial distortion and spatiotemporal dissimilarity of the STS images to estimate perceived video quality degradation.

3. Algorithm

The ViS3 algorithm estimates video quality degradation by using the luminance components of the reference and distorted videos in YUV color space. We denote I as the cuboid representation of the Y component of the reference video, and we denote I^ as the cuboid representation of the Y component of the distorted video.

The ViS3 algorithm employs a combination of both spatial and spatiotemporal analyses to estimate the perceived video quality degradation of the distorted video I^ in comparison to the reference video I. Figure 5 shows a block diagram of the ViS3 algorithm, which measures spatial distortion and spatiotemporal dissimilarity separately via two main stages:

  • Spatial distortion: This stage estimates the average perceived distortion that occurs spatially in every group of frames (GOF). A motion-weighting scheme is used to model the effect of motion on the visibility of distortion. These per-group spatial distortion values are then combined into a single scalar, ViS1, which denotes an estimate of overall perceived video quality degradation due to spatial distortion.

  • Spatiotemporal dissimilarity: The spatiotemporal dissimilarity stage estimates video quality degradation by computing the spatiotemporal dissimilarity of the STS images extracted from the reference and distorted videos via the differences of spatiotemporal responses of modeled visual neurons. These per-STS-image spatiotemporal dissimilarity values are then combined into a single scalar, ViS2, which denotes an estimate of overall perceived video quality degradation due to spatiotemporal dissimilarity.

Fig. 5

Block diagram of the ViS3 algorithm. The spatial distortion stage is applied to groups of normal video frames extracted in a front-to-back fashion to compute spatial distortion value ViS1. The spatiotemporal dissimilarity value ViS2 is computed from the STS images extracted in a left-to-right fashion and a top-to-bottom fashion. The final scalar output of the ViS3 algorithm is computed via a geometric mean of the spatial distortion and spatiotemporal dissimilarity values.


Finally, the spatial distortion value ViS1 and the spatiotemporal dissimilarity value ViS2 are combined via a geometric mean to yield a single scalar ViS3 that represents the overall perceived quality degradation of the video. The following subsections provide details of each stage of the algorithm.
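Read as an unweighted geometric mean of the two stage outputs, the final combination amounts to the square root of their product; a short sketch (with ViS1 and ViS2 assumed already computed, and the unweighted form taken as stated in the text) is:

```python
import numpy as np

def combine_vis3(vis1, vis2):
    # Unweighted geometric mean of the spatial and spatiotemporal stage outputs.
    return float(np.sqrt(vis1 * vis2))
```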

3.1. Spatial Distortion

In the spatial distortion stage, we employ and extend our MAD algorithm,26 which was designed for still images, to measure spatial distortion in each GOF of the video. The MAD algorithm is composed of two separate strategies: (1) a detection-based strategy, which computes the perceived distortion due to visual detection (denoted by ddetect) and (2) an appearance-based strategy, which computes the perceived distortion due to visual appearance changes (denoted by dappear). The perceived distortion due to visual detection is measured by using a masking-weighted block-based mean-squared error in the lightness domain. The perceived distortion due to visual appearance changes is measured by computing the average differences between the block-based log-Gabor statistics of the reference and distorted images.

The MAD index of the distorted image is computed via a weighted geometric mean:

Eq. (1)

$$\alpha = \frac{1}{1 + \beta_1 \times (d_{\mathrm{detect}})^{\beta_2}},$$

Eq. (2)

$$\mathrm{MAD} = (d_{\mathrm{detect}})^{\alpha} \times (d_{\mathrm{appear}})^{1-\alpha},$$
where the weight α ∈ [0, 1] serves to adaptively combine the two strategies (ddetect and dappear) based on the overall level of distortion. As described in Ref. 26, for high-quality images, MAD should obtain its value mostly from ddetect, whereas for low-quality images, MAD should obtain its value mostly from dappear. Thus, an initial estimate of the quality level is required in order to determine the proper weighting (α) of the two strategies. In Ref. 26, the value of ddetect served as this initial estimate, and thus α is a function of ddetect. The two free parameters β1=0.467 and β2=0.130 were obtained after training on the A57 image database;58 see Ref. 26 for a complete description of the MAD algorithm.
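As a concrete illustration, the following minimal sketch implements Eqs. (1) and (2) with the published parameter values; ddetect and dappear are assumed to have already been produced by the two strategies of Ref. 26.

```python
def mad_index(d_detect, d_appear, beta1=0.467, beta2=0.130):
    """Adaptive combination of MAD's two strategies, Eqs. (1) and (2)."""
    alpha = 1.0 / (1.0 + beta1 * d_detect ** beta2)   # weight driven by the detection estimate
    return (d_detect ** alpha) * (d_appear ** (1.0 - alpha))
```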

To extend MAD for use with video, we take the Y components of the videos and perform the following steps (shown in Fig. 6) on each group of N consecutive frames:

  • 1. Compute a visible distortion map for each frame by using MAD’s detection-based strategy. The maps computed from all frames in each GOF are then averaged to yield a GOF-based visible distortion map.

  • 2. Compute a statistical difference map for each frame by using MAD’s appearance-based strategy. The maps computed from all frames in each GOF are then averaged to yield a GOF-based statistical difference map.

  • 3. Estimate the magnitude of the motion vectors in each frame of the reference video by using the Lucas–Kanade optical flow method.59 The motion magnitude maps computed from all frames in each GOF are averaged to yield a GOF-based motion magnitude map.

  • 4. Combine the three GOF-based maps into a single spatial distortion map; the root mean squared (RMS) value of this map serves as the spatial distortion value of the GOF. The estimated spatial distortion values of all GOFs are combined via an arithmetic mean to yield a single scalar that represents the perceived video quality degradation due to spatial distortion.

Fig. 6

Block diagram of the spatial distortion stage. The extracted frames from the reference and distorted videos are used to compute a visible distortion map and a statistical difference map of each group of frames (GOF). Motion estimation is performed on the reference video frames and used to model the effect of motion on the visibility of distortion. All maps are combined and collapsed to yield a spatial distortion value ViS1.


The video frames are extracted from the Y components of the reference and distorted videos. Let It(x,y) denote the t'th frame of the reference video I, and let I^t(x,y) denote the t'th frame of the distorted video I^, where t ∈ [1, T] denotes the frame (time) index and T denotes the number of frames in video I. These video frames are then divided into groups of N consecutive frames for both the reference and the distorted video. The following subsections describe the details of each step.

3.1.1. Compute visible distortion map

We apply the detection-based strategy from Ref. 26 to all pairs of respective frames from the reference video and the distorted video. A block diagram of this detection-based strategy is provided in Fig. 7.

Fig. 7

Block diagram of the detection-based strategy used to compute a visible distortion map. Both the reference and the distorted frames are converted to perceived luminance and filtered by a contrast sensitivity function. By comparing the local contrast of the reference frame L and the error frame ΔL, we obtain a local distortion visibility map. This map is then weighted by local mean squared error to yield a visible distortion map.


Detection-based strategy

As illustrated in Fig. 7, a preprocessing step is first performed by using the nonlinear luminance conversion and spatial contrast sensitivity function filtering. Then, models of luminance and contrast masking are used to compute a local distortion visibility map. Next, this map is weighted by local mean squared error (MSE) to yield a visible distortion map. The specific steps are given below (see Ref. 26 for additional details).

First, to account for the nonlinear relationship between digital pixel values and physical luminance of typical display media, the video I is converted to a perceived luminance video L via

Eq. (3)

$$L = (a + kI)^{\gamma/3},$$
where the parameters a, k, and γ are constants specific to the device on which the video is displayed. For 8-bit pixel values and an sRGB display, these parameters are given by a=0, k=0.02874, and γ=2.2. The division by 3 attempts to take into account the nonlinear HVS response to luminance by converting luminance into perceived luminance (relative lightness).
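A sketch of the conversion in Eq. (3) for 8-bit pixel values and the sRGB constants given above:

```python
import numpy as np

def perceived_luminance(frame, a=0.0, k=0.02874, gamma=2.2):
    """Convert 8-bit pixel values to relative lightness, Eq. (3)."""
    return (a + k * frame.astype(np.float64)) ** (gamma / 3.0)
```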

Next, the contrast sensitivity function (CSF) is applied by filtering both the reference frame L and the error frame ΔL = L − L^. The filtering is performed in the frequency domain via

Eq. (4)

$$\tilde{L} = F^{-1}\left[H(u,v) \times F[L]\right],$$
where F and F−1 denote the discrete Fourier transform (DFT) and inverse DFT, respectively, and H(u,v) is the DFT-based version of the CSF defined by Eq. (3) in Ref. 26.
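The frequency-domain filtering of Eq. (4) can be sketched as follows; csf_2d stands in for the DFT-based CSF H(u,v) of Ref. 26, which is not reproduced here, and is assumed to be laid out to match np.fft.fft2 (DC term at index [0, 0]).

```python
import numpy as np

def csf_filter(lightness, csf_2d):
    """Filter a lightness frame by a CSF given as a 2-D frequency response, Eq. (4)."""
    spectrum = np.fft.fft2(lightness)
    filtered = np.fft.ifft2(csf_2d * spectrum)   # csf_2d plays the role of H(u, v)
    return np.real(filtered)
```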

To account for the fact that the presence of an image can reduce the detectability of distortions, MAD employs a simple spatial-domain measure of contrast masking.

First, a local contrast map is computed for the reference frame in the lightness domain by dividing L˜ into 16×16 blocks (with 75% overlap between neighboring blocks) and then measuring the RMS contrast of each block. The RMS contrast of block b of L˜ is computed via

Eq. (5)

$$C_{\mathrm{ref}}(b) = \tilde{\sigma}_{\mathrm{ref}}(b) / \mu_{\mathrm{ref}}(b),$$
where μref(b) denotes the mean of block b of L˜, and σ˜ref(b) denotes the minimum of the standard deviations of the four 8×8 subblocks of b. The block size of 16×16 was chosen because it is large enough to accommodate division into reasonably sized subblocks (to avoid overestimating the contrast around edges), but small enough to yield decent spatial localization (see Appendix A in Ref. 26).

Cref(b) is a measure of the local RMS contrast in the reference frame and is thus independent of the distortions. Accordingly, we next compute a local contrast map for the error frame to account for the spatial distribution of the distortions in the distorted frame. The error frame ΔL is divided into 16×16 blocks (with 75% overlap between blocks), and then the RMS contrast Cerr(b) for each block b is computed via

Eq. (6)

$$C_{\mathrm{err}}(b) = \begin{cases} \sigma_{\mathrm{err}}(b)/\mu_{\mathrm{ref}}(b) & \text{if } \mu_{\mathrm{ref}}(b) > 0.5 \\ 0 & \text{otherwise}, \end{cases}$$
where σerr(b) denotes the standard deviation of block b of ΔL. A lightness threshold of 0.5 is employed to account for the fact that the HVS is relatively insensitive to changes in extremely dark regions.

The local contrast maps are computed for both the reference frame and the error frame for every block b of size 16×16 with 75% overlap between neighboring blocks. The two local contrast maps {Cref} and {Cerr} are used to compute a local distortion visibility map denoted by ξ(b) via

Eq. (7)

$$\xi(b) = \begin{cases} \ln[C_{\mathrm{err}}(b)] - \ln[C_{\mathrm{ref}}(b)] & \text{if } \ln[C_{\mathrm{err}}(b)] > \ln[C_{\mathrm{ref}}(b)] > -5 \\ \ln[C_{\mathrm{err}}(b)] + 5 & \text{if } \ln[C_{\mathrm{err}}(b)] > -5 \geq \ln[C_{\mathrm{ref}}(b)] \\ 0 & \text{otherwise}. \end{cases}$$

The local distortion visibility map ξ is then point-by-point multiplied by the local MSE to determine a visible distortion map denoted by ϒD, where the superscript D is used to imply that the map is computed from the detection-based strategy. The visible distortion at the location of block b is given by

Eq. (8)

$$\Upsilon^{D}(b) = \xi(b) \cdot \mathrm{MSE}(b).$$

Note that in Ref. 26, the visible distortion map ϒD is collapsed into a single scalar that represents the perceived distortion due to visual detection, ddetect, computed via $d_{\mathrm{detect}} = \sqrt{\sum_b [\Upsilon^{D}(b)]^{2}}$, where the summation is over all blocks. In the current paper, we do not collapse ϒD.
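Putting Eqs. (5) to (8) together, a simplified block-based sketch of the visible distortion map is shown below. It uses 16×16 blocks with a stride of 4 pixels (75% overlap), stores one value per block position, and omits the luminance conversion and CSF filtering described above for brevity; it is an illustration of the computation, not the authors' implementation.

```python
import numpy as np

def visible_distortion_map(L_ref, L_err, block=16, stride=4):
    """Block-based sketch of Eqs. (5)-(8): local contrasts, visibility, MSE weighting."""
    H, W = L_ref.shape
    rows = (H - block) // stride + 1
    cols = (W - block) // stride + 1
    vis_map = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            r, c = i * stride, j * stride
            ref_blk = L_ref[r:r + block, c:c + block]
            err_blk = L_err[r:r + block, c:c + block]

            mu_ref = ref_blk.mean()
            # Minimum standard deviation over the four 8x8 sub-blocks, Eq. (5).
            subs = [ref_blk[a:a + 8, b:b + 8] for a in (0, 8) for b in (0, 8)]
            sigma_min = min(s.std() for s in subs)
            c_ref = sigma_min / mu_ref if mu_ref > 0 else 0.0

            # Error contrast with the dark-region threshold, Eq. (6).
            c_err = err_blk.std() / mu_ref if mu_ref > 0.5 else 0.0

            # Local distortion visibility, Eq. (7) (natural logs, -5 threshold).
            ln_ref = np.log(c_ref) if c_ref > 0 else -np.inf
            ln_err = np.log(c_err) if c_err > 0 else -np.inf
            if ln_err > ln_ref > -5:
                xi = ln_err - ln_ref
            elif ln_err > -5 >= ln_ref:
                xi = ln_err + 5
            else:
                xi = 0.0

            # Visible distortion = visibility times local MSE, Eq. (8).
            vis_map[i, j] = xi * np.mean(err_blk ** 2)
    return vis_map
```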

Apply to groups of video frames

Let ϒtD denote the visible distortion map computed from the t'th frame of the reference video and the t'th frame of the distorted video. The visible distortion maps computed from all frames in the k'th GOF are {ϒN(k−1)+1D, ϒN(k−1)+2D, …, ϒNkD}, where k ∈ {1, 2, …, K} is the GOF index and K is the number of GOFs in the video. These maps are combined via a point-by-point average to yield a GOF-based visible distortion map of the k'th GOF, which is denoted by ϒ¯kD.

Eq. (9)

$$\bar{\Upsilon}^{D}_{k} = \frac{1}{N}\sum_{\tau=1}^{N} \Upsilon^{D}_{N(k-1)+\tau}.$$

3.1.2. Compute statistical difference map

As argued in Ref. 26, when the distortions in the image are highly suprathreshold, perceived distortion is better modeled by quantifying the extent to which the distortions degrade the appearance of the image’s subject matter. The appearance-based strategy measures local statistics of multiscale log-Gabor filter responses to capture changes in visual appearance. Figure 8 shows a block diagram of the appearance-based strategy used to compute a statistical difference map between the reference and the distorted frame.

Fig. 8

Block diagram of the appearance-based strategy used to compute a statistical difference map. The reference and the distorted frames are decomposed into different subbands using a two-dimensional log-Gabor filter-bank. Local standard deviation, skewness, and kurtosis are computed for each subband of both the reference and the distorted frames. The differences of local standard deviation, skewness, and kurtosis between each subband of the reference frame and the respective subband of the distorted frame are combined into a statistical difference map.


Appearance-based strategy

The appearance-based strategy employs a computational neural model using a log-Gabor filter-bank (with five scales s ∈ {1, 2, 3, 4, 5} and four orientations o ∈ {1, 2, 3, 4}), which implements both even-symmetric (cosine-phase) and odd-symmetric (sine-phase) filters. The even and odd filter outputs are then combined to yield magnitude-only subband values. Let {Rs,o} and {R^s,o} denote the sets of log-Gabor subbands computed for a reference and a distorted frame, respectively, where each subband is the same size as the frames.

The standard deviation, skewness, and kurtosis are then computed for each block b of size 16×16 (with 75% overlap between blocks) for each log-Gabor subband of the reference frame and the distorted frame. Let σs,o(b), ςs,o(b), and κs,o(b) denote the standard deviation, skewness, and kurtosis computed from block b of subband Rs,o. Let σ^s,o(b), ς^s,o(b), and κ^s,o(b) denote the standard deviation, skewness, and kurtosis computed from block b of subband R^s,o. The statistical difference map is computed as the weighted combination of the differences in standard deviation, skewness, and kurtosis for all subbands. We denote ΥA as the statistical difference map, where the superscript A is used to imply that the map is computed from the appearance-based strategy. Specifically, the statistical difference at the location of block b is given by

Eq. (10)

$$\Upsilon^{A}(b) = \sum_{s=1}^{5}\sum_{o=1}^{4} w_{s}\left[\,|\sigma_{s,o}(b) - \hat{\sigma}_{s,o}(b)| + 2\,|\varsigma_{s,o}(b) - \hat{\varsigma}_{s,o}(b)| + |\kappa_{s,o}(b) - \hat{\kappa}_{s,o}(b)|\,\right],$$
where the scale-specific weights ws={0.5,0.75,1,5,6} (for the finest to coarsest scales, respectively) are chosen the same as in Ref. 26 to account for the HVS’s preference for coarse scales over fine scales (see Ref. 26 for more details).

Note that in Ref. 26, the statistical difference map ϒA is collapsed into a single scalar that represents the perceived distortion due to visual appearance changes, dappear, computed via $d_{\mathrm{appear}} = \sqrt{\sum_b [\Upsilon^{A}(b)]^{2}}$, where the summation is over all blocks. In the current paper, we do not collapse ϒA.
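A sketch of the statistical difference of Eq. (10) at a single block location is given below; the log-Gabor magnitude subbands are assumed to have been computed beforehand (the filter-bank follows Ref. 26 and is not reproduced here), and SciPy's sample skewness and excess kurtosis are used purely for illustration.

```python
import numpy as np
from scipy.stats import skew, kurtosis

W_SCALE = [0.5, 0.75, 1.0, 5.0, 6.0]   # finest to coarsest scale weights, as in Ref. 26

def stat_difference(ref_subbands, dst_subbands, r, c, block=16):
    """Statistical difference at one 16x16 block location, Eq. (10).

    ref_subbands / dst_subbands: nested lists indexed [scale][orientation] of
    log-Gabor magnitude subbands, each the same size as the frame.
    """
    total = 0.0
    for s in range(5):
        for o in range(4):
            blk_r = ref_subbands[s][o][r:r + block, c:c + block].ravel()
            blk_d = dst_subbands[s][o][r:r + block, c:c + block].ravel()
            d_std  = abs(blk_r.std() - blk_d.std())
            d_skew = abs(skew(blk_r) - skew(blk_d))
            d_kurt = abs(kurtosis(blk_r) - kurtosis(blk_d))
            total += W_SCALE[s] * (d_std + 2.0 * d_skew + d_kurt)
    return total
```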

Apply to groups of video frames

Let ΥtA denote the statistical difference map computed from the t'th frame of the reference video and the t'th frame of the distorted video. The statistical difference maps computed from all frames in the k'th GOF are {ϒN(k−1)+1A, ϒN(k−1)+2A, …, ϒNkA}, where k ∈ {1, 2, …, K} is the GOF index and K is the number of GOFs in the video. These maps are combined via a point-by-point average to yield a GOF-based statistical difference map of the k'th GOF, which is denoted by ϒ¯kA.

Eq. (11)

$$\bar{\Upsilon}^{A}_{k} = \frac{1}{N}\sum_{\tau=1}^{N} \Upsilon^{A}_{N(k-1)+\tau}.$$

3.1.3. Optical-flow motion estimation

Both the detection-based strategy and the appearance-based strategy were designed for still images; they do not account for the effects of motion on the visibility of distortion. One attribute of motion that affects the visibility of distortion in video is the speed of motion (i.e., the magnitude of the motion vectors). According to Wang et al.5 and Barkowsky et al.,18 the visibility of distortion is significantly reduced when the speed of motion is large. Conversely, distortion in slow-moving regions is more visible than distortion in fast-moving regions.

To model this effect of motion, we measure the speed of motion in different regions of the video by using an optical flow algorithm. We specifically apply the optical flow method designed by Lucas and Kanade59 to the reference video to estimate motion vectors. The Lucas–Kanade method assumes that the displacement of the frame contents between two nearby frames is small and approximately constant within a neighborhood (window) of a point under consideration. Thus, the optical-flow motion vector can be assumed the same within a window centered at that point, and it is computed from solving the optical-flow equations using the least squares criterion.

By using a window of size 8×8, for each pair of consecutive frames, we obtain two matrices of motion vectors, Mv and Mh, with respect to the vertical and horizontal directions. The motion magnitude matrix is then computed as M = √(Mv² + Mh²). Each element in this matrix represents the motion magnitude of a region defined by an 8×8 block in the frame.
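A minimal window-based Lucas–Kanade sketch in the spirit of this step: for each non-overlapping 8×8 window, the spatial and temporal derivatives are stacked and the flow vector is obtained by least squares. The derivative approximations and the small ridge term are illustrative simplifications rather than details of Ref. 59.

```python
import numpy as np

def motion_magnitude(frame1, frame2, win=8):
    """Per-8x8-window Lucas-Kanade flow magnitude between two consecutive frames."""
    f1 = frame1.astype(np.float64)
    f2 = frame2.astype(np.float64)
    # Simple spatial gradients of the first frame and a forward temporal difference.
    Ix = np.gradient(f1, axis=1)
    Iy = np.gradient(f1, axis=0)
    It = f2 - f1

    H, W = f1.shape
    rows, cols = H // win, W // win
    mag = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            sl = (slice(i * win, (i + 1) * win), slice(j * win, (j + 1) * win))
            A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)   # 64 x 2 system
            b = -It[sl].ravel()
            # Least-squares solution of A @ [u, v] = b (ridge term for stability).
            ATA = A.T @ A + 1e-6 * np.eye(2)
            u, v = np.linalg.solve(ATA, A.T @ b)
            mag[i, j] = np.hypot(u, v)
    return mag
```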

Let Mt denote the motion magnitude matrix computed from the t'th video frame and its successive frame, where t = 1, 2, …, T−1 denotes the frame index and T is the number of frames in the video. For the k'th GOF of the reference video, the motion magnitude matrices computed from all N of its frames are averaged to yield an average motion magnitude matrix via

Eq. (12)

$$\bar{M}_{k} = \frac{1}{N}\sum_{\tau=1}^{N} M_{N(k-1)+\tau}.$$

Note that the sizes of Mt and M¯k are both 64 times smaller than a regular frame because each value in these matrices represents motion magnitude of an 8×8 window in the regular frame. We therefore resize the M¯k matrix to the size of the video frame by using nearest-neighbor interpolation to obtain the GOF-based motion magnitude map of the k’th GOF denoted by ϒ¯kM, where the superscript M is used to imply that the map is computed from the motion magnitudes.

3.1.4. Combine maps and compute spatial distortion value

For each GOF, we have computed the GOF-based visible distortion map ϒ¯D, the GOF-based statistical difference map ϒ¯A, and the GOF-based motion magnitude map ϒ¯M. Now, we extend and apply Eq. (2) to respective regions of the visible distortion map and the statistical difference map to obtain the GOF-based most apparent distortion map. This map is then point-by-point weighted by the motion magnitude map ϒ¯kM to yield the spatial distortion map of the k'th GOF, denoted Δk(x,y), which has the same size W×H as a video frame. Specifically, the value at (x,y) of the spatial distortion map Δk(x,y) is computed via

Eq. (13)

$$\hat{\alpha}(x,y) = \frac{1}{1 + \beta_1 \times \left[\bar{\Upsilon}^{D}_{k}(x,y)\right]^{\beta_2}},$$

Eq. (14)

$$\Delta_{k}(x,y) = \frac{\left[\bar{\Upsilon}^{D}_{k}(x,y)\right]^{\hat{\alpha}(x,y)} \times \left[\bar{\Upsilon}^{A}_{k}(x,y)\right]^{1-\hat{\alpha}(x,y)}}{1 + \bar{\Upsilon}^{M}_{k}(x,y)}.$$

The division by 1 + ϒ¯kM(x,y) accounts for the fact that distortion in slow-moving regions is generally more visible than distortion in fast-moving regions. When the value in the motion magnitude map ϒ¯kM is relatively large (the corresponding spatial region is fast-moving), the visible distortion value in Δk(x,y) is relatively small; when the value in the motion magnitude map ϒ¯kM is relatively small (the corresponding spatial region is slow-moving), the visible distortion value in Δk(x,y) is relatively large. When there is no motion in the region, the visible distortion is determined solely by ϒ¯kD and ϒ¯kA.

Figure 9 shows examples of the first frame (a) and the last frame (b) of a specific GOF of video mc2_50fps.yuv from the LIVE video database.24 The visible distortion map (c), the statistical difference map (d), the motion magnitude map (e), and the spatial distortion map (f) computed for this GOF are also shown. As seen from the visible distortion map (c) and the statistical difference map (d), at the regions of high visible distortion level (i.e., the train, the numbers in the calendar), the spatial distortion map is weighted more by the statistical difference map. At the regions of low visible distortion level (i.e., the wall background), the spatial distortion map is weighted more by the visible distortion map.

Fig. 9

Examples of the first and last frames [(a) and (b)], the visible distortion map (c), the statistical difference map (d), the motion magnitude map (e), and the spatial distortion map (f) computed for a specific GOF of the video mc2_50fps.yuv from the LIVE video database.24 All maps have been normalized in contrast to promote visibility. Note that the brighter the maps, the more distorted the corresponding spatial region of the GOF; for the motion magnitude map, the brighter the map, the faster the motion in the corresponding spatial region of the GOF.


As also seen from Figs. 9(c) and 9(d), the region corresponding to the train at the bottom of the frames is more heavily distorted than the other regions. However, due to the fast movement of the train, which is reflected in the bottom of the motion magnitude map (e), the visibility of distortion is reduced, making this region less bright in the spatial distortion map (f).

To estimate the spatial distortion value of each GOF, we compute the RMS value of the spatial distortion map. The RMS value of the map Δk(x,y) of size W×H is given by

Eq. (15)

$$\bar{\Delta}^{XY}_{k} = \sqrt{\frac{1}{W \times H}\sum_{x=1}^{W}\sum_{y=1}^{H}\left[\Delta_{k}(x,y)\right]^{2}},$$
where the superscript XY is used to remind readers that the value is computed from the normal frames with two dimensions x and y. The overall perceived spatial distortion value, denoted by ViS1, is computed as the arithmetic mean of all spatial distortion values Δ¯kXY via

Eq. (16)

$$\mathrm{ViS}_{1} = \frac{1}{K}\sum_{k=1}^{K}\bar{\Delta}^{XY}_{k}.$$

Here, ViS1 is a single scalar that represents the overall perceived quality degradation of the video due to spatial distortion. The lower the ViS1 value, the better the video quality. A value ViS1=0 indicates that the distorted video is equal in quality to the reference video.
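Collecting Eqs. (13) to (16), the per-GOF combination, RMS pooling, and final averaging can be sketched as follows; the three GOF-based maps are assumed to have already been computed and resized to the frame size.

```python
import numpy as np

BETA1, BETA2 = 0.467, 0.130   # parameters of Eq. (13), as given in the text

def gof_spatial_distortion(det_map, app_map, mot_map):
    """Per-pixel adaptive combination and motion weighting, Eqs. (13)-(14), then RMS, Eq. (15)."""
    alpha = 1.0 / (1.0 + BETA1 * det_map ** BETA2)
    delta = (det_map ** alpha) * (app_map ** (1.0 - alpha)) / (1.0 + mot_map)
    return float(np.sqrt(np.mean(delta ** 2)))       # RMS over the frame

def vis1(gof_maps):
    """Arithmetic mean of the per-GOF values, Eq. (16).

    gof_maps: iterable of (det_map, app_map, mot_map) triples, one per GOF.
    """
    return float(np.mean([gof_spatial_distortion(*m) for m in gof_maps]))
```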

3.2. Spatiotemporal Dissimilarity

In the distorted video, the distortion impacts not only the spatial relationship between neighboring pixels within the current frame, but also the transition between frames, which can be captured via the use of STS images. The difference between the STS images from the reference and distorted videos is referred to as the spatiotemporal dissimilarity in this paper. If the spatiotemporal dissimilarity between the STS images is small, the distorted video has high quality relative to the reference video; if the spatiotemporal dissimilarity between the STS images is large, the distorted video has low quality relative to the reference video. Figure 10 depicts a block diagram of the spatiotemporal dissimilarity stage, which estimates the spatiotemporal dissimilarity between the reference and the distorted video via the following steps:

  • 1. Extract the vertical and horizontal STS images in the lightness domain.

  • 2. Compute a spatiotemporal correlation map of the STS images.

  • 3. Filter the STS images by using a set of spatiotemporal filters. These spatiotemporally filtered images are used to compute a map of spatiotemporal response differences.

  • 4. Combine the above two maps into a spatiotemporal dissimilarity map and collapse this map into a spatiotemporal dissimilarity value. These per-STS-image dissimilarity values are combined into a single scalar, ViS2, which denotes the overall perceived video spatiotemporal dissimilarity.

Fig. 10

Block diagram of the spatiotemporal dissimilarity stage of the ViS3 algorithm. The STS images are extracted from the perceived luminance videos. The spatiotemporal correlation and the difference of spatiotemporal responses are computed in a block-based fashion and combined to yield a spatiotemporal dissimilarity map. All maps are then collapsed by using root mean square and combined to yield the spatiotemporal dissimilarity value ViS2 of the distorted video.


The following subsections describe the details of each step.

3.2.1. Extract the STS images

The reference video I and the distorted video I^ are converted to perceived luminance videos L and L^, respectively, using Eq. (3). Let Sx(t,y) denote the vertical STS image of the video cuboid L, where x ∈ [1, W] denotes the vertical slice (column) index and W denotes the spatial width of the video (measured in pixels). As shown previously in Fig. 1, these vertical STS images contain temporal information in the horizontal direction and spatial information in the vertical direction. Thus, for a video containing T frames, Sx(t,y) will be of size T×H, where H denotes the spatial height of the video (measured in pixels). There are W such STS images S1(t,y), S2(t,y), …, SW(t,y).

Similarly, let Sy(x,t) denote the horizontal STS image of the video cuboid L, where y ∈ [1, H] denotes the horizontal slice (row) index and H denotes the spatial height of the video. These horizontal STS images contain spatial information in the vertical direction and temporal information in the horizontal direction. Thus, for a video containing T frames, Sy(x,t) will be of size W×T, and there are H such STS images S1(x,t), S2(x,t), …, SH(x,t).

The STS images extracted from the reference video [Sx(t,y), Sy(x,t)] and the STS images extracted from the distorted video [S^x(t,y), S^y(x,t)] are then used to compute the spatiotemporal dissimilarity values. This procedure consists of two main steps: (1) compute the spatiotemporal correlation maps and (2) compute the spatiotemporal response difference maps.

3.2.2. Compute spatiotemporal correlation map

One simple way to measure the spatiotemporal dissimilarity is by using the local linear correlation coefficients of the STS images extracted from the reference and the distorted videos. If the distorted video has perfect quality relative to the reference video, these two videos should have high correlation in the STS images; if the distorted video has low quality relative to the reference video, the spatiotemporal correlation will be low.

Let ρ(b) denote the linear correlation coefficient computed from block b of the two STS images Sx(t,y) and S^x(t,y). We define the local spatiotemporal correlation coefficient ρ˜(b) of these two blocks as

Eq. (17)

$$\tilde{\rho}(b) = \begin{cases} 0 & \text{if } \rho(b) < 0 \\ 1 & \text{if } \rho(b) > 0.9 \\ \rho(b) & \text{otherwise}. \end{cases}$$

As shown in Eq. (17), if the two blocks are highly positively correlated, we set ρ˜(b)=1. The threshold value of 0.9 was chosen empirically so that a relatively high positive correlation (ρ>0.9) is still considered perfect by the algorithm. As we demonstrate in the online supplement to this paper,60 the performance of the algorithm is relatively robust to small changes in this threshold value. On the other hand, if the two blocks are negatively correlated, we set ρ˜(b)=0 to reflect the dissimilarity between the two blocks.

This process is performed on every block of size 16×16 with 75% overlap between neighboring blocks, yielding a spatiotemporal correlation map denoted by Px(t,y) between Sx(t,y) and S^x(t,y). Similarly, we compute a spatiotemporal correlation map denoted by Py(x,t) between Sy(x,t) and S^y(x,t). Examples of the correlation maps are shown in Fig. 11(c). The brighter the maps, the higher the spatiotemporal correlation between corresponding regions of the two STS images.
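A block-based sketch of Eq. (17) is given below; np.corrcoef supplies the linear correlation coefficient, and blocks with (near-)constant content, for which the coefficient is undefined, are treated as perfectly correlated here, which is an illustrative choice rather than a detail from the text.

```python
import numpy as np

def st_correlation_map(sts_ref, sts_dst, block=16, stride=4):
    """Clipped block-wise correlation between two STS images, Eq. (17)."""
    H, W = sts_ref.shape
    rows = (H - block) // stride + 1
    cols = (W - block) // stride + 1
    pmap = np.ones((rows, cols))
    for i in range(rows):
        for j in range(cols):
            r, c = i * stride, j * stride
            a = sts_ref[r:r + block, c:c + block].ravel()
            b = sts_dst[r:r + block, c:c + block].ravel()
            if a.std() == 0 or b.std() == 0:
                continue                  # constant block: leave as 1 (illustrative choice)
            rho = np.corrcoef(a, b)[0, 1]
            if rho < 0:
                pmap[i, j] = 0.0
            elif rho > 0.9:
                pmap[i, j] = 1.0
            else:
                pmap[i, j] = rho
    return pmap
```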

Fig. 11

Demonstrative maps for two pairs of STS images Sy(x,t) and S^y(x,t) from videos mc2_50fps.yuv (LIVE) and PartyScene_dst_09.yuv (CSIQ) with the correlation maps Py(x,t), the log of response difference maps Dy(x,t), and spatiotemporal dissimilarity maps Δy(x,t). All maps have been normalized to promote visibility. Note that the brighter the spatiotemporal dissimilarity maps Δy(x,t), the more dissimilar the corresponding regions in the STS images.


3.2.3. Compute spatiotemporal response difference map

The spatiotemporal correlation coefficient computed in Sec. 3.2.2 does not account for the HVS’s response to joint spatiotemporal characteristics of the video. Therefore, in addition to measuring the spatiotemporal correlation, we employ a computational HVS model that takes into account joint spatiotemporal perception based on the work of Watson and Ahumada in Ref. 27. This model applies separate 1-D filters to each dimension of the STS images to measure spatiotemporal responses. In Ref. 23, Adelson and Bergen used these spatiotemporal responses to measure energy of motion in a video. Here, we apply the model to the STS images and measure the differences of spatiotemporal responses to estimate video quality.

Decompose STS images into spatiotemporally filtered images

As stated by Adelson and Bergen in Ref. 23, the spatiotemporal information presented in the STS images can be captured via a set of spatiotemporally oriented filters. As suggested by Watson and Ahumada,27 these filters can be constructed by two sets of separate 1-D filters (spatial and temporal) with appropriate spatiotemporal characteristics. Following this suggestion, we employ a set of log-Gabor 1-D filters {gs}, s ∈ {1,2,3,4,5}, as the spatial filters, where the frequency response of each filter is given by

Eq. (18)

$$G_s(\omega)=\exp\!\left[-\frac{\left(\ln\left|\dfrac{\omega}{\omega_s}\right|\right)^{2}}{2\,(\ln B_s)^{2}}\right],$$
where Gs, ωs, and Bs denote the frequency response, center frequency, and bandwidth of the filter gs, respectively, and ω denotes the 1-D spatial frequency. The bandwidth Bs is held constant for all scales to obtain a constant filter shape. We specifically choose five scales and a filter bandwidth of approximately two octaves (Bs=0.55). These filters are essentially the log-Gabor filters used in Ref. 26, but without the orientation information.

The two temporal filters {hz}, z ∈ {1,2}, were selected following the Adelson–Bergen model.23 The impulse response at time instance t of each filter is given by

Eq. (19)

$$h_z(t)=t^{\,n_z}\exp(-t)\left[\frac{1}{n_z!}-\frac{t^{2}}{(n_z+2)!}\right],$$
where n1=6 and n2=9 were chosen to approximate the temporal contrast sensitivity functions reported by Robson,61 corresponding to fast and slow motion, respectively.
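For reference, a minimal sketch of the two filter banks is given below (Python/NumPy). The log-Gabor center frequencies and the temporal support are illustrative assumptions; only the number of scales, the bandwidth parameter Bs=0.55, and the exponents n1=6 and n2=9 are fixed above.

```python
import numpy as np
from math import factorial

def log_gabor_response(omega, omega_s, B_s=0.55):
    """Frequency response of one log-Gabor spatial filter, Eq. (18).
    The response is defined to be zero at omega = 0."""
    omega = np.asarray(omega, dtype=float)
    G = np.zeros_like(omega)
    nz = omega != 0
    G[nz] = np.exp(-(np.log(np.abs(omega[nz] / omega_s)) ** 2)
                   / (2.0 * np.log(B_s) ** 2))
    return G

def temporal_impulse_response(t, n):
    """Adelson-Bergen temporal filter of Eq. (19) with exponent n."""
    t = np.asarray(t, dtype=float)
    return t ** n * np.exp(-t) * (1.0 / factorial(n) - t ** 2 / factorial(n + 2))

# Assumed center frequencies (one octave apart) and temporal support.
omega = np.linspace(0.0, 0.5, 256)              # normalized spatial frequency
spatial_bank = [log_gabor_response(omega, w_s)  # frequency responses G_s
                for w_s in (1/3, 1/6, 1/12, 1/24, 1/48)]

t = np.arange(0.0, 32.0)                        # assumed support, in frames
h_fast = temporal_impulse_response(t, n=6)      # n1 = 6 ("fast")
h_slow = temporal_impulse_response(t, n=9)      # n2 = 9 ("slow")
```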

The STS images are filtered along the spatial dimension by each spatial filter and then along the temporal dimension by each temporal filter to yield a spatiotemporally filtered image, which represents modeled spatiotemporal neural responses. With five spatial filters and two temporal filters, each STS image yields 10 spatiotemporally filtered images. We denote Rxs,z(t,y) and Rys,z(x,t), s ∈ {1,2,3,4,5} and z ∈ {1,2}, as the spatiotemporally filtered images obtained by filtering the STS images Sx(t,y) and Sy(x,t) from the reference video via spatial filter gs and temporal filter hz. These filtered images are computed via

Eq. (20)

$$R_x^{s,z}(t,y)=\left[S_x(t,y)\ast_y g_s\right]\ast_t h_z,$$

Eq. (21)

$$R_y^{s,z}(x,t)=\left[S_y(x,t)\ast_x g_s\right]\ast_t h_z,$$
where ∗d, d ∈ {x,y,t}, denotes the convolution along dimension d.

Similarly, we denote R^xs,z(t,y) and R^ys,z(x,t) as the spatiotemporally filtered images obtained by filtering the STS images S^x(t,y) and S^y(x,t) from the distorted video via spatial filter gs and temporal filter hz. Then, the spatiotemporal response differences ΔRxs,z(t,y) and ΔRys,z(x,t) are defined as the absolute difference of the spatiotemporally filtered images via

Eq. (22)

$$\Delta R_x^{s,z}(t,y)=\left|R_x^{s,z}(t,y)-\hat{R}_x^{s,z}(t,y)\right|,$$

Eq. (23)

$$\Delta R_y^{s,z}(x,t)=\left|R_y^{s,z}(x,t)-\hat{R}_y^{s,z}(x,t)\right|.$$

Although the best technique for estimating video quality based on these response differences remains an open research question, we employ, as discussed next, a simple yet effective measure based on the local standard deviation of the spatiotemporal response differences.
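A minimal sketch of the separable filtering of Eqs. (20) and (21) and the response differences of Eqs. (22) and (23) is given below. The spatial filters are supplied here as spatial-domain kernels, which is an assumed realization of the frequency-domain definition in Eq. (18); the boundary handling is likewise an assumption of the sketch.

```python
import numpy as np
from scipy.ndimage import convolve1d

def spatiotemporal_responses(S, spatial_kernels, temporal_kernels,
                             spatial_axis, temporal_axis):
    """Apply Eqs. (20)-(21): filter an STS image along its spatial axis with
    each g_s, then along its temporal axis with each h_z."""
    responses = []
    for g in spatial_kernels:
        Sg = convolve1d(S, g, axis=spatial_axis, mode='nearest')
        for h in temporal_kernels:
            responses.append(convolve1d(Sg, h, axis=temporal_axis, mode='nearest'))
    return responses  # 5 x 2 = 10 images, ordered (s=1,z=1), (s=1,z=2), ...

def response_differences(S_ref, S_dst, spatial_kernels, temporal_kernels,
                         spatial_axis, temporal_axis):
    """Absolute spatiotemporal response differences of Eqs. (22)-(23)."""
    R = spatiotemporal_responses(S_ref, spatial_kernels, temporal_kernels,
                                 spatial_axis, temporal_axis)
    R_hat = spatiotemporal_responses(S_dst, spatial_kernels, temporal_kernels,
                                     spatial_axis, temporal_axis)
    return [np.abs(r - r_hat) for r, r_hat in zip(R, R_hat)]

# For a vertical STS image S_x(t, y) of size T x H: spatial_axis=1, temporal_axis=0.
# For a horizontal STS image S_y(x, t) of size W x T: spatial_axis=0, temporal_axis=1.
```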

Compute log of response difference map

We compute the local mean and standard deviation of the spatiotemporal response differences in a block-based fashion. Let μxs,z(b) and σxs,z(b) denote the local mean and standard deviation computed from block b of the response difference ΔRxs,z(t,y). Let μys,z(b) and σys,z(b) denote the local mean and standard deviation computed from block b of the response difference ΔRys,z(x,t).

The adjusted standard deviation of block b of the error-filtered image at spatial frequency index s and temporal frequency index z is given by

Eq. (24)

$$\tilde{\sigma}_x^{s,z}(b)=\begin{cases}0, & \text{if } \mu_x^{s,z}(b)<p,\\ \dfrac{\sigma_x^{s,z}(b)\times\mu_x^{s,z}(b)}{p+\mu_x^{s,z}(b)}, & \text{otherwise,}\end{cases}$$

Eq. (25)

$$\tilde{\sigma}_y^{s,z}(b)=\begin{cases}0, & \text{if } \mu_y^{s,z}(b)<p,\\ \dfrac{\sigma_y^{s,z}(b)\times\mu_y^{s,z}(b)}{p+\mu_y^{s,z}(b)}, & \text{otherwise,}\end{cases}$$
where p=0.01 is a threshold value. When the mean value of block b is below this threshold, the regions at the location of block b in the two STS images are treated as having no dissimilarity; when the mean value of block b is large enough, the dissimilarity is approximately measured by the standard deviation of block b of the response differences.

This process is performed on every block of size 16×16 with 75% overlap between neighboring blocks, yielding maps of adjusted standard deviation σ˜xs,z(t,y) and σ˜ys,z(x,t). The log of response difference maps Dx(t,y) and Dy(x,t) are computed as a natural logarithm of a weighted sum of all the maps σ˜xs,z(t,y) and σ˜ys,z(x,t), respectively, as follows:

Eq. (26)

$$D_x(t,y)=\ln\!\left\{1+A\sum_{s=1}^{5}\sum_{z=1}^{2}w_s\left[\tilde{\sigma}_x^{s,z}(t,y)\right]^{2}\right\},$$

Eq. (27)

$$D_y(x,t)=\ln\!\left\{1+A\sum_{s=1}^{5}\sum_{z=1}^{2}w_s\left[\tilde{\sigma}_y^{s,z}(x,t)\right]^{2}\right\},$$
where the weights {ws}={0.5,0.75,1,5,6} were chosen following Ref. 26 to account for the HVS’s preference for coarse scales over fine scales. The addition of 1 prevents taking the logarithm of zero, and A=10^4 is a scaling factor that enlarges the adjusted variances. Examples of the log of response difference maps are shown in Fig. 11(d). The brighter the maps, the greater the difference in spatiotemporal responses between corresponding regions of the two STS images.
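The block-based computation of Eqs. (24) through (27) can be sketched as follows. As before, the maps are produced on the block grid, and the channel ordering (and hence the mapping of weights to scales) is an assumption of the sketch that must match the ordering of the filter bank.

```python
import numpy as np

WS = [0.5, 0.75, 1, 5, 6]   # w_s of Eq. (26); coarse scales weighted more
A = 1e4                     # scaling factor, interpreted here as 10^4
P_THRESH = 0.01             # p in Eqs. (24)-(25)

def adjusted_std_map(delta_R, block=16, stride=4, p=P_THRESH):
    """Block-based adjusted standard deviation of one response-difference
    image, Eqs. (24)-(25); returned on the block grid in this sketch."""
    rows = range(0, delta_R.shape[0] - block + 1, stride)
    cols = range(0, delta_R.shape[1] - block + 1, stride)
    out = np.zeros((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            blk = delta_R[r:r + block, c:c + block]
            mu, sigma = blk.mean(), blk.std()
            out[i, j] = 0.0 if mu < p else sigma * mu / (p + mu)
    return out

def log_response_difference_map(delta_R_list):
    """Eqs. (26)-(27): weighted sum of squared adjusted standard deviations
    over the 10 (s, z) channels, assumed ordered (s=1,z=1), (s=1,z=2), ...,
    followed by ln(1 + A * sum)."""
    acc = 0.0
    for k, dR in enumerate(delta_R_list):
        s = k // 2                                   # scale index 0..4
        acc = acc + WS[s] * adjusted_std_map(dR) ** 2
    return np.log1p(A * acc)
```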

3.2.4.

Compute spatiotemporal dissimilarity value

The spatiotemporal correlation map P and the log of response difference map D are combined into a spatiotemporal dissimilarity map via a point-by-point multiplication.

Eq. (28)

$$\Delta_x(t,y)=D_x(t,y)\cdot\left[1-P_x(t,y)\right],$$

Eq. (29)

$$\Delta_y(x,t)=D_y(x,t)\cdot\left[1-P_y(x,t)\right].$$

Let Δ¯cTY denote the RMS value of the spatiotemporal dissimilarity map Δc(t,y) of size T×H, where c is the column (vertical slice) index of the vertical STS images. Let Δ¯rXT denote the RMS value of the spatiotemporal dissimilarity map Δr(x,t) of size W×T, where r is the row (horizontal slice) index of the horizontal STS images. Specifically, these RMS values are computed as follows:

Eq. (30)

$$\bar{\Delta}_c^{TY}=\sqrt{\frac{1}{T\times H}\sum_{t=1}^{T}\sum_{y=1}^{H}\left[\Delta_c(t,y)\right]^{2}},$$

Eq. (31)

$$\bar{\Delta}_r^{XT}=\sqrt{\frac{1}{W\times T}\sum_{x=1}^{W}\sum_{t=1}^{T}\left[\Delta_r(x,t)\right]^{2}},$$
where W and H are the spatial width and height of the video frame, respectively, and T is the number of frames in the video. The superscripts TY and XT indicate the two dimensions of the STS images from which each value is computed. The spatiotemporal dissimilarity value, denoted by ViS2, between the reference and the distorted video is given by

Eq. (32)

$$\mathrm{ViS}_2=\sqrt{\frac{1}{W}\sum_{c=1}^{W}\left[\bar{\Delta}_c^{TY}\right]^{2}}+\sqrt{\frac{1}{H}\sum_{r=1}^{H}\left[\bar{\Delta}_r^{XT}\right]^{2}}.$$

Here, ViS2 is a single scalar that represents the overall perceived video quality degradation due to spatiotemporal dissimilarity. The lower the ViS2 value, the better the video quality. A value of ViS2=0 indicates that the distorted video has perfect quality relative to the reference video.
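A compact sketch of Eqs. (28) through (32) is given below. Note that the grouping of terms under the square roots in Eq. (32) is reconstructed from the description above, so this pooling should be treated as a sketch rather than a definitive statement of the formula.

```python
import numpy as np

def dissimilarity_map(D, P):
    """Eqs. (28)-(29): point-by-point product of the log response difference
    map and one minus the correlation map; D and P are assumed to be sampled
    on the same grid."""
    return D * (1.0 - P)

def vis2(delta_TY_maps, delta_XT_maps):
    """Pool the per-slice dissimilarity maps into the scalar ViS2 value
    (Eqs. 30-32). delta_TY_maps holds one map per vertical slice (column);
    delta_XT_maps holds one map per horizontal slice (row)."""
    rms_TY = np.array([np.sqrt(np.mean(m ** 2)) for m in delta_TY_maps])  # Eq. (30)
    rms_XT = np.array([np.sqrt(np.mean(m ** 2)) for m in delta_XT_maps])  # Eq. (31)
    return float(np.sqrt(np.mean(rms_TY ** 2)) + np.sqrt(np.mean(rms_XT ** 2)))
```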

Figure 11 shows the correlation maps Py(x,t), the log of response difference maps Dy(x,t), and the spatiotemporal dissimilarity maps Δy(x,t) computed from two pairs of specific horizontal STS images. The brighter values in the spatiotemporal dissimilarity maps Δy(x,t) in Fig. 11(e) denote the corresponding spatiotemporal regions of greater dissimilarity.

As observed from the video mc2_50fps.yuv (LIVE), the spatial distortion occurs more frequently in the middle frames. These middle frames are also heavily distorted in nearly every spatial region. This fact is well-captured by the spatiotemporal dissimilarity map in Fig. 11(e) (left). As observed in Fig. 11(e) (left), the dissimilarity map is brighter in the middle and along the entire spatial dimension. In video PartyScene_dst_09.yuv (CSIQ), the spatial distortion that occurs in the center of the video is smaller than the distortion in the surrounding area. This fact is also reflected in the spatiotemporal dissimilarity map in Fig. 11(e) (right), where the spatiotemporal dissimilarity map shows brighter surrounding regions compared to the center regions across the temporal dimension.

3.3.

Combine Spatial Distortion and Spatiotemporal Dissimilarity Values

Finally, the overall estimate of perceived video quality degradation, denoted by ViS3, is computed from the spatial distortion value ViS1 and the spatiotemporal dissimilarity value ViS2. Specifically, ViS3 is computed as a geometric mean of ViS1 and ViS2, which is given by

Eq. (33)

$$\mathrm{ViS}_3=\sqrt{\mathrm{ViS}_1\times\mathrm{ViS}_2}.$$

Here, ViS3 is a single scalar that represents the overall perceived quality degradation of the video. The smaller the ViS3 value, the better the video quality. A value of ViS3=0 indicates that the distorted video is equal in quality to the reference video.

Note that the values of ViS1 and ViS2 occupy different ranges. Thus, the use of a geometric mean in Eq. (33) allows us to combine these values without the need for custom weights (which would be required when using an arithmetic mean). Other combinations are also possible, e.g., using a weighted geometric mean with possibly adaptive weights. However, our preliminary attempts to select such weights have not yielded significant improvements (see also Sec. 4.3.4).

4.

Results

In this section, we analyze the performance of the ViS3 algorithm in predicting subjective ratings of quality on three publicly available video-quality databases. We also compare the performance of ViS3 with other quality assessment algorithms.

4.1.

Video Quality Databases

To evaluate the performance of ViS3 and other quality assessment algorithms, we used the following three publicly available video-quality databases that have multiple types of distortion:

  • 1. The LIVE video database (four types of distortion);24

  • 2. The IVPL video database (four types of distortion);62

  • 3. The CSIQ video database (six types of distortion).63

4.1.1.

LIVE video database

The LIVE video database24 developed at the University of Texas at Austin contains 10 reference videos and 150 distorted videos (15 distorted versions of each reference video). All videos are in raw YUV420 format with a resolution of 768×432 pixels, a duration of 10 s, and frame rates of 25 or 50 fps. There are four distortion types in this database: MPEG-2 compression (MPEG-2), H.264 compression (H.264), simulated transmission of H.264-compressed bit-streams through error-prone IP networks (IPPL), and simulated transmission of H.264-compressed bit-streams through error-prone wireless networks (WLPL). Three or four levels of distortion are present for each distortion type.

4.1.2.

IVPL video database

The IVPL HD video database62 developed at the Chinese University of Hong Kong consists of 10 reference videos and 128 distorted videos. All videos in this database are in raw YUV420 format with a resolution of 1920×1088 pixels, a duration of 10 s, and a frame rate of 25 fps. There are four types of distortion in this database: Dirac wavelet compression (DIRAC, three levels), H.264 compression (H.264, four levels), simulated transmission of H.264-compressed bit-streams through error-prone IP networks (IPPL, four levels), and MPEG-2 compression (MPEG-2, three levels). To reduce the computation time, we rescaled the videos to 960×544 using FFMPEG software64 with its default configuration.

4.1.3.

CSIQ video database

The CSIQ video database63 developed by the authors at Oklahoma State University consists of 12 reference videos and 216 distorted videos. All videos in this database are in raw YUV420 format with a resolution of 832×480 pixels, a duration of 10 s, and span various frame rates: 24, 25, 30, 50, and 60 fps. Each reference video has 18 distorted versions with six types of distortion; each distortion type has three different levels. The distortion types consist of four video compression distortion types [Motion JPEG (MJPEG), H.264, HEVC, and wavelet compression using the SNOW codec64] and two transmission-based distortion types [packet-loss in a simulated wireless network (WLPL) and additive white Gaussian noise (AWGN)]. The experiment was conducted following the SAMVIQ testing protocol65 with 35 subjects.

4.2.

Algorithms and Performance Measures

We compared ViS3 with PSNR25 and recent full-reference video quality assessment algorithms for which code is publicly available, VQM,12 MOVIE,4 and TQV,8 on the three video databases. PSNR was applied on a frame-by-frame basis, VQM and MOVIE were applied using their default implementations and settings, and TQV was applied using its original training parameters. For ViS3, we used a GOF size of N=8.

Before evaluating the performance of each algorithm on each video database, we applied a four-parameter logistic transform to the raw predicted scores, as recommended by the Video Quality Experts Group (VQEG) in Ref. 31. The four-parameter logistic transform is given by

Eq. (34)

$$f(x)=\frac{\tau_1-\tau_2}{1+\exp\!\left(-\dfrac{x-\tau_3}{\left|\tau_4\right|}\right)}+\tau_2,$$
where x denotes the raw predicted score and τ1, τ2, τ3, and τ4 are free parameters that are selected to provide the best fit of the predicted scores to the subjective rating scores.
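As an illustration, the logistic transform and its fitting can be realized with a standard nonlinear least-squares routine. The initial parameter guess below is a common heuristic and is an assumption of the sketch, not a value specified by the VQEG recommendation.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic4(x, t1, t2, t3, t4):
    """Four-parameter logistic transform of Eq. (34)."""
    return (t1 - t2) / (1.0 + np.exp(-(x - t3) / np.abs(t4))) + t2

def fit_logistic(raw_scores, dmos):
    """Fit tau_1..tau_4 so that the transformed objective scores best match
    the subjective scores (DMOS)."""
    raw_scores = np.asarray(raw_scores, dtype=float)
    dmos = np.asarray(dmos, dtype=float)
    p0 = [dmos.max(), dmos.min(), np.median(raw_scores),
          np.std(raw_scores) + 1e-6]                    # heuristic initial guess
    params, _ = curve_fit(logistic4, raw_scores, dmos, p0=p0, maxfev=20000)
    return logistic4(raw_scores, *params), params
```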

Following the VQEG recommendations in Ref. 31, we employed the Spearman rank-order correlation coefficient (SROCC) to measure prediction monotonicity, and we employed the Pearson linear correlation coefficient (CC) and the root mean square error (RMSE) to measure prediction accuracy. The prediction consistency of each algorithm was measured by two additional criteria: the outlier ratio (OR)5 and the outlier distance (OD).26 OR is the ratio of the number of false scores predicted by the algorithm to the total number of scores; a false score is defined as a transformed score lying outside the 95% confidence interval of the associated subjective score.5 OD indicates how far the outliers fall outside the confidence interval; it is measured as the total distance from all outliers to the closest edge points of their corresponding confidence intervals.26
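The five criteria can be computed as sketched below; the confidence-interval half-widths are assumed to be supplied externally, since their construction depends on the subjective-study data of each database.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def performance_criteria(pred, dmos, ci_half_width):
    """SROCC, CC, RMSE, outlier ratio (OR), and outlier distance (OD) for
    logistic-transformed predictions. ci_half_width holds the half-width of
    each subjective score's 95% confidence interval."""
    pred, dmos = np.asarray(pred, float), np.asarray(dmos, float)
    ci = np.asarray(ci_half_width, float)
    srocc, _ = spearmanr(pred, dmos)
    cc, _ = pearsonr(pred, dmos)
    rmse = np.sqrt(np.mean((pred - dmos) ** 2))
    lo, hi = dmos - ci, dmos + ci
    outlier = (pred < lo) | (pred > hi)
    outlier_ratio = float(outlier.mean())
    outlier_distance = float(np.sum(np.where(pred < lo, lo - pred, 0.0)
                                    + np.where(pred > hi, pred - hi, 0.0)))
    return srocc, cc, rmse, outlier_ratio, outlier_distance
```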

4.3.

Overall Performance

The performance of each algorithm on each video database is shown in Table 1 in terms of the five criteria (SROCC, CC, RMSE, OR, and OD). The best-performing algorithm is bolded, and the second best-performing algorithm is italicized. These data indicate that ViS3 is the best-performing of the compared algorithms on all three video databases in terms of SROCC, CC, and RMSE, and it is the best or tied for the best in terms of OR; in terms of OD, it is the best or tied for the best on the LIVE and CSIQ databases. The performances of ViS1 and ViS2 are also noteworthy.

Table 1

Performances of ViS3 and other algorithms on the three video databases. The best-performing algorithm is bolded and the second best-performing algorithm is italicized. Note that ViS3 is the best-performing algorithm on all three databases.

Algorithms: peak SNR (PSNR), video quality metric (VQM), motion-based video integrity evaluation (MOVIE), TQV, ViS3, ViS1, and ViS2. Criteria: Spearman rank-order correlation coefficient (SROCC), Pearson linear correlation coefficient (CC), root mean square error (RMSE), outlier ratio (OR), and outlier distance (OD).

Criterion | Database | PSNR | VQM | MOVIE | TQV | ViS3 | ViS1 | ViS2
SROCC | LIVE | 0.523 | 0.756 | 0.789 | 0.802 | 0.816 | 0.762 | 0.736
SROCC | IVPL | 0.728 | 0.845 | 0.880 | 0.701 | 0.896 | 0.872 | 0.817
SROCC | CSIQ | 0.579 | 0.789 | 0.806 | 0.814 | 0.841 | 0.757 | 0.831
CC | LIVE | 0.549 | 0.770 | 0.811 | 0.815 | 0.829 | 0.785 | 0.746
CC | IVPL | 0.723 | 0.847 | 0.879 | 0.722 | 0.896 | 0.863 | 0.823
CC | CSIQ | 0.565 | 0.769 | 0.788 | 0.795 | 0.830 | 0.739 | 0.830
RMSE | LIVE | 9.175 | 7.010 | 6.425 | 6.357 | 6.146 | 6.807 | 7.313
RMSE | IVPL | 0.730 | 0.561 | 0.504 | 0.731 | 0.470 | 0.534 | 0.601
RMSE | CSIQ | 13.724 | 10.633 | 10.231 | 10.090 | 9.273 | 11.197 | 9.279
OR | LIVE | 2.00% | 1.33% | 0% | 0% | 0% | 0% | 2.00%
OR | IVPL | 7.81% | 0.78% | 1.56% | 7.81% | 0.78% | 1.56% | 4.69%
OR | CSIQ | 12.96% | 5.09% | 4.17% | 4.63% | 3.70% | 7.41% | 3.24%
OD | LIVE | 11.479 | 5.385 | 0 | 0 | 0 | 0 | 9.076
OD | IVPL | 3.422 | 0.411 | 0.222 | 2.556 | 0.616 | 1.085 | 1.005
OD | CSIQ | 169.183 | 56.334 | 44.635 | 40.946 | 28.190 | 59.619 | 30.546

In terms of prediction monotonicity (SROCC), ViS3 is the best-performing algorithm on all three databases. On the LIVE and CSIQ databases, ViS3 and TQV are the two best-performing algorithms. On the IVPL database, ViS3 and MOVIE are the two best-performing algorithms. A similar trend in performance is observed in terms of prediction accuracy (CC and RMSE).

In terms of prediction consistency measured by OR, on the LIVE database, three algorithms (MOVIE, TQV, and ViS3) have an OR of zero, which indicates that they do not yield any outliers. On the IVPL database, both ViS3 and VQM have only one outlier. On the CSIQ database, ViS3 and MOVIE are the two algorithms with the least number of outliers.

In terms of OD, on the LIVE database, three algorithms (MOVIE, TQV, and ViS3) have an OD of zero because they do not have any outliers. On the IVPL database, MOVIE and VQM have the smallest OD values. Although ViS3, like VQM, yields only one outlier on the IVPL database, ViS3 has a larger OD because its outlier lies farther from its confidence interval. This indicates that ViS3 has a weakness on the IPPL distortion, to which the outlier belongs. Furthermore, on the CSIQ database, ViS3 and TQV yield the smallest OD values.

Observe that ViS1 and ViS2 yield different relative performances depending on the database. ViS1 shows better predictions than ViS2 on the LIVE and IVPL databases. However, ViS2 shows better predictions than ViS1 on the CSIQ database. Generally, ViS3 shows higher SROCC and CC and lower RMSE, OR, and OD than either ViS1 or ViS2 alone. Nonetheless, it may be possible to combine ViS1 and ViS2 in an adaptive fashion for even better prediction performance, and such an adaptive combination remains an area for future research.

The scatter-plots of logistic-transformed ViS3 values versus DMOS on the three databases are shown in Fig. 12. The plots show a highly correlated trend between the logistic-transformed ViS3 values and the DMOS values. For all three databases, the predictions are homoscedastic; i.e., there are generally no subpopulations of videos or distortion types for which ViS3 yields lesser or greater residual variance in the predictions. These residuals are used for the analysis of statistical significance in Sec. 4.3.3.

Fig. 12

Scatter-plots of logistic-transformed scores predicted by ViS3 versus subjective scores on the three databases. Notice that all of the plots are homoscedastic. The R values denote the correlation coefficient between the logistic-transformed scores and the subjective scores (DMOS).


4.3.1.

Performance on individual types of distortion

We measured the performance of ViS3 and other algorithms on individual types of distortion for videos from the three databases. For this analysis, we applied the logistic transform function to all predicted scores of each database, then divided the transformed scores into separate subsets according to the distortion types, and then measured the performance criteria in terms of SROCC and CC for each subset. Table 2 shows the resulting SROCC and CC values.

Table 2

Performances of ViS3 and other quality assessment algorithms measured on different types of distortion on the three video databases. The best-performing algorithm is bolded and the second best-performing algorithm is italicized.

SROCC
Database | Distortion | PSNR | VQM | MOVIE | TQV | ViS3
LIVE | WLPL | 0.621 | 0.817 | 0.811 | 0.754 | 0.845
LIVE | IPPL | 0.472 | 0.802 | 0.715 | 0.742 | 0.788
LIVE | H.264 | 0.473 | 0.686 | 0.764 | 0.769 | 0.757
LIVE | MPEG-2 | 0.383 | 0.718 | 0.772 | 0.785 | 0.730
IVPL | DIRAC | 0.860 | 0.891 | 0.888 | 0.786 | 0.926
IVPL | H.264 | 0.866 | 0.862 | 0.823 | 0.672 | 0.876
IVPL | IPPL | 0.711 | 0.650 | 0.858 | 0.629 | 0.807
IVPL | MPEG-2 | 0.738 | 0.791 | 0.823 | 0.557 | 0.834
CSIQ | H.264 | 0.802 | 0.919 | 0.897 | 0.955 | 0.920
CSIQ | WLPL | 0.851 | 0.801 | 0.886 | 0.842 | 0.856
CSIQ | MJPEG | 0.509 | 0.647 | 0.887 | 0.870 | 0.789
CSIQ | SNOW | 0.759 | 0.874 | 0.900 | 0.831 | 0.908
CSIQ | AWGN | 0.906 | 0.884 | 0.843 | 0.908 | 0.928
CSIQ | HEVC | 0.785 | 0.906 | 0.933 | 0.902 | 0.917

CC
Database | Distortion | PSNR | VQM | MOVIE | TQV | ViS3
LIVE | WLPL | 0.657 | 0.812 | 0.839 | 0.777 | 0.846
LIVE | IPPL | 0.497 | 0.800 | 0.761 | 0.794 | 0.816
LIVE | H.264 | 0.571 | 0.703 | 0.790 | 0.788 | 0.773
LIVE | MPEG-2 | 0.395 | 0.737 | 0.757 | 0.794 | 0.746
IVPL | DIRAC | 0.878 | 0.898 | 0.870 | 0.811 | 0.936
IVPL | H.264 | 0.855 | 0.869 | 0.845 | 0.744 | 0.898
IVPL | IPPL | 0.673 | 0.642 | 0.842 | 0.735 | 0.802
IVPL | MPEG-2 | 0.718 | 0.836 | 0.824 | 0.533 | 0.912
CSIQ | H.264 | 0.835 | 0.916 | 0.904 | 0.965 | 0.918
CSIQ | WLPL | 0.802 | 0.806 | 0.882 | 0.784 | 0.850
CSIQ | MJPEG | 0.460 | 0.641 | 0.882 | 0.871 | 0.800
CSIQ | SNOW | 0.769 | 0.840 | 0.898 | 0.846 | 0.908
CSIQ | AWGN | 0.949 | 0.918 | 0.855 | 0.930 | 0.916
CSIQ | HEVC | 0.805 | 0.915 | 0.937 | 0.913 | 0.933

In general, VQM, MOVIE, and ViS3 all perform well on the WLPL distortion; these three algorithms show competitive and consistent performance on the WLPL distortion for both the LIVE and CSIQ databases. For the H.264 compression distortion, ViS3 and MOVIE perform well and consistently across all subsets of H.264 videos on all three databases. ViS3 and MOVIE are also competitive on the MPEG-2 compression distortion and the IPPL distortion on both the LIVE and IVPL databases.

In particular, on the LIVE database, ViS3 has the best performance on the WLPL distortion; VQM and ViS3 have the best performance on the IPPL distortion; ViS3, MOVIE, and TQV are the three best-performing algorithms on the H.264 compression distortion; and TQV and MOVIE are the two best-performing algorithms on the MPEG-2 compression distortion.

The low performance of the ViS3 algorithm on the H.264 and MPEG-2 compression types in the LIVE video database is due to outliers corresponding to specific videos, as shown in Fig. 13; the outliers are marked by the red square markers. For H.264, the outliers correspond to the video riverbed, where the water’s movement significantly masks the blurring imposed by the compression. However, ViS3 underestimates this masking and, thus, overestimates the DMOS. For MPEG-2, the sunflower seeds in the video sunflower generally impose significant masking of the MPEG-2 blocking artifacts. However, there are select frames in this video in which the blocking artifacts become highly visible (owing perhaps to failed motion compensation), yet ViS3 does not accurately capture the visibility of these artifacts and, thus, underestimates the DMOS. These types of interactions between the videos and distortions are issues that certainly warrant future research.

Fig. 13

Scatter-plots of logistic-transformed scores predicted by ViS3 versus subjective scores on the H.264 and MPEG-2 distortion of the LIVE database. The second row shows representative frames of the two videos corresponding to the outliers (red square markers in the plots).


On the IVPL database, ViS3 yields the best performance on three types of distortion (DIRAC, H.264, and MPEG-2); ViS3 yields the second best performance on the IPPL distortion, on which MOVIE is the best-performing algorithm. VQM and MOVIE are the second best-performing algorithms on the MPEG-2 distortion. PSNR, VQM, and MOVIE are also competitive on both the DIRAC and H.264 distortion.

On the CSIQ database, TQV and ViS3 are the two best-performing algorithms on the H.264 compression distortion; ViS3 and MOVIE are the two best-performing algorithms on three types of distortion (WLPL, SNOW, and HEVC); and MOVIE and TQV are the two best-performing algorithms on the MJPEG distortion. On the AWGN distortion, ViS3 and TQV are competitive with PSNR, which is well known to perform well for white noise.

Generally, ViS3 excels on the H.264 compression distortion and the wavelet-based compression distortion (DIRAC, SNOW), and ViS3, VQM, and MOVIE excel on the WLPL distortion. ViS3 also performs well on the MPEG-2, HEVC, and AWGN distortion. However, ViS3 does not perform well on the MJPEG compression distortion compared to MOVIE and TQV.

4.3.2.

Performance with different GOF sizes

As we mentioned in Sec. 3.1, for ViS1, the size of the GOF used in Eqs. (9), (11), and (12) is a user-selectable parameter (N). The results presented in the previous subsection were obtained with a GOF size of N=8. To investigate how the prediction performance varies with different GOF sizes, we computed SROCC and CC values for ViS1 and ViS3 using values of N ranging from 4 to 16. The results of this analysis are listed in Table 3.

Table 3

Performances of ViS3 on the three video databases with different group-of-frames (GOF) sizes. Note that ViS3 is robust to changes in the GOF size on all three databases.

ViS1
Criterion | Database | N=4 | N=6 | N=8 | N=10 | N=12 | N=16
SROCC | LIVE | 0.754 | 0.759 | 0.762 | 0.767 | 0.770 | 0.768
SROCC | IVPL | 0.868 | 0.871 | 0.872 | 0.871 | 0.873 | 0.874
SROCC | CSIQ | 0.751 | 0.753 | 0.757 | 0.758 | 0.759 | 0.760
CC | LIVE | 0.778 | 0.783 | 0.785 | 0.789 | 0.791 | 0.793
CC | IVPL | 0.860 | 0.862 | 0.863 | 0.865 | 0.866 | 0.868
CC | CSIQ | 0.733 | 0.736 | 0.739 | 0.740 | 0.742 | 0.743

ViS3
Criterion | Database | N=4 | N=6 | N=8 | N=10 | N=12 | N=16
SROCC | LIVE | 0.818 | 0.817 | 0.816 | 0.814 | 0.813 | 0.812
SROCC | IVPL | 0.897 | 0.897 | 0.896 | 0.897 | 0.897 | 0.896
SROCC | CSIQ | 0.840 | 0.840 | 0.841 | 0.841 | 0.841 | 0.841
CC | LIVE | 0.833 | 0.831 | 0.829 | 0.828 | 0.827 | 0.825
CC | IVPL | 0.896 | 0.896 | 0.896 | 0.896 | 0.897 | 0.896
CC | CSIQ | 0.829 | 0.829 | 0.830 | 0.830 | 0.830 | 0.830

As shown in the upper portion of Table 3, the performance of ViS1 tends to increase with larger values of N. This trend may partially be attributable to the fact that a larger GOF size can give rise to a more accurate estimate of the motion and, thus, perhaps a more accurate account of the temporal masking. Nonetheless, as demonstrated in the lower portion of Table 3, ViS3 is relatively robust to small changes in N. The choice of N=8 generally provides good performance on all three databases. However, the optimal choice of N remains an open research question.

4.3.3.

Statistical significance

To assess the statistical significance of differences in performances of ViS3 and other algorithms, we used an F-test to compare the variances of the residuals (errors) of the algorithms’ predictions.66 If the distribution of residuals is sufficiently Gaussian, an F-test can be used to determine the probability that the residuals are drawn from different distributions and are thus statistically different.

To determine whether the residuals of an algorithm have a Gaussian distribution, we performed the Jarque–Bera (JB) test (see Ref. 58) on the residuals to obtain the JBSTAT value. If the JBSTAT value is smaller than the critical value, the distribution of residuals can be considered Gaussian; if the JBSTAT value is greater than the critical value, the distribution of residuals is not Gaussian. The JB test results show that, for the LIVE database, all of the algorithms pass the JB test and their residuals have Gaussian distributions. On the IVPL database, only PSNR does not pass the JB test. On the CSIQ database, only VQM and ViS3 pass the JB test.

We performed an F-test with 95% confidence to compare the residual variances of the algorithms whose distributions of residuals are significantly Gaussian. If the variances are significantly different, we conclude that the two algorithms are significantly different. The smaller the variance of residuals, the better the prediction performance of the algorithm.
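A minimal sketch of this procedure, using the Jarque–Bera test and a two-sided F-test on the residual variances, is given below; the two-sided formulation is stated here as an assumption of the sketch, while the 95% confidence level follows the text above.

```python
import numpy as np
from scipy.stats import f, jarque_bera

def residuals_are_gaussian(residuals, alpha=0.05):
    """Jarque-Bera test on the residuals; Gaussianity is not rejected when
    the p-value is at least alpha."""
    _, p_value = jarque_bera(residuals)
    return p_value >= alpha

def compare_residual_variances(res_col, res_row, alpha=0.05):
    """Two-sided F-test on residual variances. Returns '+' if the column
    algorithm has significantly smaller residual variance than the row
    algorithm, '-' if significantly larger, and '0' otherwise."""
    var_col = np.var(res_col, ddof=1)
    var_row = np.var(res_row, ddof=1)
    F = var_col / var_row
    dfc, dfr = len(res_col) - 1, len(res_row) - 1
    p = 2.0 * min(f.cdf(F, dfc, dfr), 1.0 - f.cdf(F, dfc, dfr))
    if p >= alpha:
        return '0'
    return '+' if var_col < var_row else '-'
```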

Table 4 shows the F-test results between each pair of algorithms whose distributions of residuals are significantly Gaussian. A “0” value implies that the residual variances of the two algorithms are not significantly different. A “+” sign implies that the algorithm indicated by the column has significantly smaller residual variance than the algorithm indicated by the row and, therefore, better performance. A “−” sign implies that the algorithm indicated by the column has significantly larger residual variance than the algorithm indicated by the row and, therefore, worse performance.

Table 4

Statistical significance relationship between each pair of algorithms on the three video databases. A “0” value implies that the residual variances of the algorithm indicated by the column and the algorithm indicated by the row are not significantly different. A “+” sign implies that the algorithm indicated by the column has significantly smaller residual variance than the algorithm indicated by the row. A “−” sign implies that the algorithm indicated by the column has significantly larger residual variance than the algorithm indicated by the row.

LIVE
Row \ Column | PSNR | VQM | MOVIE | TQV | ViS3
PSNR |  | + | + | + | +
VQM | − |  | 0 | 0 | 0
MOVIE | − | 0 |  | 0 | 0
TQV | − | 0 | 0 |  | 0
ViS3 | − | 0 | 0 | 0 |  

IVPL
Row \ Column | VQM | MOVIE | TQV | ViS3
VQM |  | 0 | − | +
MOVIE | 0 |  | − | 0
TQV | + | + |  | +
ViS3 | − | 0 | − |  

CSIQ
Row \ Column | VQM | ViS3
VQM |  | +
ViS3 | − |  

As seen from Table 4, on the LIVE database, the variance of residuals yielded by PSNR is significantly larger than the variances of residuals yielded by the other algorithms, and therefore, PSNR is significantly worse than the other algorithms. The difference in residuals of ViS3 and either of VQM, MOVIE, or TQV is not statistically significant. On the IVPL database, the variance of residuals yielded by TQV is significantly larger than the variances of residuals yielded by VQM, MOVIE, and ViS3, and therefore, VQM, MOVIE, and ViS3 are significantly better than TQV on this database. On both IVPL and CSIQ databases, the variance of residuals yielded by VQM is significantly larger than the variance of residuals yielded by ViS3, and therefore, ViS3 is significantly better than VQM on these databases.

Although ViS3 is not significantly better than MOVIE on any of the three databases, it should be noted that MOVIE is not significantly better than VQM on any of the three databases, whereas ViS3 is significantly better than VQM on the IVPL and CSIQ databases. Moreover, MOVIE requires more computation time than ViS3. Specifically, using a modern computer (Intel Quad Core at 2.66 GHz, 12 GB RAM DDR2 at 6400 MHz, Windows 7 Pro 64-bit, MATLAB® R2011b) to estimate the quality of a 10-s video of size 352×288 (300 frames total), MOVIE requires 200 min, whereas basic MATLAB® implementations of VQM and ViS3 require 1 and 7 min, respectively.

4.3.4.

Summary, limitations, and future work

Through testing on various video-quality databases, we have demonstrated that ViS3 performs well in predicting video quality. It not only excels at VQA for whole databases with varying types of distortion and varying distortion levels, but also performs well on videos with a specific type of distortion. Our performance evaluation demonstrates that ViS3 is either better than or statistically tied with current state-of-the-art VQA algorithms. A statistical analysis also shows that ViS3 is significantly better than PSNR, VQM, and TQV in predicting the qualities of videos from specific databases.

Yet, ViS3 is not without its limitations. One important limitation concerns the potentially large memory requirements for long videos. The STS images of a long video can require a prohibitively large width or height for the dimension corresponding to time. In this case, one solution would be to divide the video into smaller chunks across time, where each chunk has a length of 500 to 600 frames. The final result can then be estimated as the mean of the ViS3 values computed for each chunk.
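A minimal sketch of this chunking work-around is given below; the function vis3_fn is a hypothetical callable standing in for a routine that scores a single chunk.

```python
import numpy as np

def vis3_long_video(ref_frames, dst_frames, vis3_fn, chunk_len=500):
    """Score a long video by averaging ViS3 over temporal chunks of roughly
    500 frames, as described above. vis3_fn is a placeholder for a routine
    that returns the ViS3 value of a pair of (T, H, W) frame arrays."""
    scores = []
    for start in range(0, len(ref_frames), chunk_len):
        ref_chunk = ref_frames[start:start + chunk_len]
        dst_chunk = dst_frames[start:start + chunk_len]
        if len(ref_chunk) > 0:
            scores.append(vis3_fn(ref_chunk, dst_chunk))
    return float(np.mean(scores))
```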

Another limitation of ViS3 is that it currently takes into account only the luminance component of the video. Further improvements may be realized by also considering degradations in chrominance. Another possible improvement might be realized by employing a more accurate pooling model of the spatiotemporal responses used in the spatiotemporal dissimilarity stage.

Equation (33) gives the same weight to the spatial distortion and spatiotemporal dissimilarity values. However, it would seem possible to adaptively combine the two values in a way that more accurately reflects the visual contribution of each degradation to the overall quality degradation. Our preliminary attempts to select the weights based on the video motion magnitudes, the difference in motion, or the variance of spatial distortion have not yielded significant improvements. We are currently conducting a psychophysical study to better understand if and how the spatial distortion and spatiotemporal dissimilarity values should be adaptively combined.

The incorporation of visual-attention modeling is another avenue for potential improvements. Some studies have shown that visual attention can be useful for quality assessment (e.g., Refs. 39, 67, and 68; see also Ref. 69). One possible technique for incorporating such data into ViS3 would be to weight the maps generated during the computation of both ViS1 and ViS2 based on estimates of visual gaze data or regions of interest in both space and time. Another interesting avenue of future research would be to compare the ViS1 and ViS2 maps with gaze data to identify any existing relationships and, perhaps, determine techniques for predicting gaze data based on the STS images.

5.

Conclusions

In this paper, we have presented a VQA algorithm, ViS3, that analyzes various two-dimensional space-time slices of the video to estimate perceived video quality degradation via two different stages. The first stage of the algorithm adaptively applies two strategies in the MAD algorithm to groups of video frames to estimate perceived video quality degradation due to spatial distortion. An optical-flow-based weighting scheme is used to model the effect of motion on the visibility of distortion. The second stage of the algorithm measures spatiotemporal correlation and applies an HVS-based model to the STS images to estimate perceived video quality degradation due to spatiotemporal dissimilarity. The overall estimate of perceived video quality degradation is given as the geometric mean of the two measurements obtained from the two stages. The ViS3 algorithm has been shown to perform well in predicting quality of videos from the LIVE database,24 the IVPL database,62 and the CSIQ database.63 Statistically significant improvements in predicting subjective ratings are achieved in comparison to a variety of existing VQA algorithms. The online supplement to this paper is available in Ref. 60.

Acknowledgments

This material is based upon work supported by the National Science Foundation Awards 0917014 and 1054612, and by the U.S. Army Research Laboratory and the U.S. Army Research Office under contract/grant number W911NF-10-1-0015.

References

1. 

B. Girod, Digital Images and Human Vision, MIT Press, Cambridge, Massachusetts (1993). Google Scholar

2. 

A. M. EskiciogluP. S. Fisher, “Image quality measures and their performance,” IEEE Trans. Commun., 43 (12), 2959 –2965 (1995). http://dx.doi.org/10.1109/26.477498 IECMBT 0090-6778 Google Scholar

3. 

B. A. Wandell, Foundations of Vision, Sinauer Associates, Sunderland, Massachusetts (1995). Google Scholar

4. 

K. SeshadrinathanA. Bovik, “Motion tuned spatio-temporal quality assessment of natural videos,” IEEE Trans. Image Process., 19 (2), 335 –350 (2010). http://dx.doi.org/10.1109/TIP.2009.2034992 IIPRE4 1057-7149 Google Scholar

5. 

Z. WangL. LuA. C. Bovik, “Video quality assessment based on structural distortion measurement,” Signal Process.: Image Commun., 19 (2), 121 –132 (2004). http://dx.doi.org/10.1016/S0923-5965(03)00076-6 SPICEF 0923-5965 Google Scholar

6. 

Z. WangQ. Li, “Video quality assessment using a statistical model of human visual speed perception,” J. Opt. Soc. Am. A, 24 (12), B61 –B69 (2007). http://dx.doi.org/10.1364/JOSAA.24.000B61 JOAOD6 0740-3232 Google Scholar

7. 

H. R. SheikhA. C. Bovik, “A visual information fidelity approach to video quality assessment,” in First Int. Workshop on Video Processing and Quality Metrics for Consumer Electronics, 23 –25 (2005). Google Scholar

8. 

M. NarwariaW. LinA. Liu, “Low-complexity video quality assessment using temporal quality variations,” IEEE Trans. Multimedia, 14 (3), 525 –535 (2012). http://dx.doi.org/10.1109/TMM.2012.2190589 ITMUF8 1520-9210 Google Scholar

9. 

Z. Wanget al., “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Process., 13 (4), 600 –612 (2004). http://dx.doi.org/10.1109/TIP.2003.819861 IIPRE4 1057-7149 Google Scholar

10. 

H. R. SheikhA. C. Bovik, “Image information and visual quality,” IEEE Trans. Image Process., 15 (2), 430 –444 (2006). http://dx.doi.org/10.1109/TIP.2005.859378 IIPRE4 1057-7149 Google Scholar

11. 

S. WolfM. Pinson, “In-service performance metrics for MPEG-2 video systems,” in Measurement Techniques of the Digital Age Technical Seminar, 12 –13 (1998). Google Scholar

12. 

M. H. PinsonS. Wolf, “A new standardized method for objectively measuring video quality,” IEEE Trans. Broadcast., 50 (3), 312 –322 (2004). http://dx.doi.org/10.1109/TBC.2004.834028 IETBAC 0018-9316 Google Scholar

13. 

Y. Wanget al., “Novel spatio-temporal structural information based video quality metric,” IEEE Trans. Circuits Syst. Video Technol., 22 (7), 989 –998 (2012). http://dx.doi.org/10.1109/TCSVT.2012.2186745 ITCTEM 1051-8215 Google Scholar

14. 

D. E. Pearson, “Variability of performance in video coding,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 5 –8 (1997). Google Scholar

15. 

D. E. Pearson, “Viewer response to time-varying video quality,” Proc. SPIE, 3299 16 –25 (1998). http://dx.doi.org/10.1117/12.320109 PSISDG 0277-786X Google Scholar

16. 

K. SeshadrinathanA. C. Bovik, “Temporal hysteresis model of time varying subjective video quality,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 1153 –1156 (2011). Google Scholar

17. 

M. A. MasryS. S. Hemami, “A metric for continuous quality evaluation of compressed video with severe distortions, signal processing,” Signal Process.: Image Commun., 19 (2), 133 –146 (2004). http://dx.doi.org/10.1016/j.image.2003.08.001 SPICEF 0923-5965 Google Scholar

18. 

M. Barkowskyet al., “Perceptually motivated spatial and temporal integration of pixel based video quality measures,” in Welcome to Mobile Content Quality of Experience, 4:1 –4:7 (2007). Google Scholar

19. 

A. Ninassiet al., “Considering temporal variations of spatial visual distortions in video quality assessment,” IEEE J. Sel. Topics Signal Process., 3 (2), 253 –265 (2009). http://dx.doi.org/10.1109/JSTSP.2009.2014806 1932-4553 Google Scholar

20. 

C. NgoT. PongH. Zhang, “On clustering and retrieval of video shots through temporal slices analysis,” IEEE Trans. Multimedia, 4 (4), 446 –458 (2002). http://dx.doi.org/10.1109/TMM.2002.802022 ITMUF8 1520-9210 Google Scholar

21. 

C. NgoT. PongH. Zhang, “Motion analysis and segmentation through spatio-temporal slices processing,” IEEE Trans. Image Process., 12 (3), 341 –355 (2003). http://dx.doi.org/10.1109/TIP.2003.809020 IIPRE4 1057-7149 Google Scholar

22. 

A. B. WatsonA. J. Ahumada, “Model of human visual-motion sensing,” J. Opt. Soc. Am. A, 2 (2), 322 –341 (1985). http://dx.doi.org/10.1364/JOSAA.2.000322 JOAOD6 0740-3232 Google Scholar

23. 

E. H. AdelsonJ. R. Bergen, “Spatiotemporal energy models for the perception of motion,” J. Opt. Soc. Am. A, 2 (2), 284 –299 (1985). http://dx.doi.org/10.1364/JOSAA.2.000284 JOAOD6 0740-3232 Google Scholar

24. 

K. Seshadrinathanet al., “Study of subjective and objective quality assessment of video,” IEEE Trans. Image Process., 19 (6), 1427 –1441 (2010). http://dx.doi.org/10.1109/TIP.2010.2042111 IIPRE4 1057-7149 Google Scholar

25. 

“Objective video quality measurement using a peak-signal-to-noise-ratio (PSNR) full reference technique,” (2001). Google Scholar

26. 

E. LarsonD. Chandler, “Most apparent distortion: full-reference image quality assessment and the role of strategy,” J. Electron. Imaging, 19 (1), 011006 (2010). http://dx.doi.org/10.1117/1.3267105 JEIME5 1017-9909 Google Scholar

27. 

A. B. WatsonA. J. Ahumada, “A look at motion in the frequency domain,” (1983). Google Scholar

28. 

P. V. VuC. T. VuD. M. Chandler, “A spatiotemporal most-apparent-distortion model for video quality assessment,” in IEEE Int. Conf. on Image Processing, 2505 –2508 (2011). Google Scholar

29. 

P. V. VuD. M. Chandler, “Video quality assessment based on motion dissimilarity,” in Seventh Int. Workshop on Video Processing and Quality Metrics for Consumer Electronics, (2013). Google Scholar

30. 

S. Chikkeruret al., “Objective video quality assessment methods: a classification, review, and performance comparison,” IEEE Trans. Broadcasting, 57 (2), 165 –182 (2011). http://dx.doi.org/10.1109/TBC.2011.2104671 IETBAC 0018-9316 Google Scholar

31. 

“Final report from the video quality experts group on the validation of objective models of video quality assessment, Phase II,” (2003). Google Scholar

32. 

L. Luet al., “Full-reference video quality assessment considering structural distortion and no-reference quality evaluation of MPEG video,” in IEEE Int. Conf. on Multimedia and Expo, 61 –64 (2002). Google Scholar

33. 

P. TaoA. M. Eskicioglu, “Video quality assessment using M-SVD,” Proc. SPIE, 6494 649408 (2007). http://dx.doi.org/10.1117/12.696142 PSISDG 0277-786X Google Scholar

34. 

A. Pessoaet al., “Video quality assessment using objective parameters based on image segmentation,” SMPTE J., 108 (12), 865 –872 (1999). http://dx.doi.org/10.5594/J04308 SMPJDF 0036-1682 Google Scholar

35. 

J. Okamotoet al., “Proposal for an objective video quality assessment method that takes temporal and spatial information into consideration,” Electron. Commun. Jpn., 89 (12), 97 –108 (2006). http://dx.doi.org/10.1002/(ISSN)1520-6424 ECOJAL 0424-8368 Google Scholar

36. 

S. O. LeeD. G. Sim, “New full-reference visual quality assessment based on human visual perception,” in Int. Conf. on Consumer Electronics, 1 –2 (2008). Google Scholar

37. 

M. Barkowskyet al., “Temporal trajectory aware video quality measure,” IEEE J. Sel. Topics Signal Process., 3 (2), 266 –279 (2009). http://dx.doi.org/10.1109/JSTSP.2009.2015375 1932-4553 Google Scholar

38. 

A. BhatI. RichardsonS. Kannangara, “A new perceptual quality metric for compressed video,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 933 –936 (2009). Google Scholar

39. 

U. Engelkeet al., “Modelling saliency awareness for objective video quality assessment,” in Second Int. Workshop on Quality of Multimedia Experience, 212 –217 (2010). Google Scholar

40. 

X. Guet al., “Region of interest weighted pooling strategy for video quality metric,” Telecommun. Syst., 49 (1), 63 –73 (2012). http://dx.doi.org/10.1007/s11235-010-9353-8 TESYEV 1018-4864 Google Scholar

41. 

M. NarwariaW. Lin, “Scalable image quality assessment based on structural vectors,” in IEEE Int. Workshop on Multimedia Signal Processing, 1 –6 (2009). Google Scholar

42. 

A. A. StockerE. P. Simoncelli, “Noise characteristics and prior expectations in human visual speed perception,” Nat. Neurosci., 9 (4), 578 –585 (2006). http://dx.doi.org/10.1038/nn1669 NANEFN 1097-6256 Google Scholar

43. 

Z. WangE. SimoncelliA. Bovik, “Multiscale structural similarity for image quality assessment,” in Conf. Record of the Thirty-Seventh Asilomar Conf. on Signals, Systems and Computers, 1398 –1402 (2003). Google Scholar

44. 

F. LukasZ. Budrikis, “Picture quality prediction based on a visual model,” IEEE Trans. Commun., 30 (7), 1679 –1692 (1982). http://dx.doi.org/10.1109/TCOM.1982.1095616 IECMBT 0090-6778 Google Scholar

45. 

A. Bassoet al., “Study of MPEG-2 coding performance based on a perceptual quality metric,” in Proc. of Picture Coding Symp. 1996, 263 –268 (1996). Google Scholar

46. 

C. J. van den Branden Lambrecht, “Color moving pictures quality metric,” in IEEE Int. Conf. on Image Process., 885 –888 (1996). Google Scholar

47. 

P. LindhC. J. van den Branden Lambrecht, “Efficient spatio-temporal decomposition for perceptual processing of video sequences,” in IEEE Int. Conf. on Image Processing, 331 –334 (1996). Google Scholar

48. 

A. Hekstraet al., “PVQM—a perceptual video quality measure,” Signal Process.: Image Commun., 17 (10), 781 –798 (2002). http://dx.doi.org/10.1016/S0923-5965(02)00056-5 SPICEF 0923-5965 Google Scholar

49. 

A. B. WatsonJ. HuJ. F. McGowan, “Digital video quality metric based on human vision,” J. Electron. Imaging, 10 (1), 20 –29 (2001). http://dx.doi.org/10.1117/1.1329896 JEIME5 1017-9909 Google Scholar

50. 

C. LeeO. Kwon, “Objective measurements of video quality using the wavelet transform,” Opt. Eng., 42 (1), 265 –272 (2003). http://dx.doi.org/10.1117/1.1523420 OPEGAR 0091-3286 Google Scholar

51. 

E. Onget al., “Video quality metric for low bitrate compressed videos,” in IEEE Int. Conf. on Image Processing, 3531 –3534 (2004). Google Scholar

52. 

E. Onget al., “Colour perceptual video quality metric,” in IEEE Int. Conf. on Image Processing, III-1172-5 (2005). Google Scholar

53. 

M. MasryS. HemamiY. Sermadevi, “A scalable wavelet-based video distortion metric and applications,” IEEE Trans. Circuits Syst. Video Technol., 16 (2), 260 –273 (2006). http://dx.doi.org/10.1109/TCSVT.2005.861946 ITCTEM 1051-8215 Google Scholar

54. 

P. Ndjiki-NyaM. BarradoT. Wiegand, “Efficient full-reference assessment of image and video quality,” in IEEE Int. Conf. on Image Processing, II-125 –II-128 (2007). Google Scholar

55. 

S. LiL. MaK. N. Ngan, “Full-reference video quality assessment by decoupling detail losses and additive impairments,” IEEE Trans. Circuits Syst. Video Technol., 22 (7), 1100 –1112 (2012). http://dx.doi.org/10.1109/TCSVT.2012.2190473 ITCTEM 1051-8215 Google Scholar

56. 

P. C. TeoD. J. Heeger, “Perceptual image distortion,” in IEEE Int. Conf. on Image Processing, 982 –986 (1994). Google Scholar

57. 

S. Péchardet al., “A new methodology to estimate the impact of H.264 artefacts on subjective video quality,” in Third Int. Workshop on Video Processing and Quality Metrics, (2007). Google Scholar

58. 

D. ChandlerS. Hemami, “VSNR: a wavelet-based visual signal-to-noise ratio for natural images,” IEEE Trans. Image Process., 16 (9), 2284 –2298 (2007). http://dx.doi.org/10.1109/TIP.2007.901820 IIPRE4 1057-7149 Google Scholar

59. 

B. D. LucasT. Kanade, “An iterative image registration technique with an application to stereo vision,” in 7th Int. Joint Conf. on Artificial Intelligence, 674 –679 (1981). Google Scholar

60. 

P. VuD. Chandler, “Online supplement: ViS3: an algorithm for video quality assessment via analysis of spatial and spatiotemporal slices,” (2013), http://vision.okstate.edu/vis3/ (accessed December 2013). Google Scholar

61. 

J. G. Robson, “Spatial and temporal contrast-sensitivity functions of the visual system,” J. Opt. Soc. Am., 56 (8), 1141 –1142 (1966). http://dx.doi.org/10.1364/JOSA.56.001141 JOSAAH 0030-3941 Google Scholar

62. 

Image & Video Processing Laboratory, The Chinese University of Hong Kong, “IVP subjective quality video database,” (2012), http://ivp.ee.cuhk.edu.hk/research/database/subjective/index.shtml (accessed April 2012). Google Scholar

63. 

Laboratory of Computational Perception & Image Quality, Oklahoma State University, “CSIQ video database,” (2013), http://vision.okstate.edu/csiq/ (accessed November 2012). Google Scholar

64. 

F. Bellard et al., “FFMPEG tool,” (2012), http://www.ffmpeg.org (accessed November 2012). Google Scholar

65. 

“Subjective quality of internet video codec phase II evaluations using SAMVIQ,” (2005). Google Scholar

66. 

H. R. SheikhM. F. SabirA. C. Bovik, “A statistical evaluation of recent full reference image quality assessment algorithms,” IEEE Trans. Image Process., 15 (11), 3440 –3451 (2006). http://dx.doi.org/10.1109/TIP.2006.881959 IIPRE4 1057-7149 Google Scholar

67. 

U. EngelkeV. X. NguyenH. Zepernick, “Regional attention to structural degradations for perceptual image quality metric design,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 869 –872 (2008). Google Scholar

68. 

J. Youet al., “Perceptual quality assessment based on visual attention analysis,” in Proc. of the 17th ACM Int. Conf. on Multimedia, 561 –564 (2009). Google Scholar

69. 

O. L. Meuret al., “Overt visual attention for free-viewing and quality assessment tasks: impact of the regions of interest on a video quality metric,” Signal Process.: Image Commun., 25 (7), 547 –558 (2010). http://dx.doi.org/10.1016/j.image.2010.05.006 SPICEF 0923-5965 Google Scholar

Biography

Phong V. Vu received his BE in telecommunications engineering from the Posts and Telecommunications Institute of Technologies, Hanoi, Vietnam, in 2004. He is currently working toward his PhD degree in the School of Electrical and Computer Engineering, Oklahoma State University, Stillwater, Oklahoma. His research interests include image and video processing, image and video quality assessment, and computational modeling of visual perception.

Damon M. Chandler received his BS degree in biomedical engineering from Johns Hopkins University, Baltimore, Maryland, in 1998, and his MEng, MS, and PhD degrees in electrical engineering from Cornell University, Ithaca, New York, in 2000, 2004, and 2005, respectively. He is currently an associate professor in the School of Electrical and Computer Engineering at Oklahoma State University, Stillwater, Oklahoma, where he heads the Laboratory of Computational Perception and Image Quality.

CC BY: © The Authors. Published by SPIE under a Creative Commons Attribution 4.0 Unported License. Distribution or reproduction of this work in whole or in part requires full attribution of the original publication, including its DOI.
Phong V. Vu and Damon M. Chandler "ViS3: an algorithm for video quality assessment via analysis of spatial and spatiotemporal slices," Journal of Electronic Imaging 23(1), 013016 (4 February 2014). https://doi.org/10.1117/1.JEI.23.1.013016
Published: 4 February 2014
KEYWORDS
Video

Distortion

Databases

Image filtering

Image quality

Visualization

Video compression

