
1 Introduction

Image Quality Assessment (IQA) is a very active topic of research, and even though today’s IQA algorithms predict quality for a variety of images and distortion types remarkably well, there is still a lack of understanding about the way humans perceive artifacts in images [1]. In particular, one of the unsolved questions regards the interaction between distortions and image content. Humans rate the quality of highly or slightly distorted images more easily than images in the intermediate range of distortion. In the latter case, the task becomes more difficult because the interactions between distortions and image content are more severe.

Larson and Chandler [2] claim that our visual system uses different strategies to evaluate image quality depending on the signal-distortion ratio. In the high quality regime, the visual system attempts to look for distortions in the presence of the image content, whereas in the low quality regime, the visual system attempts to look for image content in the presence of the distortions. Based on this hypothesis, the authors propose a Full Reference (FR) method which attempts to explicitly model these two separate strategies.

Objective image quality assessment is mainly concerned with measuring the presence of distortions. Humans, while scoring the quality of images, are not always able to disregard factors unrelated to the distortions themselves, such as aesthetics or image semantics [3]. When these different aspects concur in generating the final subjective ratings, objective metrics that measure only distortions may not properly predict human judgments.

The effect of content dependency on objective image quality metrics has been previously considered in the literature. For example, the authors in [4] addressed the problem of scene dependency and scene susceptibility in image quality assessment. They proposed image analysis as a means to group test scenes according to basic inherent scene properties that human observers refer to when they judge the quality of images. Experimental work was carried out for JPEG and JPEG2000 distortions. Oh et al. [5] analyzed the degree of correlation between scene descriptors (first- and second-order statistical measurements) and scene susceptibility parameters for noisiness and sharpness. Recently, Bondzulic et al. [6] analyzed the performance of the Peak Signal-to-Noise Ratio (PSNR) metric for video quality assessment as a function of the video content. They showed that, for a fixed content, the variation of the PSNR is a reliable indicator of the subjective quality of video streaming.

Attempts to improve the reliability of objective metrics also involve taking into account the visual attention of the human visual system [7]. For example, Liu et al. [8] observed that adding Natural Scene Saliency (NSS) obtained from eye-tracking data can improve the performance of objective metrics, and investigated the dependency of this improvement on the image content. The authors demonstrated that the variation in NSS between participants largely depends on the visual content.

In this paper we investigate the interference between distortion and image content in human quality perception. Image content here refers to image complexity described in terms of low-level features. Our working hypothesis is that the correlation between subjective and predicted scores can be improved if the regression is performed within a group of images of similar complexity. In a preliminary study [9], we showed how the correlation between No Reference (NR) metrics for JPEG distortion and subjective scores improves when image complexity and frequency analysis are taken into account. Here we examine this topic in depth, considering FR metrics and different types of distortions.

In this work we also take into account multiply distorted images. To this end the LIVE multi-distortion (LIVE-MD) database [10] is considered, together with two well-known databases of single distortions: the LIVE [11] and the CSIQ [12] databases. We present an extensive analysis of this topic with respect to 17 state-of-the-art FR metrics.

In Sect. 2 we present the proposed complexity grouping strategy, which is based on a fuzzy clustering algorithm, while in Sect. 3 we present and comment on the results of our analysis of the FR metrics’ performance on the three datasets.

2 Grouping Images by Image Complexity

Our proposal is to first categorize the images within one of the following complexity groups: low, medium or high complexity, and then to perform the regression taking into account this grouping strategy.

There exists no unique definition of the complexity of an image. Researchers from various fields have proposed different measures to estimate image complexity. Visual or image complexity can be analyzed using mathematical treatments based on algorithmic information theory or Kolmogorov complexity theory. Image complexity is also related to aesthetics [13]. From the experiments of Oliva et al. [14], a multi-dimensional representation of visual complexity (quantity of objects, clutter, openness, symmetry, organization, variety of colors) was proposed. Fuzzy approaches [15], information-theoretic techniques [16], and independent component analysis [17] have been proposed in the literature to determine the complexity of an image. Rosenholtz et al. [18] presented two measures of visual clutter, based on feature congestion and subband entropy, relating them to visual complexity. Edge density has been used by Mack and Oliva [19] to predict subjective judgments of image complexity. Recently, new measures of image complexity have been proposed as combinations of single image features [20, 21]. In the context of image quality, Allen et al. [22] observed that the perception of distortions is influenced by the amount of detail in the images. Following the above-mentioned results, in the present work we adopt edge density as a low-level feature representative of visual complexity.

2.1 A Fuzzy Approach to Group Images by Complexity

Our complexity-based grouping strategy is based on fuzzy clustering and starts from the work of Chacon et al. [23]. It is a two-step method.

First Step. The aim is to decompose the edges in the images into five levels based on their “edgeness”. To this end, we have collected 23 images of about 0.5 Megapixel each. These images belong to a personal database and were chosen to represent different contents (close-ups, landscapes, portraits, etc.). For each image \(I_k\) (\(k=1\dots 23\)), we compute the norm of the gradient of the intensity channel \(Y_k\). Edge pixels are selected by thresholding the gradient magnitude; the natural logarithms of the values above a threshold T are then collected in an edge vector \(\mathbf {E}\):

$$\begin{aligned} \mathbf {E} = \left\{ \left\{ \ln \left\| \nabla Y_k \right\| _2 \; : \; \left\| \nabla Y_k\right\| _2 > T \right\} , \,\,\, k=1\dots 23\right\} \end{aligned}$$
(1)
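The construction of the edge vector in Eq. 1 can be sketched as follows. This is a minimal illustration: the threshold value T and the use of central finite differences (`numpy.gradient`) as the gradient operator are our assumptions, as the paper does not specify either.

```python
import numpy as np

def edge_vector(images, T=2.0):
    """Collect log-gradient magnitudes above threshold T (Eq. 1).

    `images` is a list of 2-D arrays holding the intensity (Y) channel;
    the default threshold T is a placeholder value, not taken from the paper.
    """
    E = []
    for Y in images:
        gy, gx = np.gradient(Y.astype(float))   # finite-difference gradient
        mag = np.hypot(gx, gy)                  # ||grad Y||_2 per pixel
        mask = mag > T                          # keep edge pixels only
        E.append(np.log(mag[mask]))             # ln ||grad Y||_2
    return np.concatenate(E)
```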

To identify the five edge levels, we have applied the Fuzzy C-Means (FCM) algorithm [24] to the elements of \(\mathbf {E}\). Fuzzy clustering methods allow objects to belong to several clusters simultaneously, with different degrees of membership. The FCM algorithm minimizes the functional:

$$\begin{aligned} J_{FCM}=\sum ^{C}_{i=1}\sum ^{N}_{j=1}\mu _{ij}^{p}d_{ij}^2 \end{aligned}$$
(2)

where C is the number of clusters, N is the number of objects, \(p\in \left[ 1, \infty \right) \) is a parameter which determines the fuzziness of the resulting clusters, and \(\mu _{ij}\) is the membership function that satisfies the following constraint, for \(j=1,2,...,N\):

$$\begin{aligned} \sum ^{C}_{i=1}\mu _{ij}=1 \end{aligned}$$
(3)

In Eq. 2, \(d_{ij}\) is the Euclidean distance between the j-th object, defined through its feature vector \(\mathbf {v}_j\), and the center \(\mathbf {c}_i\) of the i-th fuzzy cluster:

$$\begin{aligned} d_{ij}=\left\| \mathbf {v}_{j}-\mathbf {c}_{i}\right\| _{2} \end{aligned}$$
(4)

In our case, the objects correspond to the elements in \(\mathbf {E}\). The feature space is one-dimensional, and \(\mathbf {v}_{j}\) corresponds to the j-th element of the vector \(\mathbf {E}\). N is the size of the vector, about \(2\times 10^{6}\). The number of clusters C corresponds to the number of edge levels and is 5. Finally, the fuzziness parameter has been set to \(p=2\).

The minimization of the C-Means functional in Eq. 2 represents a nonlinear optimization problem that can be solved by a variety of methods. In our work we have used Picard iteration as implemented in the Fuzzy Clustering and Data Analysis Toolbox, available online and developed by Balasko et al. [25].
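A minimal sketch of the Picard iteration for the one-dimensional FCM problem of Eqs. 2–4 follows. The quantile-based initialization and the fixed iteration count are our assumptions; the cited toolbox may use different choices.

```python
import numpy as np

def fcm(v, C=5, p=2.0, iters=100):
    """Fuzzy C-Means by Picard iteration on 1-D data (Eqs. 2-4).

    Alternates the two first-order optimality conditions of the functional:
    memberships from distances, then centers from memberships.
    """
    v = np.asarray(v, float)
    c = np.quantile(v, (np.arange(C) + 0.5) / C)     # spread initial centers
    for _ in range(iters):
        d = np.abs(v[None, :] - c[:, None]) + 1e-12  # d_ij, shape C x N
        w = d ** (-2.0 / (p - 1.0))
        mu = w / w.sum(axis=0)                       # Eq. 3: sum_i mu_ij = 1
        c = (mu ** p @ v) / (mu ** p).sum(axis=1)    # weighted-mean center update
    return c, mu
```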

Second Step. The aim is to categorize images into three complexity groups based on the edge decomposition. For this purpose, 370 images are used: 300 of them belong to the BSDS300 database [26], the remaining 70 belong to a personal database and are different from those used in the first step of clustering. We use the Fuzzy Gath-Geva (FGG) clustering method on the edge decomposition of the images to find three clusters corresponding to high, medium and low complexity. For each image \(I_k\) (\(k=1\dots 370\)) we compute a four dimensional feature vector \(\mathbf {w}_{k}\). The first three elements correspond to the densities of the first three edge levels; the fourth element is the sum of these values.
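Under the assumption that the edge-level decomposition is available as a per-pixel map of levels (0 for non-edge pixels, 1–5 for the five levels; this intermediate representation is our reading of the method, not stated explicitly in the paper), the four-dimensional feature vector \(\mathbf {w}_{k}\) could be computed as:

```python
import numpy as np

def complexity_features(level_map):
    """Feature vector w_k: densities of the first three edge levels plus their sum.

    `level_map` is a 2-D integer array assigning each pixel its edge level
    (0 = non-edge); density is the fraction of image pixels at a given level.
    """
    n_pix = level_map.size
    dens = [np.count_nonzero(level_map == l) / n_pix for l in (1, 2, 3)]
    return np.array(dens + [sum(dens)])
```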

We adopt FGG here as it is able to detect clusters of varying shape, size and density. FCM, instead, can only detect clusters with the same shape and orientation; in particular, the Euclidean norm induces hyper-spherical clusters. This is not a problem for the edge-level decomposition step, where we are interested in creating more uniform clusters. The distance norm adopted by FGG is based on the fuzzy maximum likelihood estimates proposed in [27]:

$$\begin{aligned} d_{ik}=\frac{\left( \det \mathbf F _{i}\right) ^\frac{1}{2}}{\frac{1}{N}\sum _{k=1}^N\mu _{ik}} \exp \left( \frac{1}{2}\left( \mathbf {w}_{k}-\mathbf {c}_{i}\right) ^{T}\mathbf F _{i}^{-1}\left( \mathbf {w}_{k}-\mathbf {c}_{i}\right) \right) \end{aligned}$$
(5)

where \(\mathbf F _i\) is the fuzzy covariance matrix of the i-th cluster, given by:

$$\begin{aligned} \mathbf F _i=\frac{\sum _{k=1}^N\left( \mu _{ik}\right) ^p \left( \mathbf {w}_{k}-\mathbf {c}_{i}\right) \left( \mathbf {w}_{k}-\mathbf {c}_{i}\right) ^{T}}{\sum _{k=1}^N\left( \mu _{ik}\right) ^p} \end{aligned}$$
(6)

In our case the fuzziness parameter is \(p=2\), and \(N=370\). Note that FGG needs a good initialization, due to the exponential distance norm. To this end we use the output of FCM to initialize the algorithm.
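A sketch of one FGG evaluation step, computing the fuzzy covariance matrices of Eq. 6 and the distances of Eq. 5 from a given membership matrix and set of cluster centers (the full algorithm alternates this step with membership and center updates, which we omit here):

```python
import numpy as np

def fgg_distance(W, mu, c, p=2.0):
    """FMLE distances (Eq. 5) via the fuzzy covariance matrices (Eq. 6).

    W  : N x D feature vectors (here D = 4 edge-density features),
    mu : C x N membership matrix, c : C x D cluster centers.
    """
    C, N = mu.shape
    D = np.empty((C, N))
    for i in range(C):
        diff = W - c[i]                                # N x D deviations
        w = mu[i] ** p
        F = (w[:, None] * diff).T @ diff / w.sum()     # fuzzy covariance, Eq. 6
        alpha = mu[i].sum() / N                        # prior term in Eq. 5
        m = np.einsum('nd,de,ne->n', diff, np.linalg.inv(F), diff)
        D[i] = np.sqrt(np.linalg.det(F)) / alpha * np.exp(0.5 * m)
    return D
```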

The above-described method basically performs an unsupervised classification of an image into one of three complexity categories. Formally, denoting by \(\mathcal {F}\) the function performing the category labelling, for a given image \(I_k\) we have:

$$\begin{aligned} \mathcal {F}(I_k)=z_k \,\,\text{ with }\,\, z_k\in \{l,m,h\} \end{aligned}$$
(7)

where l, m, and h are the labels for the low, medium and high complexity categories, respectively.

3 Experimental Results

Different databases are available to test algorithms’ performance with respect to human subjective judgments [28]. Among them, we have chosen the following three reference IQ databases: the LIVE database [11, 29], containing 29 reference images and 779 distorted images with JPEG and JPEG2000 compression, Gaussian Blur (BLUR), Additive Gaussian White Noise (WHITE NOISE), and FAST FADING; the CSIQ database [2, 12], containing 30 reference images and 866 distorted images with JPEG and JPEG2000 compression, Gaussian Blur (BLUR), Additive White Gaussian Noise (AWGN), Additive Pink Gaussian Noise (FNOISE), and Global Contrast (CONTRAST); and the LIVE-MD database [10], containing 15 reference images and 405 multiply distorted images (BLUR+JPEG, and BLUR+NOISE).

We first apply the fuzzy approach described by Eq. 7 to each of the three databases. The complexity categories obtained are depicted in Fig. 1.

As full reference metrics, we focus here on 17 metrics available in the literature, considering different kinds of approaches: from the simplest ones, like MSE or PSNR, to more sophisticated metrics that estimate quality based on image structure, use statistical and information-theoretic approaches, or are based on models of the Human Visual System (HVS).

The full list of the 17 FR metrics evaluated is: Mean-Squared-Error (MSE), Signal Noise Ratio (SNR), Peak Signal-to-Noise-Ratio (PSNR), Universal Quality Index (UQI) [30], Structural Similarity Index (SSIM) [31], Multi-Scale SSIM index (MS-SSIM) [32], Visual Signal-to-Noise Ratio (VSNR) [33], Information Fidelity Criterion (IFC) [34], Visual Information Fidelity (VIF) [35], Most Apparent Distortion (MAD) [2], PSNR-HVS [36], PSNR-HVSM [37], Information content Weighted-SSIM (IW-SSIM) [38], Information content Weighted-PSNR (IW-PSNR) [38], Feature Similarity Index (FSIM) [39], Gradient magnitude Similarity Deviation (GMSD) [40], and Divisive Normalization Metric (DN) [41].

Fig. 1. Original images from LIVE, CSIQ and LIVE-MD databases grouped in the three categories: low, medium and high complexity.

If we consider a single metric and the corresponding subjective scores for a given database, a logistic regression curve can be computed. The correlation performance is evaluated using the Pearson Correlation Coefficient (PCC), that is, the linear correlation coefficient between the quality predicted by the metric and the subjective scores. We refer to the regression using all the data within each database as \(R_{all}\), while the regressions within each of the Low, Medium and High complexity groups are referred to as \(R_{L}, R_{M}\) and \(R_{H}\), respectively.

We present the performance of the 17 FR metrics when applied to each whole dataset (LIVE, CSIQ, and LIVE-MD) and to each of the corresponding complexity groups separately. To better visualize the results, for each metric and each regression we consider the relative improvement \(\varDelta \) of the PCC on the three groups w.r.t. the PCC on the whole dataset:

$$\begin{aligned} \varDelta = \frac{PCC(X) - PCC(R_{all})}{PCC(R_{all})} \times 100 \end{aligned}$$
(8)

where X is either \(R_L\), \(R_M\), or \(R_H\).
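The relative improvement of Eq. 8 is a straightforward computation once the PCC of each regression is available; a sketch follows, using the sample Pearson coefficient and taking hypothetical score vectors as input.

```python
import numpy as np

def pearson(a, b):
    """Pearson linear correlation coefficient between two score vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.corrcoef(a, b)[0, 1])

def delta(pcc_group, pcc_all):
    """Relative PCC improvement of a complexity group vs. the whole set (Eq. 8)."""
    return (pcc_group - pcc_all) / pcc_all * 100.0
```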

Table 1. Distortion independent analysis results. Relative increase of PCC for LIVE, CSIQ and LIVE-MD data. Positive increases are shown in bold.

3.1 Distortion Independent Analysis

In Table 1 we report the performance, in terms of PCC using a Logistic Regression (\(R_{all}\)), for each of the 17 FR metrics on the three databases, considering all the distortions together. This performance is compared with the corresponding one obtained considering each complexity group separately (\(R_{H}, R_{M}\) and \(R_{L}\)); results are reported in terms of Eq. 8. We show in boldface the coefficients that have a positive improvement within the complexity groups. Observing these results we note that:

  • the correlation coefficients obtained for \(R_{all}\) are in agreement with the values reported in the literature [40, 42];

  • MAD and GMSD are the most competitive metrics for both LIVE and CSIQ data;

  • for LIVE-MD data, the performance of nearly all the metrics is lower than the corresponding one for LIVE and CSIQ;

  • nearly all the metrics improve their performance when evaluated on a subset of images of equivalent complexity; in particular, the signal-based metrics MSE, SNR and PSNR, as well as the SSIM, PSNR-HVSM, and PSNR-HVS metrics, exhibit the highest improvements;

  • the metric least affected by the complexity grouping across the three datasets is MS-SSIM;

    Table 2. Relative increase of PCC for LIVE data on the 17 metrics. Positive increase is shown in bold.
    Table 3. Relative increase of PCC for CSIQ data for the 17 metrics. Positive increase is shown in bold.
  • the most relevant results are found in the LIVE-MD data, where several metrics exhibit two-digit improvements when the three complexity groups are considered;

  • overall, the \(R_H\) group exhibits the most relevant improvement with respect to \(R_{all}\).

3.2 Distortion Dependent Analysis

In Tables 2, 3, and 4 we report the detailed results, in terms of PCC and relative improvement, for each distortion present in each dataset. For LIVE and CSIQ we observe that:

  • the performance of nearly all the metrics is improved when evaluated on each single complexity group;

  • for both LIVE and CSIQ, the signal-based metrics are the ones with the greatest \(\varDelta \) for the JPEG, JPEG2000, BLUR and FAST FADING distortions;

  • for both LIVE and CSIQ, UQI and IFC are the metrics with the highest \(\varDelta \) for the noise distortions WHITE NOISE, AWGN, and FNOISE;

  • for the CONTRAST distortion, a noticeable increase of around 50% is observed for the IFC metric.

For the LIVE-MD we observe that:

  • the \(\varDelta \) improvements are greater than in the case of the single-distortion datasets;

  • we notice high increases for signal-based metrics, PSNR-HVSM, and PSNR-HVS;

  • we can also notice that, in general for these metrics, the improvements for blur followed by JPEG are greater than the corresponding ones for blur followed by noise. These results are in accordance with the performance of the considered metrics in the case of single distortions.

Finally, we have performed the same analysis on the three datasets, but with the images randomly grouped. We have thus verified that the improvements obtained are not due to the groups having lower cardinality than the whole dataset; rather, they are related to the fact that within each group the images have similar content (in terms of complexity). These results will be made available on our website.

4 Conclusions

In this paper we have studied the interaction between distortions and image content when assessing image quality. We have presented an extensive analysis of the performance of state-of-the-art FR metrics when evaluated within groups of images of similar complexity. We have proposed a fuzzy clustering technique to categorize images into three groups (low, medium and high) according to their complexity in terms of low-level features. Our experiments show that, in general, a significant gain in performance of all the FR metrics considered is achieved when quality is separately evaluated on the three complexity groups. These results are consistent across quality metrics, distortion types and image datasets. In particular, signal-based metrics are the ones exhibiting the highest improvements. For the multi-distorted data we also observed a significant improvement for all the metrics. This result is encouraging, as assessing the quality of multi-distorted data is a challenging task and currently an open issue.

Table 4. Relative increase of PCC for LIVE-MD data. Results for the different types of distortion are shown for the 17 FR metrics. Positive increases are shown in bold.