
1 Introduction

Recent rapid development of virtual reality (VR) technologies has led to a new 360-degree look-around visual experience. By displaying stereoscopic 360° scenes in head-mounted rigs such as the Oculus Rift, users can perceive an immersive sensation of reality. Image stitching is typically used to construct a seamless 360° view from multiple captured viewpoint images, and thus the quality of the stitched scene is crucial in determining the level of immersive experience provided. The widely adopted stitching process [5, 22, 23] can be broadly divided into the following steps: (i) register the capturing cameras and project each captured scene accordingly, (ii) merge overlapping spatial regions based on the corresponding camera parameters, and (iii) smooth/blend over the merged scene. Although errors may be introduced at each step, noticeable distortions usually arise in the misaligned overlapping regions around scene objects, which we call shape breakage. The subsequent blending step is employed to alleviate the breakage by imposing a consistency constraint over the entire scene [7, 16]. The misaligned scene objects, after being blended, are exposed as ghosting or object fragments [15, 19]. Because of the uniqueness of the two most common error types in image stitching, namely shape breakage (including object fragments) and ghosting, stitched image quality assessment (SIQA) is fundamentally different from conventional image quality assessment (IQA) for images distorted by compression artifacts or network packet losses [17, 18, 24]. Specifically, conventional IQA methods focus on the global influence of various noise types on visual comfort or information integrity, while SIQA targets local distortions that damage object or scene integrity. Further, the most common distortions in conventional IQA come from compression losses, which do not apply to SIQA tasks.

Fig. 1. Comparison of example datasets used for conventional IQA and SIQA experiments. It is easily observed that conventional IQA samples are evenly distorted over the image, while SIQA distortions appear in local patches. Patches b and c contain ghosting distortions; patches a and d are undistorted patches of high image quality.

Hence, the process of assessing stitched image quality can be understood as searching for stitching errors over the composed scene: a process of locating and assessing particular error types rather than an overall assessment of every local spatial region. Thus, it is necessary to study SIQA as a new problem apart from conventional IQA. Figure 1 illustrates the comparison between typical samples used for IQA and SIQA. It is clear that most parts of the stitched image have approximately reference quality as in IQA tasks, while the noisy IQA samples do not exhibit the prominent local shape distortions seen in SIQA tasks; the two groups of samples hardly share a common basis for comparison.

In this paper, we propose to assess stitched image quality with an error-localization-quantification algorithm. First, we detect potential error regions by searching through the entire stitched image in local patches of a unified size, each patch being classified as “intact” or “distorted”. The decision is made by a classifier trained as a convolutional neural network (CNN) [21]. Then the detected regions are refined into finer regions according to the extent of error. This refinement is conducted within each potential region over finer pixel patches, which are retained in or removed from the coarse region according to their contribution towards the region being tagged as distorted. Finally, after obtaining refined regions that tightly bound the distortions, a quantitative metric is formulated on the refined patches, assessing both the error range and the error extent.

Contributions: Our contributions are twofold. First, we propose a new algorithm for the SIQA task. The proposed error-localization-quantification metric is simple, straightforward, and requires no reference images. Further, our method outputs the explicit locations of errors, which is far more meaningful for stitching algorithm optimization than an evaluation score alone. Second, the successful localization of multiple error types in our pipeline demonstrates that CNNs have a remarkable ability to detect spatial patterns beyond scene object detection. This observation implies the possibility of generic classification, localization and concept discovery.

The paper is organized as follows. Section 2 discusses previous related works in SIQA. Section 3 introduces our proposed method. Experimentation is presented in Sect. 4, and Sect. 5 draws the conclusion.

2 Related Work

Two lines of work relate to the proposed method: previous SIQA methods, and deep features for discriminative localization.

Previous SIQA methods: In contrast with the rapid emergence of panoramic techniques, work on evaluating stitched panoramic image quality remains scarce and slow in development. Most previous SIQA metrics pay more attention to photometric error assessment [12, 13, 20] than to errors caused by misalignment. In [12] and [20], misalignment error types are omitted and the metrics focus on color correction and intensity consistency, which are low-level representations of the overall distortion level. [13] tries to quantify the error by computing the structural similarity index (SSIM) on the high-frequency information of the difference between the stitched and unstitched images in the overlapping region. However, since the unstitched images used for testing are directly cropped from the reference, the effectiveness of the method is not validated. The work in [10] pays more attention to assessing video consistency among subsequent frames and only adopts a luminance-based metric around the seam. In [14], the gradient of the intensity difference between the stitched and reference images is adopted to assess the geometric error; however, the experiments are conducted on merely 6 stitched scenes and references, which is insufficient to validate a designed metric. We observe that most previous SIQA metrics require a full reference [6], which is difficult to obtain in panorama-related applications. Moreover, hardly any SIQA method directly indicates where the distortion is, which limits the metric’s guidance for stitching algorithms.

In contrast, our work handles the assessment through an error detection algorithm, which directly indicates the locations of errors and naturally requires no reference. The method is described in the next section.

Deep feature-based discriminative localization: Convolutional neural networks (CNNs) have led to impressive performance on a variety of visual recognition tasks [8, 9, 26]. Much recent work shows their remarkable ability to localize objects, and their potential to be transferred to other generic classification, localization and concept discovery tasks [2, 25]. Most of the related works are based on weakly-supervised object localization. In [3], the regions that cause the maximal activations are masked out with a self-taught object localization technique. In [11], a method is proposed for transferring mid-level image representations, achieving object localization by evaluating the CNN outputs on overlapping patches. [25] uses class activation maps, the weighted activation maps generated for each image. In [2], a method for performing hierarchical object detection is proposed under the guidance of a deep reinforcement learning agent.

While global average pooling is not a novel technique that we propose here, the observation that it can be applied to a nonphysical spatial pattern, namely error localization, and its use for solving image-quality-related problems are, to the best of our knowledge, unique to our work. We believe the effectiveness and simplicity of the proposed method will make it generic for other IQA tasks.

Fig. 2. The coarse error localization pipeline. A ResNet model is truncated and followed by a flatten layer and a softmax layer of 2 classes. The trained classifier is applied in a top-down search which categorizes local patches as distorted or intact. The detected patches are labeled with red-shadowed bounding boxes.

3 Proposed Method

The proposed method assesses the quality of any stitched image and locates its distorted regions. It is constructed with three steps: coarse error localization, error-activation-guided refinement, and error quantification.

3.1 Coarse Error Localization

There are two common error types in a stitched scene: ghosting and shape breakage. We employ a ResNet model, a state-of-the-art architecture, to obtain a two-class classifier between “intact” and “distorted”. The fine-tuned model is later utilized for error localization refinement. Even though it is possible for a single patch to hold both types of error at the same time, the detection of each error type is done separately for later assessment. As shown in Fig. 2, we feed the model with labeled bounding boxes containing errors as “distorted” examples, and perfectly aligned areas as “intact” examples. With ResNet, we achieve a high classification accuracy (reported in Sect. 4). With the classifier, we coarsely localize the errors throughout the stitched image.

To protect potentially continuous distortion regions while preserving the fineness of the search, we make a trade-off between the window size and the sliding step size. In a complex scene composed of multiple objects, the object volume has a prominent effect on visual saliency [1, 4]. We assume this also applies to texture patterns such as shape breakage or ghosting, and thus the integrity of a distorted region must be preserved. To this end, we merge adjacent patches with the same tag, as illustrated in Fig. 2. The merged patches form the coarse error localization.
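For illustration, a minimal sketch of this merging step, assuming each detected patch is an axis-aligned box (x0, y0, x1, y1) and treating boxes that touch or overlap as adjacent; the adjacency test and the greedy grouping strategy are our own illustrative choices:

```python
def boxes_adjacent(a, b):
    """True if two axis-aligned boxes (x0, y0, x1, y1) touch or overlap."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def merge_same_tag_patches(patches):
    """Merge adjacent patches carrying the same tag into enclosing boxes,
    repeating until no two boxes touch; the result forms the coarse
    error localization."""
    boxes = list(patches)
    changed = True
    while changed:
        changed = False
        merged = []
        for box in boxes:
            for i, m in enumerate(merged):
                if boxes_adjacent(box, m):
                    merged[i] = (min(box[0], m[0]), min(box[1], m[1]),
                                 max(box[2], m[2]), max(box[3], m[3]))
                    changed = True
                    break
            else:
                merged.append(box)
        boxes = merged
    return boxes
```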

Fig. 3. After the coarse error localization, error-activation mapping further guides the refinement of the locally detected error patches. The refined localization better shapes the distorted areas.

3.2 Error-Activation Guided Refinement

After obtaining the coarse error regions, a refined localization that more precisely describes the range of each error is required for accurate error description. We find the class activation mapping considerably discriminative in describing image regions with errors; as a result, we trim the coarse regions with error-activation-guided refinement. The network we fine-tuned for coarse error detection, the ResNet architecture, largely consists of convolutional layers. Similar to [25], we project the weights of the output layer back onto the convolutional feature maps, thus obtaining the importance of each pixel patch in activating a region to be categorized as containing an error or not; we call this process error-activation mapping.

The error-activation mapping is obtained by computing the weighted sum of the feature maps of the last convolutional layer. For a stitched image with error type T, the error-activation mapping E at spatial location (x, y) is computed by Eq. 1:

$$\begin{aligned} E_{T}(x,y) = \sum _{i} \omega ^{T}_{i}f_{i}(x,y), \end{aligned}$$
(1)

where \(f_{i}(x,y)\) is the activation of unit i in the last convolutional layer at (x, y), and \(\omega ^{T}_{i}\) is the weight of unit i for error type T, i.e., the importance of unit i’s global average pooling result. The score of an image being diagnosed with error type T can be expressed as Eq. 2:

$$\begin{aligned} S_T = \sum _{x,y}\sum _{i} \omega _{i}^{T}f_{i}(x,y). \end{aligned}$$
(2)

Hence the error-activation mapping \(E_{T}(x,y)\) directly represents the importance of the activation at (x, y) in leading to the image being diagnosed with error type T. The obtained error-activation mapping serves as guidance for error localization refinement.
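For illustration, a minimal NumPy sketch of Eqs. 1 and 2, assuming the last-layer feature maps and the per-channel weights for error type T have already been extracted from the fine-tuned network (array names and shapes are illustrative):

```python
import numpy as np

def error_activation_mapping(feature_maps, class_weights):
    """Eq. 1: weighted sum of the last convolutional feature maps.

    feature_maps  : array of shape (H, W, C), the activations f_i(x, y).
    class_weights : array of shape (C,), the weights w_i^T for error type T.
    Returns the error-activation map E_T of shape (H, W).
    """
    return np.tensordot(feature_maps, class_weights, axes=([2], [0]))

def error_score(feature_maps, class_weights):
    """Eq. 2: the score S_T of the image being diagnosed with error type T."""
    return float(error_activation_mapping(feature_maps, class_weights).sum())
```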

For each coarsely localized region, we apply the error-activation mapping as a filter. The threshold is adaptive according to how strict the filter should be; here we adopt the global average of the mapping. Despite its simplicity, the refinement process integrates global activation information into the locally categorized patches, which naturally protects the overall integrity of distorted regions. The entire refinement process is demonstrated in Fig. 3.
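A minimal sketch of this filtering, assuming the error-activation map has been upsampled to image resolution and that each coarse region is given as a bounding box; the region format and the upsampling are our own illustrative assumptions:

```python
import numpy as np

def refine_region(eam, region):
    """Keep only the pixels of a coarse region whose error activation
    exceeds the global average of the error-activation map.

    eam    : (H, W) error-activation map, upsampled to image resolution.
    region : (x0, y0, x1, y1) coarse bounding box from Sect. 3.1.
    Returns a boolean mask of shape (H, W) marking the refined error location.
    """
    threshold = eam.mean()  # adaptive threshold: the global average
    x0, y0, x1, y1 = region
    mask = np.zeros(eam.shape, dtype=bool)
    mask[y0:y1, x0:x1] = eam[y0:y1, x0:x1] > threshold
    return mask
```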

3.3 Error Quantification

To quantify the error and form a unified metric, we combine a twofold evaluation: the error range and the distortion level. The range is easily represented by the area of the refined error location, while the distortion level is represented by the error-activation mapping weights.

The range index \(M_r^j\) of a refined error location j is formulated as follows:

$$\begin{aligned} M_r^j = A^j/A, \end{aligned}$$
(3)

where \(A^j\) is the area of the refined error location j and A indicates the total area of the image. The distortion level \(M_d^j\) of a refined error location j is represented as the sum of the error-activation mapping within it:

$$\begin{aligned} M_d^j = \sum _{x,y} E_T \left( x, y\right) . \end{aligned}$$
(4)

The quantification of error for location j is represented as:

$$\begin{aligned} M^j = [M_r^j]^{\alpha _j} \cdot [M_d^j]^{\beta _j}, \end{aligned}$$
(5)

where the exponents \(\alpha _j\) and \(\beta _j\) are used to adjust the relative importance of the range and the distortion level. Finally, the quantification of error for an entire stitched image M is formulated as Eq. 6:

$$\begin{aligned} M = \sum _j M^j = \sum _j [M_r^j]^{\alpha _j} \cdot [M_d^j]^{\beta _j} \end{aligned}$$
(6)
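For illustration, a minimal sketch of Eqs. 3–6, assuming each refined error location is given as a boolean mask over the image (as produced by the refinement step) and that the exponents are supplied per location; Sect. 4 uses \(\alpha =1\) and \(\beta =10\):

```python
import numpy as np

def location_score(eam, mask, alpha=1.0, beta=10.0):
    """Eqs. 3-5 for a single refined error location j.

    eam  : (H, W) error-activation map E_T.
    mask : (H, W) boolean mask of the refined error location j.
    """
    M_r = mask.sum() / mask.size     # Eq. 3: range index, area ratio A^j / A
    M_d = float(eam[mask].sum())     # Eq. 4: distortion level within the location
    return (M_r ** alpha) * (M_d ** beta)  # Eq. 5

def image_score(eam, masks, alpha=1.0, beta=10.0):
    """Eq. 6: quantified error of the entire stitched image."""
    return sum(location_score(eam, m, alpha, beta) for m in masks)
```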

4 Experimentation

Experiment data: All experiments are conducted on our stitched image quality assessment benchmark, the SIQA dataset, which is based on synthetic virtual scenes, since we aim to evaluate the proposed metric for various stitching algorithms under ideal photometric conditions. The images are obtained by establishing virtual scenes with the powerful 3D modeling tool Unreal Engine, as illustrated in Fig. 4. A synthesized 12-head panoramic camera is placed at multiple locations in each scene, covering the \(360^\circ \) surrounding view, and each camera has a field of view (FOV) of \(90^\circ \). Exactly one image is taken by each of the 12 cameras at a given location simultaneously. The SIQA dataset utilizes twelve different 3D scenes varying from wild landscapes to structured scenes; stitched images are obtained using the popular off-the-shelf stitching tool Nuke, yielding altogether 408 stitched scenes. The original images are in high definition with \(3k-by-2k\) in size.

Fig. 4. The 12-head panoramic camera established in a virtual scene using the Unreal Engine. The stitched view is composed of adjacent camera views with overlap.

Fig. 5. The coarse error localization result (example error type: breakage) and the corresponding error-activation mapping.

We label the two error types manually in each scene; a scene might contain multiple regions with a single error type or both, or there might be no distortion at all. For ghosting, 297 bounding boxes are labeled, and for shape breakage, 220 bounding boxes are labeled.

Fig. 6. The error-activation mapping for each error type under the same scene. It is clearly demonstrated that the mapping provides differentiated guidance for each error type.

Fig. 7. The refined error localization using the error-activation guidance, and the quantification result \(M^j\) of each error location.

Coarse error localization: Our fine-tuned model for categorizing “intact” and “distorted” patches is the ResNet-50 architecture with a TensorFlow backend. We truncate the network after the bn5c_branch2c layer and append a flatten layer and a softmax layer of 2 classes. We choose \(epoch=50\) and \(batchsize=16\) as the training parameters. The model is fine-tuned separately for the two error types, and the classifier achieves a remarkable accuracy of \(95.5\%\) for shape breakage and \(96.5\%\) for ghosting, as illustrated in Table 1.
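For illustration, a minimal Keras sketch of this truncation and fine-tuning setup; the layer name bn5c_branch2c follows the original Keras ResNet-50 naming (newer tf.keras releases rename it, e.g. to conv5_block3_3_bn), and the patch directories and input size are our own illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

# ResNet-50 backbone, truncated after bn5c_branch2c, followed by
# a flatten layer and a 2-class softmax ("intact" vs. "distorted").
backbone = ResNet50(weights="imagenet", include_top=False,
                    input_shape=(224, 224, 3))
trunk = backbone.get_layer("bn5c_branch2c").output
x = layers.Flatten()(trunk)
out = layers.Dense(2, activation="softmax")(x)
classifier = models.Model(inputs=backbone.input, outputs=out)

classifier.compile(optimizer="adam",
                   loss="categorical_crossentropy",
                   metrics=["accuracy"])

# Hypothetical directories of labeled patches for one error type.
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "train_patches", image_size=(224, 224), batch_size=16,
    label_mode="categorical")
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "val_patches", image_size=(224, 224), batch_size=16,
    label_mode="categorical")

# epoch = 50, batch size = 16, fine-tuned separately per error type.
classifier.fit(train_ds, validation_data=val_ds, epochs=50)
```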

Table 1. Test results of classifying each error type using the fine-tuned ResNet architecture.

The test results of the fine-tuned classifier are listed in Table 1. With this strong ability to distinguish distorted from undistorted regions, we impose a top-down search for distorted patches through the entire image. Considering the object size with respect to the image size, we choose a \(400 \times 400\) window for ghosting and two window shapes of \(200 \times 800\) and \(800 \times 200\) for shape breakage. The differentiated window shapes are chosen according to the following analysis. To tell whether a region is ghosted, one must refer to the nearest object to decide where the duplicated artifact comes from, which mostly fits within square patches. However, to see whether shape breakage exists, one must refer to the adjacent edge or silhouette to examine whether the shape integrity is damaged; in this case we design both vertical and horizontal window shapes to allow breakage detection. As mentioned earlier, we choose a small sliding step size in order to protect region continuity; here we use \(stepsize = 100\) for both error types. We then merge adjacent patches with the same type of error, thus obtaining the coarse localization of error for the entire scene. As Fig. 5 shows, the integrity of continuous distorted regions is largely preserved.
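For illustration, a minimal sketch of this top-down search with the window shapes and step size listed above, assuming a classify(patch) wrapper around the fine-tuned classifier that returns "distorted" or "intact" (the wrapper and the box format are illustrative):

```python
# Window shapes used above: 400x400 for ghosting, 200x800 and 800x200
# for shape breakage; sliding step size of 100 for both error types.
WINDOWS = {"ghosting": [(400, 400)],
           "breakage": [(200, 800), (800, 200)]}
STEP = 100

def coarse_search(image, error_type, classify):
    """Slide the error-type-specific windows over the image and collect
    every patch that the classifier tags as "distorted".

    image    : array of shape (H, W, 3).
    classify : callable taking an image patch and returning "distorted"
               or "intact" (a hypothetical wrapper around the classifier).
    Returns a list of (x0, y0, x1, y1) boxes to be merged into the
    coarse error localization.
    """
    height, width = image.shape[:2]
    boxes = []
    for win_h, win_w in WINDOWS[error_type]:
        for y in range(0, height - win_h + 1, STEP):
            for x in range(0, width - win_w + 1, STEP):
                if classify(image[y:y + win_h, x:x + win_w]) == "distorted":
                    boxes.append((x, y, x + win_w, y + win_h))
    return boxes
```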

Error-activation guided refinement: By projecting the weights of the output layer back onto the convolutional feature maps, we obtain the error-activation mapping for the image, as demonstrated in Fig. 6. We can see that the discriminative regions of the image for each error type are highlighted. We also observe that the discriminative regions for different error types differ for a given image, which suggests that the error-activation guidance works as expected.

We apply the error-activation guidance as a filter whose input is the coarsely localized regions. The results, as Fig. 7 shows, are quite impressive: the regions containing errors are prominently refined and explicitly delineate the distorted areas. Based on the properly refined regions, the subsequent quantification can reliably characterize each error type.

Error quantification: We compute the quantified error for each error location, and then for the entire image, according to Eqs. 5 and 6 introduced in the last section. The relative importance parameters are set to \(\alpha =1\) and \(\beta =10\). To illustrate the objectivity of the metric, we make extensive comparisons among the error patches. As demonstrated in Fig. 7, we compare local errors of similar size but different levels of distortion, as well as errors with similar scores. Locations a and g are both from structured scenes and of similar size; however, the shape of the television in g is much more damaged than in a, thus obtaining a relatively higher error score. Similarly, location f has an extensive error range but relatively slight distortion, so its quantified score is reduced by the distortion level. We also compare the results across various stitched scenes. An interesting observation is that in natural scenes with less structured context, the metric is still capable of locating distortions that are much less noticeable to human vision. This phenomenon confirms the error-localization ability of our method.

5 Conclusion

In this paper we propose an error-activation-guided metric for stitched panoramic image quality assessment, which requires no reference at all. Our method not only provides a proper evaluation of the stitched image quality, but also directly indicates the explicit locations of errors. The method is constructed from three main steps: coarse error localization, error-activation-guided refinement, and error quantification. Results reveal the error-localization ability of the proposed method, and the extensive comparisons also demonstrate the effectiveness of our metric and its ability to distinguish minor distortion levels in detail.