1 Image Stitching and Parallax Errors

The problem of image stitching, or the creation of a panorama from a set of overlapping images, is a well-studied topic with widespread applications [5,6,7]. Most modern digital cameras include a panorama creation mode, as do iPhones and Android smartphones. Google Street View presents the user with panoramas stitched together from images taken from moving vehicles, and the overhead views shown in map applications from Google and Microsoft are likewise stitched together from satellite images. Despite this ubiquity, stitching is far from solved. In particular, stitching algorithms often produce parallax errors even in a static scene with objects at different depths, or a dynamic scene with moving objects. An example of motion errors is shown in Fig. 1.

The stitching problem is traditionally viewed as a sequence of steps that are optimized independently [6, 7]. In the first step, the algorithm computes a single registration for each input image to align them to a common surface. The warped images are then passed on to the seam finding step; here the algorithm determines the registered image it should draw each pixel from. Finally, a blending procedure [9] is run on the composite image to correct visually unpleasant artifacts such as minor misalignment, or differences in color or brightness due to different exposure or other camera characteristics.

Fig. 1.

Motion errors example. The strip of papers with numbers has undergone translation between input images. Our result in (b) shows the use of multiple registrations. Green: the reference, Red: registration aligning the number strip, Blue: registration aligning the letter strip. Autostitch result in (c) has visible ghosting on the number strip. (Color figure online)

In this paper, we argue that currently existing methods cannot capture the required perspective changes for scenes with parallax or motion in a single registration, and that seam finding cannot compensate for this when the seam must pass through content-rich regions. Single registrations fundamentally fail to capture the background and foreground of a scene simultaneously. This is demonstrated in Fig. 1, where registering the background causes errors in the foreground and vice versa. Several papers [1, 2] have addressed this problem by creating a single registration that is designed to produce a high quality stitch. However, as we will show, these still fail in cases of large motion or parallax due to the limitations inherent to single registrations. We instead propose an end-to-end approach where multiple candidate registrations are presented to the seam finding phase as alternate source images. The seam finding stage is then free to choose different registrations for different regions of the composite output image. Note that as any registration can serve as a candidate under our scheme, it represents a generalization of methods that attempt to find a single good registration for stitching.

Unfortunately, the classical seam finding approach [5] does not naturally work when given multiple registrations. First, traditional seam finding treats each pixel from the warped image equally. However, by the nature of our multiple registration algorithm, each registration only provides a good alignment for a particular region in the image. Therefore, we need to consider this pixel-level alignment quality in the seam finding phase. Second, seam finding is performed locally by setting up an MRF that tries to place seams where they are not visually obvious. Figure 1 illustrates a common failure; the best seam can cause objects to be duplicated. This issue is made worse by the use of multiple registrations. In traditional image stitching, pixels come from one of two images, so in the worst case scenario, an object is repeated twice. However, if we use \(N\) registrations, an object can be repeated as many as \(N+1\) times.

We address this issue by adding several additional terms to the MRF that penalize common stitching errors and encourage image accuracy. Our confidence term encourages pixels to select their value from registrations which align nearby pixels, our duplication term penalizes label sets which select the same object in different locations from different input images, and finally our tear term penalizes breaking coherent regions. While our terms are designed to handle the challenges of multiple registrations, they also provide improvements to the classical single-registration framework.

Our work can be interpreted as a layer-based approach to image stitching, where each registration is treated as a layer and the seam finding stage simultaneously solves for layer assignment and image stitching [3]. Under this view, this paper represents a modest step towards explicit scene modeling in image stitching.

1.1 Motivating Examples

Figure 2 demonstrates the power of multiple registrations. The plant, the floor and the wall each undergo very distinctive motions. Our technique captures all three motions. Another challenging example is shown in Fig. 3. Photoshop computes a single registration to align the background buildings, which duplicates the traffic cones and the third car from the left. Our technique handles all these objects at different depths correctly.

1.2 Problem Formulation and Our Approach

We adopt the common formulation of image stitching, sometimes called perspective stitching [12] or a flat panorama [6, Sect. 6.1], which takes one image \(I_0\) as the reference, warps another candidate image \(I_1\) into the reference coordinate system, and adds its content to \(I_0\).

Instead of proposing a single warped \(\omega (I_1)\) and sending it to the seam finding phase, we propose a set of warpings \(\omega _1(I_1), \ldots , \omega _N(I_1)\), where each \(\omega _i(I_1)\) aligns a region in \(I_1\) with \(I_0\). We will detail our approach for multiple registrations in Sect. 3.1. Then we will formalize a multi-label MRF problem for seam finding. We have label set \(\mathcal {L}= \{0, 1,\ldots , N\}\), such that label \(x_p = 0\) indicates pixel p in the final stitched result will take its color value from \(I_0\), and from \(\omega _{x_p}(I_1)\) when \(x_p > 0\). We obtain the optimal seam by minimizing an energy function E(x) with new terms that address the challenges introduced above. We will describe our seam finding energy E(x) in Sect. 3.2. Finally, we adopt Poisson blending [9] to smooth transitions over stitching boundaries when generating the final result.

Fig. 2.

Motivating example for multiple registrations. Even the sophisticated single registration approach of NIS [10] gives severe ghosting.

Fig. 3.

Motivating example for multiple registrations. State of the art commercial packages like Adobe Photoshop [11] duplicate the traffic cones and other objects.

2 Related Work

The presence of visible seams due to parallax and other effects is a long-standing problem in image stitching. Traditionally there have been two avenues for eliminating or reducing these artifacts: improving registrations by allowing more degrees of freedom, or hiding misalignments by selecting better seams. Our algorithm can be seen as employing both of these strategies: the use of multiple registrations allows us to better tailor each registration to a particular region of the panorama, while our new energy terms improve the quality of the final seams.

2.1 Registration

Most previous works take a homography as a starting point and perform additional warping to correct any remaining misalignment. [13] describes a process in which each feature is shifted toward the average location of its matches in other images. The APAP algorithm divides images into grid cells and estimates a separate homography for each cell, with regularization toward a global homography [14].

Instead of solving registration and seam finding independently, another line of work explicitly takes into account the fact that the eventual goal of the registration step is to produce images that can be easily stitched together. ANAP, for instance, can be improved by limiting perspective distortions in regions without overlap and otherwise regularizing to produce more natural-looking mosaics [15]. Another approach is to confine the warping to a minimal region of the input images that is nevertheless large enough for seam selection and blending, which allows the algorithm to handle larger amounts of parallax [2]. Going a step further it is possible to interleave the registration and seam finding phases, as in the SEAGULL system [1]. In this case, the mesh-based warp can be modified to optimize feature correspondences that lie close to the current seam.

2.2 Seam Finding and Other Combination Techniques

The seam finding phase requires determining, for each pixel, which of the two source images contributes its color. [5] observed that this problem can be naturally formulated as a Markov Random Field and solved via graph cuts. This approach tends to give strong results, and the graph cuts method in particular often produces energies that are within a few percent of the global minimum [16]. Further work in this area has focused on taking into account the presence of edges or color gradients in the energy function in order to avoid visible discontinuities [17].

An alternative to seam finding is the use of a multi-band blending [18] phase immediately after registration [8]. This step blends low frequencies over a large spatial range and high frequencies over a short range to minimize artifacts.

2.3 Comparison to Our Technique

Our work clearly generalizes the line of work that optimizes a single registration, as this arises as a special case when only one candidate warp is used. More usefully, existing registration methods can serve as candidate generators in our technique. A single registration algorithm can propose multiple candidates when run with different parameters, or in the case of a randomized algorithm, such as RANSAC, run several times.

Similarly, our algorithm can be viewed as implicitly defining a single registration, given at each pixel by the warp \(\omega _i\) associated with the candidate registration from which the pixel was drawn in the final output. In theory, this piecewise defined warp is sufficient to obtain the results reported here, but in practice, finding it is difficult. Previous work along these lines has focused on iterative schemes in order to compute the varying warps that are required in different regions of the image [10, 15], but this is in general a very computationally challenging problem and the warping techniques used may not be sufficient to produce good final results. Our technique allows multiple simple registrations to be used instead.

3 Our Multiple Registration Approach

We use a classic three stage image stitching pipeline, composed of registration, seam finding, and blending phases [6, 7].

In the registration phase, we propose multiple registrations, each of which attempts to register some part of one of the images with the other. In contrast to previous methods, which only pass a single proposed registration to the seam finding stage, our approach allows all of these proposed registrations to be used. Note that in this phase it is important that the set of registrations we propose be diverse.

In the seam finding stage, we solve an MRF inference problem to find the best way to stitch together the various proposals. We observed that naively using the traditional MRF energy to stitch multiple registrations generates poor results, for the reasons mentioned in Sect. 1. To address these challenges, we propose an improved MRF energy that adds (1) a new data term describing our confidence among the different warping proposals at each pixel p and (2) several new smoothness terms which attempt to prevent duplication or tearing. Although this new energy is proposed primarily for the stitching problem with multiple registrations, it addresses problems observed in the traditional (single registration) approach as well and provides marked improvements in final panorama quality in either framework.

Finally, we adopt Poisson blending [9] to smooth transitions over stitching boundaries when generating the final result.

3.1 Generating Multiple Registrations

There are two common categories of registration methods [7]: global transformations, implied by a single motion model over the whole image, such as a homography; and spatially-varying transformations, implicitly described by a non-uniform mesh. The candidate registrations we produce are spatially-varying non-rigid transformations. Similar to [2], we first obtain a homography that matches some part of the image well and then refine its mesh representation.

We have a 3 step process: homography finding, filtering, and refinement. In the homography finding step, we generate candidate homographies by running RANSAC on the set of sparse correspondences between features obtained from the two input images. We ensure that the set of homographies is diverse by a filtering step, which removes poor quality homographies and duplicates. In the refinement step, we solve a quadratic program (QP) to obtain an improved local warping mesh for each of the homographies that pass the filtering step.

Homography Finding Step. Given two input images \(I_0\) and \(I_1\), we first compute a set of sparse correspondences \(C=\{ (p^0_1, p^1_1), \ldots , (p^0_n,p^1_n) \}\), where each \(p^0_i \in I_0\), \(p^1_i \in I_1\) and \((p^0_i,p^1_i)\) is a pair of matched pixels. We run \(\tau _H\) iterations of a modified RANSAC algorithm to generate a set of potential homographies \(\mathcal {H}\). In each iteration t, we randomly choose a pixel p and consider correspondences within a distance \(r_H\); if there are enough nearby correspondences to allow us to estimate a homography \(H_t\) we add this to our set of candidates. The homography \(H_t\) is estimated using least median of squares as implemented in OpenCV [19].
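To make the candidate-generation loop concrete, here is a minimal NumPy sketch. It substitutes a plain DLT fit for the least-median-of-squares estimator the paper uses via OpenCV, and the parameter names (`n_iters` standing in for \(\tau _H\), `r_seed` for \(r_H\), `min_pts`) and their default values are illustrative assumptions:

```python
import numpy as np

def fit_homography_dlt(src, dst):
    """Direct linear transform: least-squares H with dst ~ H(src).
    (The paper estimates H with least-median-of-squares via OpenCV;
    plain DLT is used here to keep the sketch dependency-free.)"""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def local_homography_candidates(src, dst, n_iters=100, r_seed=50.0,
                                min_pts=6, rng=None):
    """Seeded local candidate loop: pick a random correspondence,
    gather the correspondences within r_seed of it, and fit a
    homography to that neighbourhood if it is large enough."""
    rng = np.random.default_rng(rng)
    src = np.asarray(src, float)
    dst = np.asarray(dst, float)
    candidates = []
    for _ in range(n_iters):
        seed = src[rng.integers(len(src))]
        near = np.linalg.norm(src - seed, axis=1) <= r_seed
        if near.sum() >= min_pts:
            candidates.append(fit_homography_dlt(src[near], dst[near]))
    return candidates
```

With exact correspondences every neighbourhood recovers the same global homography; with parallax or motion, different seeds yield the distinct local alignments that the later stages consume.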

Filtering Step. In order to simplify the seam finding step, it is desirable to limit the number of candidate homographies. We employ two strategies to achieve this: screening, which removes homographies from consideration as soon as they are found, and deduplication, which runs on the full set of homographies that remain after screening.

The screening procedure eliminates two kinds of homographies: those that are unlikely to give rise to realistic images, and those that are too close to the identity transformation to be useful in the final result. Homographies of the first type are eliminated by considering two properties: (1) whether the difference between the homography and a similarity motion obtained from the same set of seed points exceeds a fixed threshold [2, Sect. 3.2.1], and (2) whether the magnitude of the scaling parameters of the homography exceeds a (different) fixed threshold. The intuition is that real world perspective changes are often close to similarities, and stitchable images are likely to be close to each other in scale. Homographies that are too close to the identity are eliminated by checking whether the overlap between the area covered by the original image and the area covered by the transformed image exceeds 95%. Finally, we reject homographies where either diagonal of the transformed image is shorter than half the length of the diagonal of the original image.
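The screening tests might be sketched as follows. All thresholds here are illustrative stand-ins (the paper does not state its exact values), and the near-identity test approximates image footprints with axis-aligned bounding boxes rather than exact polygon intersection:

```python
import numpy as np

def screen_homography(H, w, h, max_scale=4.0, max_iou=0.95,
                      min_diag_ratio=0.5):
    """Screening sketch: reject homographies with extreme scaling,
    near-identity overlap, or collapsed diagonals."""
    corners = np.array([[0, 0], [w, 0], [w, h], [0, h]], float)
    ch = np.hstack([corners, np.ones((4, 1))]) @ H.T
    warped = ch[:, :2] / ch[:, 2:3]

    # 1. unrealistic scaling: singular values of the linear 2x2 part
    s = np.linalg.svd(H[:2, :2], compute_uv=False)
    if s.max() > max_scale or s.min() < 1.0 / max_scale:
        return False

    # 2. too close to identity: bounding-box IoU with the original frame
    lo = np.maximum(warped.min(axis=0), [0.0, 0.0])
    hi = np.minimum(warped.max(axis=0), [float(w), float(h)])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    wbox = np.prod(warped.max(axis=0) - warped.min(axis=0))
    if inter / (w * h + wbox - inter) > max_iou:
        return False

    # 3. collapsed geometry: both warped diagonals must keep at least
    #    half the original diagonal length
    orig = np.hypot(w, h)
    d1 = np.linalg.norm(warped[2] - warped[0])
    d2 = np.linalg.norm(warped[3] - warped[1])
    return bool(min(d1, d2) >= min_diag_ratio * orig)
```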

To determine the set of homographies that are near-duplicates of each other and of which all but one can therefore be safely discarded, we compute a set of inlier correspondences \(D_t\) for each \(H_t\) that passes screening. \(D_t\) is constructed iteratively, starting with all correspondences \((p^0_i, p^1_i) \in C_t'\), where \(C_t'\) is the subset of seed points that were chosen in iteration t for which the reprojection error is below a threshold \(T_H\). Correspondences containing points that lie within a distance \(r_D\) of some point already in \(D_t\) are then added until a fixpoint is reached. This step is a generalization of the strategy introduced in [2, Sect. 3.2.1].

Given the sets \(D_t\) computed for each \(H_t\), we define a similarity measure between homographies \(\text {sim}(H_a, H_b) = \cos (V_a, V_b)\), where \(\cos \) denotes cosine similarity and \(V_a\) is the 0-1 indicator vector for \(D_a\). Homographies are then considered in descending order of \(|D_t|\) and added to the set \(\mathcal {H}\) if their similarity to all the elements that have already been added to the set is below a threshold \(\theta _H\). We also enforce an upper limit \(N_H\) on the number of homographies considered, terminating the procedure early when this limit is reached.
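A hedged sketch of the greedy de-duplication: each surviving homography is represented by its inlier index set \(D_t\), and candidates are kept in descending order of \(|D_t|\) unless their indicator vectors are too similar to one already kept. The values of `theta` and `max_keep` (standing in for \(\theta _H\) and \(N_H\)) are assumptions:

```python
import numpy as np

def dedup_homographies(inlier_sets, n_corr, theta=0.5, max_keep=8):
    """Greedy de-duplication: inlier_sets[t] holds the indices (into the
    full correspondence list of size n_corr) of the inliers of H_t.
    Returns the indices of the homographies that are kept."""
    def indicator(D):
        v = np.zeros(n_corr)
        v[list(D)] = 1.0
        return v

    # consider homographies in descending order of inlier count
    order = sorted(range(len(inlier_sets)),
                   key=lambda t: -len(inlier_sets[t]))
    kept, kept_vecs = [], []
    for t in order:
        v = indicator(inlier_sets[t])
        nv = np.linalg.norm(v)
        if nv == 0:
            continue
        # keep only if dissimilar (cosine similarity below theta) to
        # everything already kept
        if all(np.dot(v, u) / (nv * np.linalg.norm(u)) < theta
               for u in kept_vecs):
            kept.append(t)
            kept_vecs.append(v)
        if len(kept) == max_keep:
            break
    return kept
```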

Refinement Step. Our final step is motivated by the observation that our process sometimes produces homographies that cause reprojection errors of several pixels. This may occur even for large planar objects, such as the side of a building, which should be fit exactly by a homography. We make a final adjustment to our homography, then add spatial variation.

To adjust the homography, we define an objective function \( f(H) = \sum _{c_i \in C} S(e_{c_i;H}), \) where \(e_{c_i;H}\) is the reprojection error of correspondence \(c_i\) under H, and S is a smoothing function \( S(t) = 1 - \frac{1}{1+\exp (-(T_H-t))}. \) To generate a refined homography \(\hat{H}_i\) from an input \(H_i\), we minimize f using Ceres [20], initializing with \(H_i\). The resulting \(\hat{H}_i\) is a better-fitting homography that is in some sense near \(H_i\). The smoothing function S is designed to provide gradient in the right direction for correspondences that are close to being inliers while ignoring those that are outliers either because they are incorrect matches or because they are better explained by some other homography.
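The refinement objective can be written down directly. The paper minimizes f with Ceres; the sketch below only evaluates f and S (with an assumed threshold \(T_H = 3\) pixels), which is enough to see that near-inliers contribute useful gradient while gross outliers saturate:

```python
import numpy as np

def smooth_inlier_penalty(t, T_H=3.0):
    """S(t) = 1 - 1/(1 + exp(-(T_H - t))): a smoothed step that is near
    0 for reprojection errors below T_H and near 1 above it."""
    return 1.0 - 1.0 / (1.0 + np.exp(-(T_H - np.asarray(t, float))))

def refinement_objective(H, src, dst):
    """f(H) = sum_i S(e_i): total smoothed penalty of the reprojection
    errors of all correspondences under H."""
    ph = np.hstack([np.asarray(src, float),
                    np.ones((len(src), 1))]) @ H.T
    proj = ph[:, :2] / ph[:, 2:3]
    errs = np.linalg.norm(proj - np.asarray(dst, float), axis=1)
    return float(smooth_inlier_penalty(errs).sum())
```

A homography that fits the correspondences exactly scores close to zero, while one that misses them all scores close to the number of correspondences, so a descent method initialized at \(H_i\) is pulled toward nearby near-inliers only.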

The homographies \(\hat{H}_i \in \mathcal {H}\) often do an imperfect job of aligning \(I_0\) and \(I_1\) in regions that are only mostly flat. In order to address this, we compute a finer-grained non-rigid registration \(\omega _i\) for each \(\hat{H}_i\) using a content-preserving warp (CPW) technique that is better able to capture the transformation between the two images [21]. We start from a uniform grid mesh \(M_i\) drawn over \(\hat{H}_i(I_1)\), and use CPW to obtain a new mesh \(\hat{M}_i\) that captures fine-grained local variations between \(I_0\) and \(\hat{H}_i(I_1)\).

Finally, we denote by \(\omega _i(I_1)\) the warped candidate image \(I_1\) with \(\hat{M}_i\) applied.

3.2 Improved MRF Energy for Seam Finding

The final output of the registration stage is a set of proposed warps \(\{\omega _i(I_1) \}, (i = 1, 2, \ldots , N)\). For notational simplicity, we write \(\{ I^S_i\}\) where \(I^S_0 = I_0\), \(I^S_i = \omega _i(I_1)\) are the source images in the seam finding stage. These images are used to set up a Markov Random Field (MRF) inference problem, to decide how to combine regions of the different images in order to obtain the final stitched image. The label set for this MRF problem is given by \(\mathcal {L}= \{0, 1, \ldots , N\}\), and its purpose is to assign a label \(x_p \in \mathcal {L}\) to each pixel p in the stitching space, which indicates that the value of that pixel is copied from \(I^S_{x_p}\).

It would be natural to expect that we can use the standard MRF stitching energy function \(E^{\text {old}}(x) = \sum _{p} E^{\text {old}}_m(x_p) + \sum _{p, q \in \mathcal {N}} E^{\text {old}}_s(x_p, x_q)\) introduced by [5] (where \(\mathcal {N}\) is the set of 4-adjacent pixel pairs). However, we observed that this energy function is not suitable for the case of multiple registrations.

In this formulation, the data term \(E^{\text {old}}_m(x_p) = 0\) when pixel p has a valid color value in \(I^S_{x_p}\), and \(\lambda _m\) otherwise. This means we will impose a penalty \(\lambda _m\) for out-of-mask pixels but treat all the inside-mask pixels equally (they all have cost 0). However, we found that even state-of-the-art single-registration algorithms [1, 2], cannot align every single pixel. In contrast, our multiple registrations are designed to only capture a single region with each warp. We propose a new mask data term for multiple registrations and a warp data term to address this problem.

The traditional smoothness term is \(E^{\text {old}}_s(x_p, x_q) = \lambda _s (\Vert I^S_{x_p}(p) - I^S_{x_q}(p)\Vert + \Vert I^S_{x_p}(q) - I^S_{x_q}(q)\Vert )\) when \(x_p\ne x_q\), and 0 otherwise. It only enforces local similarity across the stitching seam to make it less visible, without any other global constraints. Note that there are a number of nice extensions to this basic idea that improve the smoothness term; for example [6, p. 62] describes several ways to pick better seams and avoid tearing. However, we may still duplicate content in the stitching result with a single registration due to parallax or motion. This problem can be more serious with multiple registrations since we may duplicate content \(N+1\) times instead of just twice. Therefore, we propose a new pairwise term to explicitly penalize duplications.
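For reference, the classical pairwise seam term has a direct implementation; in this sketch `imgs[k][y, x]` is an RGB vector and \(\lambda _s\) is left as a free parameter:

```python
import numpy as np

def seam_cost(imgs, xp, xq, p, q, lam_s=1.0):
    """E_s^old for one 4-adjacent pixel pair (p, q): zero when both
    pixels draw from the same source image, otherwise the colour
    difference between the two sources evaluated at both locations."""
    if xp == xq:
        return 0.0
    def color_gap(pt):
        y, x = pt
        return np.linalg.norm(np.asarray(imgs[xp][y, x], float)
                              - np.asarray(imgs[xq][y, x], float))
    return lam_s * (color_gap(p) + color_gap(q))
```

Because the cost is zero wherever the sources agree, the optimal seam is pushed into regions where the registrations match, which is exactly why it cannot, on its own, rule out duplicated content.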

In sum, we compute the optimal seam by minimizing the energy function \(E(x) = \sum _p E_m(x_p) + \sum _p E_w(x_p) + \sum _{p, q \in \mathcal {N}} E_s(x_p, x_q) + E_d(x)\) using expansion moves [22]. We now describe our mask data term \(E_m\), warp data term \(E_w\), smoothness term \(E_s\) and duplication term \(E_d\) in turn.

Mask Data Term for Multiple Registrations. There is an immediate issue with the standard mask-based data term in the presence of multiple registrations. When one input is significantly larger than the others, the MRF will choose this warping for pixels where its mask is 1 and the other warping masks are 0. Worse, since the MRF itself imposes spatial coherence, this choice of input will be propagated to other parts of the image.

We handle this situation conservatively, by imposing a mask penalty \(\lambda _m\) on pixels that are not in the intersection of all the candidate warpings \(\bigcap _i \omega _i(I_1)\) when assigning them to a candidate image (i.e., \(x_p \ne 0\)). Pixels that lie inside the reference image (\(x_p = 0\)) are handled normally, in that they have no mask penalty with the reference image mask and \(\lambda _m\) mask penalty out of the mask. Note that this mask penalty is a soft constraint: pixels outside of the intersection \(\bigcap _i \omega _i(I_1)\) can be assigned an intensity from a candidate image, if it is promising enough by our other criteria.

Formally we can write our mask data term as

$$\begin{aligned} E_m(x_p) = {\left\{ \begin{array}{ll} \lambda _m \left( 1 - \mathsf {mask}_0(p) \right) , &{} x_p = 0, \\ \lambda _m \left( 1-\mathop {\prod }\nolimits _{i=1}^N\mathsf {mask}_i(p) \right) , &{} x_p \ne 0, \end{array}\right. } \end{aligned}$$
(1)

where \(\mathsf {mask}_i(p) = 1\) indicates \(I^S_i\) has a valid pixel at p, \(\mathsf {mask}_i(p) = 0\) otherwise.
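Equation (1) translates almost literally into code. In this sketch `masks` stacks the validity masks of all \(N+1\) sources (index 0 is the reference), the returned array gives \(E_m\) for every label at every pixel, and the value of \(\lambda _m\) is an assumption:

```python
import numpy as np

def mask_data_term(masks, lam_m=1000.0):
    """E_m of Eq. (1): masks[i] is the 0/1 validity mask of source
    I^S_i, with masks[0] the reference.  Label 0 is penalised outside
    the reference mask; every candidate label is penalised outside the
    intersection of all candidate masks (a soft constraint)."""
    masks = np.asarray(masks, float)
    inter = np.prod(masks[1:], axis=0)      # intersection of candidates
    E = np.empty_like(masks)
    E[0] = lam_m * (1.0 - masks[0])
    E[1:] = lam_m * (1.0 - inter)           # same for every candidate label
    return E
```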

Warp Data Term. In the presence of multiple registrations, we need a data term that makes significant distinctions among different proposed warps. There are two natural ways to determine whether a particular warp \(\omega \) is a good choice at the pixel p. First, we can determine how confident we are that \(\omega \) actually represents the motion of the scene at p. Second, for pixels in the reference image, we can check intensity/color similarity between \(I_0(p)\) and \(\omega (I_1)(p)\).

Since our warp is computed using features and RANSAC, we can identify inlier feature points in \(\omega _i(I_1)\) when the reprojection error is smaller than a parameter \(T_H\). Denoting these inliers as \(\mathcal {I}_i\), we place a Gaussian weight G(.) on each inlier, and define motion quality for pixel p in \(I^S_i\) as \(Q^i_m(p) = \sum _{q \in \mathcal {I}_i} G(\Vert p - q\Vert )\). This makes pixels closer to inliers have greater confidence in the warp.

For color similarity we use the \(L_2\) distance between the local patch around pixel p in the reference \(I^S_0\) and our warped image \(I^S_i\): \(Q^i_c(p) = \sum _{q \in \mathcal {B}_r(p)} \Vert I^S_0(q) - I^S_i(q) \Vert \), where \(\mathcal {B}_r(p)\) is the set of pixels within distance r of pixel p. Pixels with better image content alignment thus give us more confidence in the warp.

Putting them together, we have \(e^i_w(p) = -Q^i_m(p) + \lambda _c Q^i_c(p)\) to be our quality score for pixel p for warp \(\omega _i\) (lower means better, since we want to minimize the energy). Then we have a normalized score \(\hat{e}^i_w(p) \in [-1, 1]\) per warped image, and define the warp data term as: \(E_w(x_p) = \lambda _w \hat{e}^{x_p}_w(p)\) when \(x_p \ne 0\), and \(E_w(x_p) = 0\) otherwise.
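A sketch of the warp data term for one candidate follows. The Gaussian bandwidth, patch radius, and weights \(\lambda _c, \lambda _w\) are illustrative; the patch sum uses wrap-around (`np.roll`) at image borders for brevity; and normalization divides by the maximum absolute score to land in \([-1, 1]\):

```python
import numpy as np

def warp_data_term(ref, warped, inliers, sigma=10.0, r=2,
                   lam_c=0.1, lam_w=1.0):
    """Per-pixel warp cost for one candidate: motion quality Q_m from
    Gaussian-weighted distance to inlier features, colour quality Q_c
    from summed patch differences; e_w = -Q_m + lam_c * Q_c, then
    normalised so lower (more negative) means better."""
    h, w = ref.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]

    # motion quality: sum of Gaussians centred on inlier features
    Qm = np.zeros((h, w))
    for (iy, ix) in inliers:
        d2 = (ys - iy) ** 2 + (xs - ix) ** 2
        Qm += np.exp(-d2 / (2.0 * sigma ** 2))

    # colour quality: box sum of per-pixel colour differences
    diff = np.linalg.norm(np.asarray(ref, float)
                          - np.asarray(warped, float), axis=-1)
    Qc = np.zeros((h, w))
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            Qc += np.roll(np.roll(diff, dy, 0), dx, 1)

    e = -Qm + lam_c * Qc
    m = np.abs(e).max()
    return lam_w * (e / m if m > 0 else e)
```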

Smoothness Terms. We adopt some standard smoothness terms used in state-of-the-art MRF stitching. Following [6, 7] these terms include:

1. The color-based seam penalty (introduced in [5, 17]) for local patches to encourage seams that introduce invisible transitions between source images,

2. The edge-based seam penalty introduced in [17] to discourage the seam from cutting through edges, hence reducing the “tearing” artifacts where only part of an object appears in the stitched result,

3. A Potts term to encourage local label consistency.

Fig. 4.

Illustration of the duplication term. Figure (c) provides a bad stitching result with the green triangles duplicated. The feature point correspondence between pixel p and q suggests duplication, and we introduce a term which penalizes this scenario. (Color figure online)

Duplication Avoidance Term. For stitching tasks with large parallax or motion, it is easy to duplicate scene content in the stitching result. We address this issue by explicitly formalizing a duplication avoidance term in our energy. If pixel p from the reference image \(I^S_0\) and q from the candidate image \(I^S_i\) form a true correspondence, then they refer to the same point (i.e., scene element) in the real world. Therefore, we penalize a labeling that contains both of them (i.e., \(x_p = 0, x_q = i\)), as shown in Fig. 4. Since our correspondence is sparse, we also apply this idea to the local region within a radius r of pixels p and q. We reweight the penalty by a Gaussian G since the farther away we are from these corresponding pixels, the more uncertain the correspondence.

Formally, our duplication term \(E_d\) is defined as

$$\begin{aligned} E_d(x) = \lambda _d \sum _{i = 1}^N \sum _{(p, q) \in \mathcal {C}_i} \sum _{\delta \in \mathcal {B}_r} e_r(x_{p + \delta }, x_{q + \delta }; \delta , i) \end{aligned}$$
(2)

where \(\mathcal {C}_i\) is the set of pixel correspondences between \(I^S_0\) and \(I^S_i\), and \(\mathcal {B}_r = \{(dx, dy) \in \mathbb {Z}^2 \mid \Vert (dx, dy)\Vert \le r\}\) is a box of radius r. \(e_r(x_{p+\delta }, x_{q+\delta }; \delta , i) = G(\Vert \delta \Vert )\) when \(x_{p+\delta } = 0, x_{q+\delta } = i\), and 0 otherwise.
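Equation (2) can be evaluated with a direct (unoptimized) loop. In practice the term is folded into the MRF as pairwise potentials, but the sketch below simply scores a complete labeling; here `correspondences[i-1]` lists sparse \((p, q)\) pairs for candidate i, and the values of `sigma` and `r` are assumptions:

```python
import numpy as np

def duplication_term(labels, correspondences, r=2, sigma=2.0, lam_d=1.0):
    """E_d of Eq. (2): for each correspondence (p, q) between the
    reference (label 0) and candidate i, penalise labelings that keep
    both copies of the same scene point, with Gaussian falloff over a
    disc of radius r around the pair."""
    h, w = labels.shape
    total = 0.0
    for i, pairs in enumerate(correspondences, start=1):
        for (py, px), (qy, qx) in pairs:
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    if dy * dy + dx * dx > r * r:
                        continue
                    pp = (py + dy, px + dx)
                    qq = (qy + dy, qx + dx)
                    if not (0 <= pp[0] < h and 0 <= pp[1] < w
                            and 0 <= qq[0] < h and 0 <= qq[1] < w):
                        continue
                    # both copies of the same scene point are kept
                    if labels[pp] == 0 and labels[qq] == i:
                        total += np.exp(-(dy * dy + dx * dx)
                                        / (2.0 * sigma ** 2))
    return lam_d * total
```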

4 Experimental Results and Implementation Details

Experimental Setup. Our goal is to perform stitching on images whose degree of parallax and motion causes previous methods to fail. Ideally, there would be a standard dataset of images that are too difficult to stitch, along with an evaluation metric. Unfortunately this is not the case, in part due to the difficulty of defining ground truth for image stitching. We therefore had to rely on collecting challenging imagery ourselves, though we found one appropriate example (Fig. 5) whose stitching failures were widely shared on social media.

We implemented or obtained code for a number of alternative methods, as detailed below, and ran them on all of our examples, along with our technique using a single parameter setting. Since our images are so challenging, it was not uncommon for a competing method to return no output (“failing to stitch”). In the entire corpus of images we examined, we found numerous cases where competing techniques produced dramatic artifacts, while our algorithm had minimal if any errors. We have not found any example images where our technique produces dramatic artifacts and a competitor does not. However, we found a few less challenging images that are well handled by competitors but where we produce small artifacts. These examples, along with other data, images, and additional material omitted here, are available online; for reasons of space we focus here on images that provide useful insight. However, the images included here are representative of the performance we have observed on the entire corpus of challenging images we collected.

We follow the experimental setup of [2], who (very much like our work) describe a stitching approach that can handle images with too much parallax for previous techniques. The strongest overall competitor turns out to be Adobe Photoshop 2018’s stitcher Photomerge [11]. While the experimental results reported in [2] compare their algorithm with Photoshop 2014, the 2018 version is substantially better, and does an excellent job of stitching images with too many motions for any other competing method. Therefore, we take Photoshop’s failure on a dataset as a signal that the dataset is particularly challenging; in this section, we show several examples where we successfully stitch such datasets. In addition to Photoshop we downloaded and ran APAP [14], Autostitch [8], and NIS [10]. To produce stitching results from APAP we follow the approach of [2], who extended APAP with seam-finding. Results from all methods are shown in Figs. 9 and 10.

Implementation Details. For feature extraction and matching, we used DeepMatch [23]. The associated DeepFlow solver was used to generate flows for the optical flow-based warping. We used the Ceres solver [20] for the QP problems that arose when generating multiple registrations, as discussed in Sect. 3.1.

Visual Evaluation. Following [2] we review several images from our test set and highlight the strengths and weaknesses of our technique, as well as those of various methods from the literature. All of our results are shown for a single set of parameters.

We observed two classes of stitching errors: warping errors, where the algorithm fails to generate any candidate image that is well-aligned with the reference image; and stitching errors, where the MRF does not produce good output despite the presence of good candidate warps. An example of our technique making a warping error is shown in Fig. 9e, where no warp found by our algorithm continues the parking stall line, causing a visible seam. An example of a stitching error is given in Fig. 10e, where the remainder of the car’s wheel is available in the warp from which our mosaic draws the front wheel well. Errors may manifest as a number of different kinds of artifacts, such as: tearing (e.g., the arm in Fig. 5b); wrong perspective (e.g., the tan background building in Fig. 9b); duplication (e.g., the stop sign in Fig. 7b); ghosting (e.g., the bollards in Fig. 6b); or omission (e.g., the front door of the car in Fig. 10c) of scene content.

Quantitative Evaluation. The only quantitative metric used by previous stitching papers is seam quality (MRF energy). However, as we have shown, local seam quality is not indicative of stitch quality. Also, this technique requires the user to know the seam location, which precludes it from being run on black-box algorithms like Photoshop. Here we attempt to define a metric to address these problems.

We first observe that stitching can be viewed as a form of view synthesis with weaker assumptions regarding the camera placement or type. With this connection in mind, we redefine perspective stitching as extending the field of view of a reference image using the information in the candidate images. This redefinition naturally leads to an evaluation technique. We crop part of the reference image and then stitch the cropped image with the candidate image. This cropped region serves as a ground truth, which we can compare against the appropriate location in the stitch result. Note that in perspective stitching, the reference image’s size is not altered so we know the exact area where the cropped region should be. We then calculate MS-SSIM [24] or PSNR.
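The crop-and-compare protocol is easy to state in code. This sketch scores a black-box stitcher (any function returning a result in the reference coordinate frame) with PSNR; MS-SSIM could be substituted, and the 50-pixel crop width is only one possible choice:

```python
import numpy as np

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio between two images."""
    mse = np.mean((np.asarray(a, float) - np.asarray(b, float)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def crop_eval(reference, stitch_fn, crop_px=50):
    """Crop-and-compare evaluation: remove crop_px columns from the
    right edge of the reference, stitch, and score the recovered region
    against the held-out ground truth.  stitch_fn maps a (cropped)
    reference image to a stitched result in the reference frame."""
    h, w = reference.shape[:2]
    cropped = reference[:, : w - crop_px]
    result = stitch_fn(cropped)
    gt = reference[:, w - crop_px:]          # held-out ground truth
    recovered = result[:, w - crop_px: w]    # same location in the stitch
    return psnr(gt, recovered)
```

Because the metric only reads the final composite, it can be applied to closed-source systems such as Photoshop just as easily as to our own pipeline.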

Fig. 5.

“Ski” dataset. Photoshop tears the people and the fence. Our stitch has the fence stop abruptly but keeps the people in place. Note that the candidate provides no information that allows us to extend the fence.

Fig. 6.

“Bike Mural” dataset. Autostitch has ghosting on the car, bridge, and poles. Our algorithm shortens the truck and deletes a pole, but has no perceptible ghosting or tearing of the objects.

Fig. 7.

“Stop Sign” dataset. Photoshop duplicates the stop sign. Of all the implementations we tried, ours is the only visually plausible result, successfully avoiding duplicating the foreground.

Fig. 8.

“Graffiti-Building” dataset. APAP deletes significant amounts of red graffiti and introduces noticeable curvature. Our result does not produce tearing, ghosting, or duplication. (Color figure online)

Fig. 9.

“Parking lot” dataset. Autostitch fails to stitch. APAP duplicates the car’s hood, tears a background building, and introduces a corner in the roof of the trailer. Photoshop duplicates the front half of the car. NIS has substantial ghosting. Our result cuts out a part of a parking stall line, but avoids duplicating the car.

Fig. 10.

“Cars” dataset. Autostitch fails to stitch. APAP and Photoshop shorten the car. APAP also introduces substantial curvature into the background building. NIS has substantial ghosting and shortens the car. Our result deletes part of the hood and front wheel; however, it is the only result which produces an artifact-free car body.

Fig. 11.

An example of tearing and duplication produced by our method (“Cars” dataset).

We report this evaluation for two examples in Table 1: 50 pixels are cropped off the edge of the reference image in Stop Sign (the left side of the first image in Fig. 7) and in Graffiti Building (the right side of the first image in Fig. 8). The stitch results for the cropped images appear almost identical to the stitch results for the whole images. The best score is shown in bold. “Ground Truth” compares only the ground-truth region to the corresponding location in the output, while “Uncropped Reference” compares against the entire uncropped reference.

Table 1. Evaluation scores for different algorithms.

Note that for Stop Sign, all algorithms performed reasonably on the Ground Truth Region metric. However, both APAP and Photoshop include a duplicate of the stop sign, which lowers their scores for Uncropped Reference.

5 Conclusions, Limitations, and Future Work

We have demonstrated a novel formulation of the image stitching problem in which multiple candidate registrations are used. We have generalized MRF seam finding to this setting and proposed new terms to combat common artifacts such as object duplication. Our techniques outperform existing algorithms in scenarios with large parallax and motion.

Our methods naturally generalize to other stitching surfaces, such as cylinders or spheres, via modifications to the warping function. Three or more input images can be handled by proposing multiple registrations of each candidate image and letting the seam finder composite them. A potential problem is the presence of undetected sparse correspondences, which can lead to duplications or tears (Fig. 11). The use of dense correspondences may remedy this issue, but our preliminary experiments suggest that optical flow cannot easily capture motion in input images with large disparities and does not produce correspondences of sufficient quality. A second issue is that it is unclear how to populate regions of the output mosaic where only data from a single candidate image is present, as the constrained choice of candidate there may conflict with choices made in other regions of the mosaic. This can be handled to some extent with modifications to the data term, but compared to traditional methods, scene content may be lost. One example occurs in Fig. 10, where the front wheel of the car is omitted in the final output. These problems remain exciting challenges for future work.