1 Introduction

An important task in visual object recognition is to design algorithms that are robust to dataset bias [1]. Dataset bias arises when labeled training instances are available from a source domain and test instances are sampled from a related, but different, target domain. For example, consider a person identification application in unmanned aerial vehicles (UAV), which is essential for a variety of tasks, such as surveillance, people search, and remote monitoring [2]. One of the critical tasks is to identify people from a bird’s-eye view; however, collecting labeled data from that viewpoint can be very challenging. It is more desirable for a UAV to be trained on already available on-the-ground labeled images (source), e.g., photographs of people from social media, and then successfully applied to the actual UAV view (target). Traditional supervised learning algorithms typically perform poorly in this setting, since they assume that the training and test data are drawn from the same domain.

Domain adaptation attempts to deal with dataset bias using unlabeled data from the target domain so that the need for manually labeling the target data is reduced. Unlabeled target data provide auxiliary training information that should help algorithms generalize better on the target domain than using source data alone. Successful domain adaptation algorithms have large practical value, since acquiring a huge amount of labels from the target domain is often expensive or impossible. Although domain adaptation has gained increasing attention in object recognition, see [3] for a recent overview, the problem remains essentially unsolved since model accuracy has yet to reach a level that is satisfactory for real-world applications. Another issue is that many existing algorithms require optimization procedures that do not scale well as the size of datasets increases [4–10]. Earlier algorithms were typically designed for relatively small datasets, e.g., the Office dataset [11].

We consider a solution based on learning representations or features from raw data. Ideally, the learned features should model the label distribution as well as reduce the discrepancy between the source and target domains. We hypothesize that a possible way to approximate such features is to combine (supervised) learning of the source label distribution with (unsupervised) learning of the target data distribution. This is in the same spirit as multi-task learning, in that learning auxiliary tasks can help the main task be learned better [12, 13]. The goal of this paper is to develop an accurate, scalable multi-task feature learning algorithm in the context of domain adaptation.

Contribution: To achieve the goal stated above, we propose a new deep learning model for unsupervised domain adaptation. Deep learning algorithms are highly scalable since their training time scales linearly with the amount of data, they can handle streaming data, and they can be parallelized on GPUs. Indeed, deep learning has come to dominate object recognition in recent years [14, 15].

We propose Deep Reconstruction-Classification Network (\( DRCN \)), a convolutional network that jointly learns two tasks: (i) supervised source label prediction and (ii) unsupervised target data reconstruction. The encoding parameters of the \( DRCN \) are shared across both tasks, while the decoding parameters are separated. The aim is that the learned label prediction function can perform well on classifying images in the target domain – the data reconstruction can thus be viewed as an auxiliary task to support the adaptation of the label prediction. Learning in \( DRCN \) alternates between unsupervised and supervised training, which is different from the standard pretraining-finetuning strategy [16, 17].

From experiments over a variety of cross-domain object recognition tasks, \( DRCN \) performs better than the state-of-the-art domain adaptation algorithm [18], with an accuracy gap of up to \(\sim 8\,\%\). The \( DRCN \) learning strategy also provides a considerable improvement over the pretraining-finetuning strategy, indicating that it is more suitable for the unsupervised domain adaptation setting. We furthermore perform a visual analysis by reconstructing source images through the learned reconstruction function. We find that the reconstructed outputs resemble the appearance of the target images, suggesting that the encoding representations are successfully adapted. Finally, we present a probabilistic analysis that shows the relationship between the \( DRCN \) learning objective and a semi-supervised learning framework [19], as well as the soundness of using only data from the target domain for the data reconstruction training.

2 Related Work

Domain adaptation is a large field of research, with related work appearing under several names such as class imbalance [20], covariate shift [21], and sample selection bias [22]. In [23], it is considered a special case of transfer learning. Earlier work on domain adaptation focused on text document analysis and NLP [24, 25]. In recent years, it has gained a lot of attention in the computer vision community, mainly for object recognition applications, see [3] and references therein. The domain adaptation problem is often referred to as dataset bias in computer vision [1].

This paper is concerned with unsupervised domain adaptation, in which labeled data from the target domain are not available [26]. A range of approaches along this line of research in object recognition have been proposed [4, 5, 9, 27–30], most of which were designed specifically for small datasets such as the Office dataset [11]. Furthermore, they usually operated on SURF-based features [31] extracted from the raw pixels. In essence, the unsupervised domain adaptation problem remains open and needs more powerful solutions that are useful for practical situations.

Deep learning now plays a major role in the advancement of domain adaptation. An early attempt addressed large-scale sentiment classification [32], where the concatenated features from the fully connected layers of stacked denoising autoencoders were found to be domain-adaptive [33]. In visual recognition, a fully connected, shallow network pretrained with denoising autoencoders has shown a certain level of effectiveness [34]. It is widely known that deep convolutional networks (ConvNets) [35] are a more natural choice for visual recognition tasks and have achieved significant successes [14, 15, 36]. More recently, ConvNets pretrained on a large-scale dataset, ImageNet, have been shown to be reasonably effective for domain adaptation [14]. They provide significantly better performance than SURF-based features on the Office dataset [37, 38]. An earlier approach using a convolutional architecture without pretraining on ImageNet, DLID, has also been explored [39] and performs better than SURF-based features.

To further improve the domain adaptation performance, the pretrained ConvNets can be fine-tuned under a particular constraint related to minimizing a domain discrepancy measure [18, 4042]. Deep Domain Confusion (DDC) [41] utilizes the maximum mean discrepancy (MMD) measure [43] as an additional loss function for the fine-tuning to adapt the last fully connected layer. Deep Adaptation Network (DAN) [40] fine-tunes not only the last fully connected layer, but also some convolutional and fully connected layers underneath, and outperforms DDC. Recently, the deep model proposed in [42] extends the idea of DDC by adding a criterion to guarantee the class alignment between different domains. However, it is limited only to the semi-supervised adaptation setting, where a small number of target labels can be acquired.

The algorithm proposed in [18], which we refer to as ReverseGrad, treats domain invariance as a binary classification problem. It thus optimizes two contradictory objectives: (i) minimizing the label prediction loss and (ii) maximizing the domain classification loss, via a simple gradient reversal strategy. ReverseGrad can be effectively applied to both pretrained and randomly initialized deep networks. The randomly initialized model has also been shown to perform well on cross-domain recognition tasks other than the Office benchmark, i.e., large-scale handwritten digit recognition tasks. Our work is in a similar spirit to ReverseGrad in that it does not necessarily require pretrained deep networks to perform well on some tasks. However, our proposed method undertakes a fundamentally different learning algorithm: finding a good label classifier while simultaneously learning the structure of the target images.

3 Deep Reconstruction-Classification Networks

This section describes our proposed deep learning algorithm for unsupervised domain adaptation, which we refer to as Deep Reconstruction-Classification Networks (\( DRCN \)). We first briefly discuss the unsupervised domain adaptation problem. We then present the DRCN architecture, learning algorithm, and other useful aspects.

Let us define a domain as a probability distribution \({\mathbb D}_{XY}\) (or just \({\mathbb D}\)) on \({\mathcal X}\times {\mathcal Y}\), where \({\mathcal X}\) is the input space and \({\mathcal Y}\) is the output space. Denote the source domain by \({\mathbb P}\) and the target domain by \({\mathbb Q}\), where \({\mathbb P}\ne {\mathbb Q}\). The aim in unsupervised domain adaptation is as follows: given a labeled i.i.d. sample from a source domain \(S^s = \{(x^s_i, y^s_i) \}_{i=1}^{n_s} \sim {\mathbb P}\) and an unlabeled sample from a target domain \(S^t_u = \{(x^t_i) \}_{i=1}^{n_t} \sim {\mathbb Q}_X\), find a good labeling function \(f : {\mathcal X}\rightarrow {\mathcal Y}\) on \(S^t_u\). We consider a feature learning approach: finding a function \(g: {\mathcal X}\rightarrow {\mathcal F}\) such that the discrepancy between distribution \({\mathbb P}\) and \({\mathbb Q}\) is minimized in \({\mathcal F}\).

Ideally, a discriminative representation should model both the label and the structure of the data. Based on that intuition, we hypothesize that a domain-adaptive representation should satisfy two criteria: (i) classify the source domain labeled data well and (ii) reconstruct the target domain unlabeled data well, which can be viewed as an approximation of the ideal discriminative representation. Our model is based on a convolutional architecture that has two pipelines with a shared encoding representation. The first pipeline is a standard convolutional network for source label prediction [35], while the second one is a convolutional autoencoder for target data reconstruction [44, 45]. Convolutional architectures are a natural choice for object recognition to capture the spatial correlation of images. The model is optimized through multitask learning [12], that is, it jointly learns the (supervised) source label prediction and the (unsupervised) target data reconstruction tasks. The aim is that the shared encoding representation should learn the commonality between those tasks that provides useful information for cross-domain object recognition. Figure 1 illustrates the architecture of \( DRCN \).

Fig. 1. Illustration of the \( DRCN \) architecture. It consists of two pipelines: (i) label prediction and (ii) data reconstruction. The parameters shared between the two pipelines are indicated in red. (Color figure online)

We now describe \( DRCN \) more formally. Let \(f_c: {\mathcal X}\rightarrow {\mathcal Y}\) be the (supervised) label prediction pipeline and \(f_r: {\mathcal X}\rightarrow {\mathcal X}\) be the (unsupervised) data reconstruction pipeline of \( DRCN \). Define three additional functions: (1) an encoder/feature mapping \(g_{\mathrm {enc}} : {\mathcal X}\rightarrow {\mathcal F}\), (2) a decoder \(g_{\mathrm {dec}} : {\mathcal F}\rightarrow {\mathcal X}\), and (3) a feature labeling \(g_{\mathrm {lab}}: {\mathcal F}\rightarrow {\mathcal Y}\). For m-class classification problems, the output of \(g_{\mathrm {lab}}\) usually forms an m-dimensional vector of real values in [0, 1] that sums to 1, i.e., a softmax output. Given an input \(x \in {\mathcal X}\), one can decompose \(f_c\) and \(f_r\) such that

$$\begin{aligned} f_c(x) = (g_{\mathrm {lab}} \circ g_{\mathrm {enc}})(x) , \end{aligned}$$
(1)
$$\begin{aligned} f_r(x) = (g_{\mathrm {dec}} \circ g_{\mathrm {enc}})(x) . \end{aligned}$$
(2)

Let \(\varTheta _c= \{\varTheta _{\mathrm {enc}}, \varTheta _{\mathrm {lab}} \}\) and \(\varTheta _r= \{\varTheta _{\mathrm {enc}}, \varTheta _{\mathrm {dec}} \}\) denote the parameters of the supervised and unsupervised models, respectively. \(\varTheta _{\mathrm {enc}}\) are the shared parameters of the feature mapping \(g_{\mathrm {enc}}\). Note that \(\varTheta _{\mathrm {enc}}, \varTheta _{\mathrm {dec}}, \varTheta _{\mathrm {lab}}\) may each encode the parameters of multiple layers. The goal is to learn a single feature mapping \(g_{\mathrm {enc}}\) that supports both \(f_c\) and \(f_r\).
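As an illustration only, this decomposition can be expressed as a shared encoder with two heads, as in the following PyTorch-style sketch; the class, module, and method names are ours, not part of the paper.

```python
import torch.nn as nn

class DRCN(nn.Module):
    """Sketch of the DRCN decomposition: f_c = g_lab o g_enc and f_r = g_dec o g_enc."""
    def __init__(self, g_enc: nn.Module, g_lab: nn.Module, g_dec: nn.Module):
        super().__init__()
        self.g_enc = g_enc  # shared feature mapping (parameters Theta_enc)
        self.g_lab = g_lab  # feature labeling head  (parameters Theta_lab)
        self.g_dec = g_dec  # decoder head           (parameters Theta_dec)

    def f_c(self, x):
        # label prediction pipeline; returns class scores (softmax applied in the loss)
        return self.g_lab(self.g_enc(x))

    def f_r(self, x):
        # data reconstruction pipeline; returns a reconstruction of x
        return self.g_dec(self.g_enc(x))
```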

Learning algorithm: The learning objective is as follows. Suppose the inputs lie in \({\mathcal X}\subseteq {\mathbb R}^d\) and their labels lie in \({\mathcal Y}\subseteq {\mathbb R}^m\). Let \(\ell _c: {\mathcal Y}\times {\mathcal Y}\rightarrow {\mathbb R}\) and \(\ell _r: {\mathcal X}\times {\mathcal X}\rightarrow {\mathbb R}\) be the classification and reconstruction losses, respectively. Given a labeled source sample \(S^s = \{({\mathbf x}^s_i, {\mathbf y}^s_i) \}_{i=1}^{n_s} \sim {\mathbb P}\), where \({\mathbf y}^s_i \in \{ 0, 1\}^m\) is a one-hot vector, and an unlabeled target sample \(S^t_u = \{({\mathbf x}^t_j) \}_{j=1}^{n_t} \sim {\mathbb Q}\), we define the empirical losses as:

$$\begin{aligned} {\mathcal L}^{n_s}_c( \{ \varTheta _{\mathrm {enc}}, \varTheta _{\mathrm {lab}} \} ) := \sum _{i=1}^{n_s} \ell _c\left( f_c({\mathbf x}^s_i; \{ \varTheta _{\mathrm {enc}}, \varTheta _{\mathrm {lab}} \}), {\mathbf y}^s_i\right) , \end{aligned}$$
(3)
$$\begin{aligned} {\mathcal L}^{n_t}_r( \{ \varTheta _{\mathrm {enc}}, \varTheta _{\mathrm {dec}} \} ) := \sum _{j=1}^{n_t} \ell _r\left( f_r({\mathbf x}^t_j; \{ \varTheta _{\mathrm {enc}}, \varTheta _{\mathrm {dec}} \}), {\mathbf x}^t_j\right) . \end{aligned}$$
(4)

Typically, \(\ell _c\) is the cross-entropy loss \(\displaystyle -\sum _{k=1}^m y_k \log [f_c({\mathbf x})]_k\) (recall that \(f_c({\mathbf x})\) is the softmax output) and \(\ell _r\) is the squared loss \(\displaystyle \Vert {\mathbf x}- f_r({\mathbf x}) \Vert _2^2\).

Our aim is to solve the following objective:

$$\begin{aligned} \min _{\varTheta _{\mathrm {enc}}, \varTheta _{\mathrm {lab}}, \varTheta _{\mathrm {dec}}} \; \lambda {\mathcal L}^{n_s}_c( \{ \varTheta _{\mathrm {enc}}, \varTheta _{\mathrm {lab}} \} ) + (1-\lambda ) {\mathcal L}^{n_t}_r( \{ \varTheta _{\mathrm {enc}}, \varTheta _{\mathrm {dec}} \} ), \end{aligned}$$
(5)

where \(0 \le \lambda \le 1\) is a hyper-parameter controlling the trade-off between classification and reconstruction. The objective is a convex combination of supervised and unsupervised loss functions. We justify the approach in Sect. 5.

Objective (5) can be minimized by alternately minimizing \({\mathcal L}^{n_s}_c\) and \({\mathcal L}^{n_t}_r\) using stochastic gradient descent (SGD). In the implementation, we used RMSprop [46], a variant of SGD with gradient normalization: the current gradient is divided by a moving average of the root mean square of previous gradients. We utilize dropout regularization [47] during the minimization of \({\mathcal L}^{n_s}_c\), which is effective in reducing overfitting. Note that dropout regularization is applied in the fully-connected/dense layers only, see Fig. 1.

The stopping criterion for the algorithm is determined by monitoring the average reconstruction loss of the unsupervised model during training – the process is stopped when the average reconstruction loss stabilizes. Once the training is completed, the optimal parameters \(\hat{\varTheta }_{\mathrm {enc}}\) and \(\hat{\varTheta }_{\mathrm {lab}}\) are used to form a classification model \(f_c({\mathbf x}^t; \{ \hat{\varTheta }_{\mathrm {enc}}, \hat{\varTheta }_{\mathrm {lab}}\})\) that is expected to perform well on the target domain. The \( DRCN \) learning algorithm is summarized in Algorithm 1 and implemented using Theano [48].

Algorithm 1. The \( DRCN \) learning algorithm.
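A minimal sketch of this alternating procedure is given below in PyTorch (the paper’s implementation uses Theano); the data loader formats, the stopping tolerance, and all helper names are our own illustrative assumptions, and the DRCN class is the one sketched earlier in this section.

```python
import torch
import torch.nn.functional as F

def train_drcn(model, source_loader, target_loader, lam=0.5,
               lr=1e-4, alpha=0.9, max_epochs=200, tol=1e-4):
    """Alternate the supervised (source) and unsupervised (target) updates of objective (5).

    `model` exposes f_c and f_r as in the DRCN decomposition; `lam` is the trade-off
    weight lambda.  Loader formats and the stopping tolerance are illustrative only.
    """
    opt_c = torch.optim.RMSprop(list(model.g_enc.parameters()) +
                                list(model.g_lab.parameters()), lr=lr, alpha=alpha)
    opt_r = torch.optim.RMSprop(list(model.g_enc.parameters()) +
                                list(model.g_dec.parameters()), lr=lr, alpha=alpha)
    prev_rec = float("inf")
    model.train()  # enables dropout in the dense layers of the classifier
    for epoch in range(max_epochs):
        # (i) supervised step: minimize lambda * L_c on labeled source data
        for x_s, y_s in source_loader:
            loss_c = lam * F.cross_entropy(model.f_c(x_s), y_s)
            opt_c.zero_grad(); loss_c.backward(); opt_c.step()
        # (ii) unsupervised step: minimize (1 - lambda) * L_r on unlabeled target data
        rec_sum, n = 0.0, 0
        for x_t in target_loader:
            loss_r = (1.0 - lam) * F.mse_loss(model.f_r(x_t), x_t)
            opt_r.zero_grad(); loss_r.backward(); opt_r.step()
            rec_sum += loss_r.item() * x_t.size(0); n += x_t.size(0)
        # stop when the average reconstruction loss stabilizes
        avg_rec = rec_sum / n
        if abs(prev_rec - avg_rec) < tol:
            break
        prev_rec = avg_rec
    return model
```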

Data augmentation and denoising: We use two well-known strategies to improve \( DRCN \)’s performance: data augmentation and denoising. Data augmentation generates additional training data for the supervised training by applying plausible transformations to the original data, which improves generalization, see e.g. [49]. Denoising involves reconstructing clean inputs given their noisy counterparts; it is used to improve feature invariance in denoising autoencoders (DAE) [33]. Generalization and feature invariance are two properties needed to improve domain adaptation. Since \( DRCN \) has both classification and reconstruction aspects, we can naturally apply these two strategies simultaneously during training.

Let \({\mathbb Q}_{\tilde{X} | X}\) denote the noise distribution, conditioned on the original data, from which the noisy data are sampled. The classification pipeline \(f_c\) of \( DRCN \) thus observes additional pairs \(\{ (\tilde{{\mathbf x}}^s_i, y^s_i) \}_{i=1}^{n_s}\) and the reconstruction pipeline \(f_r\) observes \(\{ (\tilde{{\mathbf x}}^t_i, {\mathbf x}^t_i) \}_{i=1}^{n_t}\). The noise distribution \({\mathbb Q}_{\tilde{X} | X}\) typically corresponds to geometric transformations (translation, rotation, skewing, and scaling) in data augmentation, while either zero-masked noise or Gaussian noise is used in the denoising strategy. In this work, we combine all of the aforementioned noise types for denoising and use only the geometric transformations for data augmentation.
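As a rough illustration, the two noise sources might be implemented as follows, assuming torchvision-style random affine transforms for the geometric augmentation; the probabilities and magnitudes are placeholders, not the paper’s settings.

```python
import torch
from torchvision import transforms

# Geometric "noise" for data augmentation (translation, rotation, skew, scale);
# the magnitudes below are illustrative only.
augment = transforms.RandomAffine(degrees=10, translate=(0.1, 0.1),
                                  scale=(0.9, 1.1), shear=10)

def corrupt(x, p_mask=0.5, sigma=0.1, mode="mask"):
    """Draw a noisy input x_tilde ~ Q(. | x) for the denoising reconstruction."""
    if mode == "mask":                        # zero-masking noise
        return x * (torch.rand_like(x) > p_mask).float()
    return x + sigma * torch.randn_like(x)    # additive Gaussian noise
```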

4 Experiments and Results

This section reports the evaluation results of \( DRCN \). It is divided into two parts. The first part focuses on the evaluation on large-scale datasets popular with deep learning methods, while the second part summarizes the results on the Office dataset [11].

4.1 Experiment I: SVHN, MNIST, USPS, CIFAR, and STL

The first set of experiments investigates the empirical performance of \( DRCN \) on five widely used benchmarks: MNIST [35], USPS [50], Street View House Numbers (SVHN) [51], CIFAR [52], and STL [53], see the corresponding references for more detailed configurations. The task is to perform cross-domain recognition: taking the training set from one dataset as the source domain and the test set from another dataset as the target domain. We evaluate our algorithm’s recognition accuracy over three cross-domain pairs: (1) MNIST vs USPS, (2) SVHN vs MNIST, and (3) CIFAR vs STL.

MNIST (mn) vs USPS (us) contains 2D grayscale handwritten digit images of 10 classes. We preprocessed them as follows. USPS images were rescaled to \(28 \times 28\) and pixel values were normalized to [0, 1]. From this pair, two cross-domain recognition tasks were performed: mn \(\rightarrow \) us and us \(\rightarrow \) mn.

In the SVHN (sv) vs MNIST (mn) pair, MNIST images were rescaled to \(32 \times 32\) and SVHN images were converted to grayscale. The [0, 1] normalization was then applied to all images. Note that we did not preprocess SVHN images using local contrast normalization as in [54]. We evaluated our algorithm on the sv \(\rightarrow \) mn and mn \(\rightarrow \) sv cross-domain recognition tasks.

STL (st) vs CIFAR (ci) consists of RGB images that share eight object classes: airplane, bird, cat, deer, dog, horse, ship, and truck, yielding 4,000 (train) and 6,400 (test) images for STL, and 40,000 (train) and 8,000 (test) images for CIFAR. STL images were rescaled to \(32 \times 32\) and pixels were standardized to zero mean and unit variance. Our algorithm was evaluated on two cross-domain tasks, st \(\rightarrow \) ci and ci \(\rightarrow \) st.

The architecture and learning setup: The \( DRCN \) architecture used in the experiments is adopted from [44]. The label prediction pipeline has three convolutional layers: 100 5\(\,\times \,\)5 filters (conv1), 150 5\(\,\times \,\)5 filters (conv2), and 200 3\(\,\times \,\)3 filters (conv3), respectively, two max-pooling layers of size 2\(\,\times \,\)2 after the first and the second convolutional layers (pool1 and pool2), and three fully-connected layers (fc4, fc5, and fc\(\_\)out) – fc\(\_\)out is the output layer. The number of neurons in fc4 and fc5 was treated as a tunable hyper-parameter in the range [300, 350, ..., 1000], chosen according to the best performance on the validation set. The shared encoder \(g_\mathrm {enc}\) thus has the configuration conv1-pool1-conv2-pool2-conv3-fc4-fc5. The configuration of the decoder \(g_\mathrm {dec}\) is the inverse of that of \(g_\mathrm {enc}\). Note that the unpooling operation in \(g_\mathrm {dec}\) is performed by upsampling-by-duplication: the pooled values are inserted into the appropriate locations of the feature maps, with the remaining elements set to the same pooled values.

We employ ReLU activations [55] in all hidden layers and linear activations in the output layer of the reconstruction pipeline. Updates in both the classification and reconstruction tasks were computed via RMSprop with a learning rate of \(10^{-4}\) and a moving average decay of 0.9. The trade-off parameter \(\lambda \) was selected according to accuracy on the source validation data – typically, the optimal value was in the range [0.4, 0.7].
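A sketch of this configuration is shown below, assuming \(32 \times 32\) single-channel inputs, ‘same’ padding, fc widths of 1000, and nearest-neighbour upsampling as a stand-in for upsampling-by-duplication; these choices are ours, and the three modules plug into the DRCN composition sketched in Sect. 3.

```python
import torch.nn as nn

n_cls, n_ch, fc_dim = 10, 1, 1000         # illustrative: 32x32 grayscale digits, 10 classes

g_enc = nn.Sequential(                     # conv1-pool1-conv2-pool2-conv3-fc4-fc5
    nn.Conv2d(n_ch, 100, 5, padding=2), nn.ReLU(),
    nn.MaxPool2d(2),                       # 32x32 -> 16x16
    nn.Conv2d(100, 150, 5, padding=2), nn.ReLU(),
    nn.MaxPool2d(2),                       # 16x16 -> 8x8
    nn.Conv2d(150, 200, 3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(200 * 8 * 8, fc_dim), nn.ReLU(), nn.Dropout(0.5),   # fc4
    nn.Linear(fc_dim, fc_dim), nn.ReLU(), nn.Dropout(0.5),        # fc5
)
g_lab = nn.Linear(fc_dim, n_cls)           # fc_out (softmax applied inside the loss)

g_dec = nn.Sequential(                     # mirror of g_enc; linear output activation
    nn.Linear(fc_dim, fc_dim), nn.ReLU(),
    nn.Linear(fc_dim, 200 * 8 * 8), nn.ReLU(),
    nn.Unflatten(1, (200, 8, 8)),
    nn.Conv2d(200, 150, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2),           # nearest-neighbour upsampling approximates
    nn.Conv2d(150, 100, 5, padding=2), nn.ReLU(),  # upsampling-by-duplication
    nn.Upsample(scale_factor=2),
    nn.Conv2d(100, n_ch, 5, padding=2),    # reconstruction of the input pixels
)
```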

Benchmark algorithms: We compare DRCN with the following methods. (1) \( ConvNet _{src}\): a supervised convolutional network trained on the labeled source domain only, with the same network configuration as \( DRCN \)’s label prediction pipeline, (2) SCAE: a ConvNet preceded by layer-wise pretraining of stacked convolutional autoencoders on all unlabeled data [44], (3) \( SCAE _t\): similar to SCAE, but using only unlabeled data from the target domain during pretraining, (4) \( SDA _{sh}\) [32]: a deep network with three fully connected layers, which is a successful domain adaptation model for sentiment classification, (5) Subspace Alignment (SA) [27], and (6) ReverseGrad [18]: a recently published domain adaptation model based on deep convolutional networks that provides the state-of-the-art performance.

All deep learning based models above have the same architecture as DRCN for the label predictor. For ReverseGrad, we also evaluated the “original architecture” devised in [18] and report the better of the two architectures. Finally, we applied data augmentation to all models in the same way as for \( DRCN \). The ground-truth model is also evaluated, that is, a convolutional network trained and tested on images from the target domain only (\( ConvNet _{tgt}\)), to measure the difference between the cross-domain performance and the ideal performance.

Classification accuracy: Table 1 summarizes the cross-domain recognition accuracy (mean ± std) of all algorithms over ten independent runs. \( DRCN \) performs best in all but one cross-domain task, better than the prior state-of-the-art ReverseGrad. Notably, on the sv \(\rightarrow \) mn task, \( DRCN \) outperforms ReverseGrad by \(\sim 8\,\%\) accuracy. \( DRCN \) also provides a considerable improvement over ReverseGrad (\(\sim 5\,\%\)) on the reverse task, mn \(\rightarrow \) sv, but the gap to the ground truth is still large – this case was also mentioned in previous work as a failure case [18]. In the case of ci \(\rightarrow \) st, the performance of \( DRCN \) almost matches that of the target baseline.

\( DRCN \) also convincingly outperforms the greedy-layer pretraining-based algorithms (\( SDA _{sh}\), SCAE, and \( SCAE _t\)). This indicates the effectiveness of the simultaneous reconstruction-classification training strategy over the standard pretraining-finetuning in the context of domain adaptation.

Table 1. Accuracy (mean ± std \(\%\)) on five cross-domain recognition tasks over ten independent runs. Bold and underline indicate the best and second best domain adaptation performance. \( ConvNet _{tgt}\) denotes the ground-truth model: training and testing on the target domain only.

Comparison of different DRCN flavors: Recall that \( DRCN \) uses only the unlabeled target images for the unsupervised reconstruction training. To verify the importance of this strategy, we further compare two other flavors of \( DRCN \): \( DRCN _s\) and \( DRCN _{st}\). These variants are conceptually identical and differ only in which unlabeled images are used during the unsupervised training. \( DRCN _s\) uses only unlabeled source images, whereas \( DRCN _{st}\) combines both unlabeled source and target images.

The experimental results in Table 2 confirm that \( DRCN \) always performs better than \( DRCN _s\) and \( DRCN _{st}\). While \( DRCN _{st}\) occasionally outperforms ReverseGrad, its overall performance does not compete with that of \( DRCN \). The only case where the \( DRCN _s\) and \( DRCN _{st}\) flavors closely match \( DRCN \) is on mn \(\rightarrow \) us. This suggests that the use of unlabeled source data during the reconstruction training does not contribute much to cross-domain generalization, which supports the \( DRCN \) strategy of using the unlabeled target data only.

Table 2. Accuracy (\(\%\)) of \( DRCN _s\) and \( DRCN _{st}\).
Fig. 2. Data reconstruction after training on SVHN \(\rightarrow \) MNIST. Panels (a)–(b) show the original input pixels, and (c)–(f) depict the reconstructed source (SVHN) images. The reconstructions produced by \( DRCN \) appear as MNIST-like digits, see the main text for a detailed explanation.

Data reconstruction: A useful insight was found when reconstructing source images through the reconstruction pipeline of \( DRCN \). Specifically, we observe the visual appearance of \(f_r(x^s_1), \ldots , f_r(x^s_m)\), where \(x^s_1, \ldots , x^s_m\) are some images from the source domain. Note that \(x^s_1, \ldots , x^s_m\) are unseen during the unsupervised reconstruction training in \( DRCN \). We visualize such a reconstruction in the case of sv \(\rightarrow \) mn training in Fig. 2. Figure 2(a) and (b) display the original source (SVHN) and target (MNIST) images.

The main finding of this observation is depicted in Fig. 2(c): the reconstructed images produced by \( DRCN \) given some SVHN images as source inputs. We found that the reconstructed SVHN images resemble MNIST-like digits, with white strokes on a black background, see Fig. 2(b). Remarkably, \( DRCN \) can still produce “correct” reconstructions of some noisy SVHN images. For example, all SVHN digits 3 displayed in Fig. 2(a) are clearly reconstructed by \( DRCN \), see the fourth row of Fig. 2(c). \( DRCN \) tends to pick only the digit in the middle and ignore the remaining digits. This may explain the superior cross-domain recognition performance of \( DRCN \) on this task. However, such a cross-reconstruction appearance does not occur in the reverse task, mn \(\rightarrow \) sv, which may be an indicator of the low accuracy relative to the ground-truth performance.

We also conducted this diagnostic reconstruction for other algorithms that have a reconstruction pipeline. Figure 2(d) depicts the reconstructions of the SVHN images produced by ConvAE trained on the MNIST images only. They do not appear to be digits, suggesting that ConvAE treats the SVHN images as noise. Figure 2(e) shows the reconstructed SVHN images produced by \( DRCN _{st}\). They look almost identical to the source images shown in Fig. 2(a), which is not surprising since the source images are included in the reconstruction training.

Finally, we evaluated the reconstruction induced by \( ConvNet _{src}\) to observe the difference from the reconstruction of \( DRCN \). Specifically, we trained ConvAE on the MNIST images with the encoding parameters initialized from those of \( ConvNet _{src}\) and not updated during training. We refer to this model as ConvAE+\( ConvNet _{src}\). The reconstructed images are visualized in Fig. 2(f). Although they resemble the style of MNIST images as in the case of \( DRCN \), only a few source images are correctly reconstructed.

To summarize, the results from this diagnostic data reconstruction correlate with the cross-domain recognition performance. More visualization on other cross-domain cases can be found in the Supplemental materials.
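For completeness, the diagnostic above amounts to passing held-out source images through the trained reconstruction pipeline and plotting the outputs; a minimal sketch, assuming the DRCN model from the earlier sketches and matplotlib for display:

```python
import torch
import matplotlib.pyplot as plt

@torch.no_grad()
def show_reconstructions(model, x_source, n=8):
    """Plot source images (top row) against their reconstructions f_r(x) (bottom row)."""
    model.eval()
    recon = model.f_r(x_source[:n]).cpu()
    x = x_source[:n].cpu()
    fig, axes = plt.subplots(2, n, figsize=(1.5 * n, 3))
    for i in range(n):
        axes[0, i].imshow(x[i, 0], cmap="gray")      # original source digit
        axes[1, i].imshow(recon[i, 0], cmap="gray")  # reconstruction f_r(x)
        axes[0, i].axis("off"); axes[1, i].axis("off")
    plt.show()
```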

4.2 Experiment II: Office Dataset

In the second experiment, we evaluated \( DRCN \) on the standard domain adaptation benchmark for visual object recognition, Office [11], which consists of three different domains: amazon (a), dslr (d), and webcam (w). Office contains 2,817 labeled images in total, distributed across 31 object categories. The number of images is thus relatively small compared to the previously used datasets.

We applied the \( DRCN \) algorithm to fine-tune AlexNet [14], as was done with different methods in previous work [18, 40, 41]. The fine-tuning was performed only on the fully connected layers of AlexNet, fc6 and fc7, and the last convolutional layer, conv5. Specifically, the label prediction pipeline of \( DRCN \) contains conv4-conv5-fc6-fc7-label and the data reconstruction pipeline has conv4-conv5-fc6-fc7-\(fc6'\)-\(conv5'\)-\(conv4'\) (the \('\) denotes the inverse layer) – it thus does not reconstruct the original input pixels. The learning rate was selected following the strategy devised in [40]: cross-validating the base learning rate between \(10^{-5}\) and \(10^{-2}\) with a multiplicative step-size of \(10^{1/2}\).
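For concreteness, the cross-validated learning-rate grid described above can be generated as follows (a trivial helper of our own, not taken from [40]):

```python
# Base learning rates from 1e-5 to 1e-2 with multiplicative step 10^(1/2):
# [1e-05, 3.16e-05, 1e-04, 3.16e-04, 1e-03, 3.16e-03, 1e-02]
lr_grid = [10 ** (-5 + 0.5 * k) for k in range(7)]
```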

We followed the standard unsupervised domain adaptation training protocol used in previous work [7, 39, 40], that is, using all labeled source data and all unlabeled target data. Table 3 summarizes the accuracy of \( DRCN \) under this protocol in comparison to the state-of-the-art algorithms. We found that \( DRCN \) is competitive with DAN and ReverseGrad: its performance is either the best or the second best in all but one case. In particular, \( DRCN \) performs best by a convincing margin in situations where the target domain has relatively many data, i.e., with \(\textsc {amazon}\) as the target dataset.

Table 3. Accuracy (mean ± std \(\%\)) on the Office dataset with the standard unsupervised domain adaptation protocol used in [7, 39].

5 Analysis

This section provides a first step towards a formal analysis of the DRCN algorithm. We demonstrate that optimizing (5) in \( DRCN \) relates to solving a semi-supervised learning problem on the target domain according to a framework proposed in [19]. The analysis suggests that unsupervised training using only unlabeled target data is sufficient. That is, adding unlabeled source data might not further improve domain adaptation.

Denote the labeled and unlabeled distributions as \({\mathbb D}_{XY} =: {\mathbb D}\) and \({\mathbb D}_{X}\) respectively. Let \(P^\theta (\cdot )\) refer to a family of models, parameterized by \(\theta \in \varTheta \), that is used to learn a maximum likelihood estimator. The \( DRCN \) learning algorithm for domain adaptation tasks can be interpreted probabilistically by assuming that \(P^\theta (x)\) is Gaussian and \(P^\theta (y|x)\) is a multinomial distribution, fit by logistic regression.

The objective in Eq. (5) is equivalent to the following maximum likelihood estimate:

$$\begin{aligned} \hat{\theta } = \mathop {\mathrm{argmax}}\limits _{\theta } \lambda \sum _{i=1}^{n_s} \log P^{\theta }_{Y|X} (y^s_i | x^s_i) + (1 - \lambda ) \sum _{j=1}^{n_t} \log P^{\theta }_{X | \tilde{X}}(x^t_j | \tilde{x}^t_j), \end{aligned}$$
(6)

where \(\tilde{x}\) is the noisy input generated from \({\mathbb Q}_{\tilde{X} | X }\). The first term represents the model learned by the supervised convolutional network and the second term represents the model learned by the unsupervised convolutional autoencoder. Note that the discriminative model only observes labeled data from the source distribution \({\mathbb P}_X\) in objectives (5) and (6).

We now recall a semi-supervised learning problem formulated in [19]. Suppose that labeled and unlabeled samples are taken from the target domain \({\mathbb Q}\) with probabilities \(\lambda \) and \((1-\lambda )\) respectively. By Theorem 5.1 in [19], the maximum likelihood estimate \(\zeta \) is

$$\begin{aligned} \zeta = \mathop {\mathrm{argmax}}\limits _{\zeta } \lambda {\mathop {{{\mathrm{\mathbb E}}}}_{{\mathbb Q}}} [\log P^{\zeta }(x, y)]+ (1 - \lambda ) {\mathop {{{\mathrm{\mathbb E}}}}_{{\mathbb Q}_X }}[\log P^{\zeta }_X(x)] \end{aligned}$$
(7)

The theorem holds under the following assumptions: consistency, i.e., the model family contains the true distribution so that the MLE is consistent, and smoothness and measurability [56]. Given target data \( (x_1^t, y_1^t), \ldots , (x_{n_t}^t, y_{n_t}^t) \sim {\mathbb Q}\), the parameter \(\zeta \) can be estimated as follows:

$$\begin{aligned} \hat{\zeta } = \mathop {\mathrm{argmax}}\limits _{\zeta } \lambda \sum _{i=1}^{n_t} [\log P^{\zeta }(x_i^t, y_i^t)] + (1 - \lambda ) \sum _{i=1}^{n_t}[\log P^{\zeta }_X(x_i^t)] \end{aligned}$$
(8)

Unfortunately, \(\hat{\zeta }\) cannot be computed in the unsupervised domain adaptation setting since we do not have access to target labels.

Next we inspect a condition under which \(\hat{\theta }\) and \(\hat{\zeta }\) are closely related. First, by the covariate shift assumption [21], \({\mathbb P}\ne {\mathbb Q}\) but \({\mathbb P}_{Y|X} = {\mathbb Q}_{Y | X}\), the first term in (7) can be switched from an expectation over the target distribution to one over the source distribution:

$$\begin{aligned} {\mathop {{{\mathrm{\mathbb E}}}}_{{\mathbb Q}}} \Big [\log P^{\zeta }(x, y)\Big ] = {\mathop {{{\mathrm{\mathbb E}}}}_{{\mathbb P}}}\left[ \frac{{\mathbb Q}_X(x)}{{\mathbb P}_X(x)}\cdot \log P^{\zeta }(x, y)\right] . \end{aligned}$$
(9)

Second, it was shown in [57] that \(P^\theta _{X | \tilde{X}}(x | \tilde{x})\), see the second term in (6), defines an ergodic Markov chain whose asymptotic marginal distribution over X converges to the data-generating distribution, here the target distribution \({\mathbb Q}_X\). Hence, Eq. (8) can be rewritten as

$$\begin{aligned} \hat{\zeta } \approx \mathop {\mathrm{argmax}}\limits _{\zeta } \lambda \sum _{i=1}^{n_s} \frac{{\mathbb Q}_X(x_i^s)}{{\mathbb P}_X(x_i^s)} \log P^\zeta (x_i^s, y_i^s) + (1 - \lambda ) \sum _{j=1}^{n_t}[\log P^{\zeta }_{X|\tilde{X}}(x_j^t | \tilde{x}_j^t)]. \end{aligned}$$
(10)

The above objective differs from objective (6) only in the first term. Notice that \(\hat{\zeta }\) would be approximately equal to \(\hat{\theta }\) if the ratio \(\frac{{\mathbb Q}_X(x_i^s)}{{\mathbb P}_X(x_i^s)}\) were constant for all \(x^s\); under that assumption, (10) in fact becomes the objective of \( DRCN _{st}\). Although the constant ratio assumption is too strong to hold in practice, comparing (6) and (10) suggests that \(\hat{\theta }\) can be a reasonable approximation to \(\hat{\zeta }\).

Finally, we argue that using unlabeled source samples during the unsupervised training may not further contribute to domain adaptation. To see this, we expand the first term of (10) as follows

$$\begin{aligned} \lambda \sum _{i=1}^{n_s} \frac{{\mathbb Q}_X(x^s_i)}{{\mathbb P}_X(x^s_i)} \log P^\zeta _{Y|X}(y^s_i | x^s_i) + \lambda \sum _{i=1}^{n_s} \frac{{\mathbb Q}_X(x^s_i)}{{\mathbb P}_X(x^s_i)} \log P^\zeta _X(x^s_i). \end{aligned}$$

Observe the second term above. As \(n_s \rightarrow \infty \), \(P^\zeta _X\) will converge to \({\mathbb P}_X\), so this term approaches \(\lambda \sum _{i=1}^{n_s} \frac{{\mathbb Q}_X(x^s_i)}{{\mathbb P}_X(x^s_i)} \log {\mathbb P}_X(x^s_i)\), an empirical estimate of \(n_s \, \lambda \, {\mathop {{{\mathrm{\mathbb E}}}}_{{\mathbb P}_X}}\left[ \frac{{\mathbb Q}_X(x)}{{\mathbb P}_X(x)} \log {\mathbb P}_X(x)\right] = n_s \, \lambda \, {\mathop {{{\mathrm{\mathbb E}}}}_{{\mathbb Q}_X}}\left[ \log {\mathbb P}_X(x)\right] \), which no longer depends on \(\zeta \). Hence, adding more unlabeled source data only contributes a constant to the objective, yielding an optimization procedure equivalent to (6). This may explain why unlabeled source data does not help in this context of domain adaptation.

Note that the latter analysis does not necessarily imply that incorporating unlabeled source data degrades the performance. The fact that \( DRCN _{st}\) performs worse than \( DRCN \) could be due to, e.g., the model capacity, which depends on the choice of the architecture.

6 Conclusions

We have proposed Deep Reconstruction-Classification Network (\( DRCN \)), a novel model for unsupervised domain adaptation in object recognition. The model performs multitask learning, i.e., it alternately learns (source) label prediction and (target) data reconstruction using a shared encoding representation. We have shown that \( DRCN \) provides a considerable improvement over the state-of-the-art model on several cross-domain recognition tasks. It also performs better than deep models trained with the standard pretraining-finetuning approach. A useful insight into the effectiveness of \( DRCN \) can be obtained from its data reconstruction: the appearance of \( DRCN \)’s reconstructed source images resembles that of the target images, which indicates that \( DRCN \) learns the domain correspondence. We also provided a theoretical analysis relating the \( DRCN \) algorithm to semi-supervised learning. The analysis supports the strategy of involving only the unlabeled target data when learning the reconstruction task.