
Neural Networks

Volume 137, May 2021, Pages 1-17

Review
Robustifying models against adversarial attacks by Langevin dynamics

https://doi.org/10.1016/j.neunet.2020.12.024

Abstract

Adversarial attacks on deep learning models have compromised their performance considerably. As remedies, a number of defense methods have been proposed, which, however, have been circumvented by newer and more sophisticated attacking strategies. In the midst of this ensuing arms race, robustness against adversarial attacks remains a challenging problem. This paper proposes a novel, simple yet effective defense strategy in which off-manifold adversarial samples are driven towards high density regions of the data generating distribution of the (unknown) target class by the Metropolis-adjusted Langevin algorithm (MALA), with the perceptual boundary taken into account. To achieve this, we introduce a generative model of the conditional distribution of the inputs given labels that can be learned through a supervised Denoising Autoencoder (sDAE) in alignment with a discriminative classifier. Our algorithm, called MALA for DEfense (MALADE), is equipped with significant dispersion: the projection is distributed broadly, which prevents white-box attacks from accurately aligning the input to create an adversarial sample effectively. MALADE is applicable to any existing classifier, providing robust defense as well as off-manifold sample detection. In our experiments, MALADE exhibited state-of-the-art performance against various elaborate attacking strategies.

Introduction

Deep neural networks (DNNs) (He et al., 2016, Krizhevsky et al., 2012, LeCun et al., 1998, Simonyan and Zisserman, 2015, Szegedy et al., 2015) have shown excellent performance in many applications, yet they are known to be susceptible to adversarial attacks, i.e., examples crafted intentionally by adding slight noise to the input (Athalye, Engstrom et al., 2018, Evtimov et al., 2018, Goodfellow et al., 2015, Nguyen et al., 2015, Papernot et al., 2017, Szegedy et al., 2014). These two aspects are two sides of the same coin: the deep structure induces complex interactions between the weights of different layers, which provides the flexibility to express complex input–output relations with relatively few degrees of freedom, while it can make the output function unpredictable in spots where training samples are sparse. If adversarial attackers manage to find such spots in the input space close to a real sample, they can manipulate the behavior of the classifier, which can pose a critical security risk in applications that require high reliability, e.g., self-driving cars. Different types of defense strategies have been proposed, including adversarial training (Gu and Rigazio, 2015, Kannan et al., 2018, Lamb et al., 2018, Liu et al., 2018, Madry et al., 2018, Papernot, McDaniel, Wu et al., 2016, Strauss et al., 2017, Tramèr et al., 2018, Xie et al., 2019), which incorporates adversarial samples in the training phase; projection methods (Ilyas et al., 2017, Jin et al., 2019, Lamb et al., 2018, Samangouei et al., 2018, Schott et al., 2018, Song et al., 2018), which denoise adversarial samples by projecting them onto the data manifold; and preprocessing methods (Guo et al., 2018, Liao et al., 2018, Meng and Chen, 2017, Xie et al., 2018), which try to destroy the elaborate spatial coherence hidden in adversarial samples. Although these defense strategies were shown to be robust against the attacking strategies known at the time, most of them have since been circumvented by newer attacking strategies. Another type of approach, certification-based methods (Dvijotham et al., 2018, Raghunathan et al., 2018, Wong and Kolter, 2018, Wong et al., 2018, Xie et al., 2019), minimizes (bounds of) the worst-case loss over a defined range of perturbations and provides theoretical guarantees on robustness against any kind of attack. However, the guarantee holds only for small perturbations, and the performance of those methods against existing attacks is typically inferior to the state-of-the-art. Thus, the problem of robustness against adversarial attacks still remains unsolved.

In this paper, we propose a novel defense strategy which drives adversarial samples towards high density regions of the data distribution. Fig. 1 illustrates the idea of our approach. Assume that an attacker created an adversarial sample (red circle) by moving an original sample (black circle) to an untrained spot where the target classifier gives a wrong prediction. We can assume that this spot lies in a low density area of the training data, i.e., off the data manifold, where the classifier is not able to perform well, but still close to the original high density area, so that the adversarial pattern is imperceptible to a human. Our approach is to relax the adversarial sample by the Metropolis-adjusted Langevin algorithm (MALA) (Roberts and Rosenthal, 1998, Roberts et al., 1996), in order to project it back to the original high density area.
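For concreteness, the standard MALA iteration (Roberts & Rosenthal, 1998) combines a score-driven proposal with a Metropolis-Hastings correction; the step size $\alpha$ below is generic notation, not the paper's parameterization:

    $\tilde{x} = x_t + \alpha \, \nabla_x \log p(x_t) + \sqrt{2\alpha}\, \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, I),$
    $x_{t+1} = \tilde{x} \ \text{with probability} \ \min\!\Big\{1, \tfrac{p(\tilde{x})\, q(x_t \mid \tilde{x})}{p(x_t)\, q(\tilde{x} \mid x_t)}\Big\}, \quad x_{t+1} = x_t \ \text{otherwise},$

where $q(x' \mid x) = \mathcal{N}\big(x';\, x + \alpha \nabla_x \log p(x),\, 2\alpha I\big)$ is the Gaussian proposal density. The proposal alone is the unadjusted Langevin step; the accept/reject step corrects its discretization error.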

MALA requires the gradient of the energy function, which corresponds to the gradient $\nabla_x \log p(x)$ of the log probability, a.k.a. the score function, of the input distribution. As discussed in Alain and Bengio (2014), one can estimate this score function by a Denoising Autoencoder (DAE) (Vincent, Larochelle, Bengio, & Manzagol, 2008) (see Fig. 2a). However, naively applying MALA has an apparent drawback: if there exist high density regions (clusters) close to each other but not sharing the same label, MALA could drive a sample into another cluster (see the green line in Fig. 1), which degrades the classification accuracy. To overcome this drawback, we perform MALA driven by the score function of the conditional distribution $p(x|y)$ given the label $y$. We will show that this conditional score function can be estimated by a Supervised DAE (sDAE) (Lee et al., 2018, Lehman et al., 2018) (see Fig. 2b) with the weights for the reconstruction loss and the classification loss appropriately set.
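The link between denoising and score estimation exploited here can be stated compactly. For a DAE trained with Gaussian corruption of variance $\sigma^2$, the optimal reconstruction function $r^*$ satisfies (Alain & Bengio, 2014)

    $r^*(x) - x \approx \sigma^2 \, \nabla_x \log p(x) \qquad \text{(for small } \sigma\text{)},$

so the reconstruction residual points towards higher density. Section 3 establishes the analogous property for the sDAE, whose residual serves as an estimate of the conditional score $\nabla_x \log p(x|y)$ without requiring the label at test time; the precise weighting of the reconstruction and classification losses needed for this to hold is given there.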

By using the sDAE, our novel defense method, called MALA for DEfense (MALADE), relaxes the adversarial sample based on the conditional gradient $\nabla_x \log p(x|y)$ without knowing the label $y$ of the test sample. Thus, MALADE drives the adversarial sample towards high density regions of the data generating distribution for the original class (see the blue line in Fig. 1), where the classifier is well trained to predict the correct label.
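A minimal sketch of the resulting defense loop is given below, assuming an sDAE whose reconstruction residual approximates the (scaled) conditional score. The names sdae, classifier, step_size, noise_scale, and n_steps are illustrative and not taken from the paper, and the Metropolis-Hastings correction and step-size schedule of the full algorithm are omitted:

    import torch

    def malade_relax(x, sdae, n_steps=20, step_size=1e-2, noise_scale=1e-2):
        # Drive the input towards high density regions of p(x|y), using the sDAE
        # reconstruction residual as a surrogate for the conditional score.
        # (The Metropolis-Hastings correction of the full algorithm is omitted here.)
        x = x.clone()
        for _ in range(n_steps):
            with torch.no_grad():
                score = sdae(x) - x              # residual ~ sigma^2 * grad_x log p(x|y)
            x = x + step_size * score + noise_scale * torch.randn_like(x)
            x = x.clamp(0.0, 1.0)                # keep pixels in a valid range
        return x

    def defend_and_classify(x, sdae, classifier):
        # Project a (possibly adversarial) input back towards the data manifold,
        # then classify the relaxed sample with the unchanged classifier.
        return classifier(malade_relax(x, sdae)).argmax(dim=-1)

Because the residual itself acts as the gradient surrogate, no backward pass through the classifier is needed at defense time, and the classifier is used unchanged.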

Our proposed MALADE can be seen as one of the projection methods, most of which have been circumvented by recent attacking methods. However, MALADE has two essential differences from the previous projection methods:

    Significant dispersion

    Most projection methods, including MagNet (Meng & Chen, 2017), Defense-GAN (Samangouei et al., 2018), PixelDefend (Song et al., 2018), and others (Ilyas et al., 2017, Jin et al., 2019), try to pull the adversarial sample back to the original point (so that the adversarial pattern is removed). In contrast, MALADE drives the input sample to a random location within the closest cluster having the original label. In this sense, MALADE has much larger inherent randomness, and thus resilience, than the previous projection methods.

    Perceptual boundary taken into account

    All previous projection methods pull the input sample onto the closest point on the data manifold without taking the label information into account. In contrast, MALADE is designed to drive the input sample into the data manifold of the original class.

The previous projection methods were broken by aligning adversarial samples such that the classifier is fooled even after the projection, or by finding adversarial samples that are not significantly moved by the projector (Athalye, Carlini et al., 2018, Carlini and Wagner, 2017). The significant dispersion of MALADE makes these attacking strategies harder: it prevents any white-box attack from aligning the input such that MALADE stably moves it to a targeted untrained spot. Here, the second property is essential: when the dispersion of the projection is made broad, Langevin dynamics can carry a sample from the original cluster to a neighboring cluster with a different label, which results in a wrong prediction. The sDAE, taking the perceptual boundary into account, allows us to safely perform Langevin dynamics within the clusters of the correct label.
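To illustrate the dispersion property with the sketch above: two independent defenses of the same adversarial input generally land at different points of the correct cluster, so a white-box attacker cannot align against a fixed projection (x_adv and sdae are hypothetical placeholders):

    # Continuing the hypothetical sketch above (torch already imported):
    x_proj_a = malade_relax(x_adv, sdae)
    x_proj_b = malade_relax(x_adv, sdae)
    # Fresh Gaussian noise in every Langevin step makes the two projections differ,
    # while both are expected to end up in the high density region of the correct class.
    print(torch.norm(x_proj_a - x_proj_b))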

Concisely, our contributions in this paper are threefold:

  • We prove that an sDAE can provide an estimator of the conditional gradient $\nabla_x \log p(x|y)$ without knowing the label at test time (a sketch of the corresponding training objective follows this list).

  • We propose to perform a modified version of MALA suited for defense, i.e., MALADE, which drives samples towards high density areas of the conditional distribution instead of the marginal.

  • We empirically show that MALADE alone suffices to make standard classifiers robust on MNIST. On ImageNet, standard classifiers are completely broken, and MALADE alone cannot make them robust; however, MALADE improves the performance of adversarially trained classifiers. A combined strategy of detection and defense further enhances the performance and achieves state-of-the-art results in countering adversarial samples on ImageNet.
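As a sketch of the training objective referred to in the first contribution, the sDAE can be read as a denoising autoencoder trained jointly with a classification loss. The names lam, sigma, and cls_head below are illustrative placeholders; the appropriate loss weighting is derived in Section 3, and where exactly the classification branch attaches (code, reconstruction, or corrupted input) follows the paper rather than this sketch:

    import torch
    import torch.nn.functional as F

    def sdae_loss(x, y, sdae, cls_head, sigma=0.1, lam=1.0):
        # Supervised DAE objective (sketch): denoise a Gaussian-corrupted input and
        # simultaneously predict its label. With a suitably chosen weight lam, the
        # reconstruction residual estimates the conditional score grad_x log p(x|y).
        x_noisy = x + sigma * torch.randn_like(x)
        recon = sdae(x_noisy)                    # reconstruction of the clean input
        logits = cls_head(x_noisy)               # classification branch (placement illustrative)
        return F.mse_loss(recon, x) + lam * F.cross_entropy(logits, y)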

This paper is organized as follows. We first summarize existing attacking and defense strategies in Section 2. Then, we propose our method with a novel conditional gradient estimator in Section 3. In Section 4, we evaluate our defense method against various attacking strategies, and show advantages over the state-of-the-art defense methods. Section 5 concludes.

Section snippets

Existing methods

In this section, we introduce existing attacking and defense strategies.

Proposed method

In this section, we propose our novel defense strategy, which drives the input sample (if it lies in low density regions) towards high density regions. We achieve this by using Langevin dynamics.

Experiments

In this section, we empirically evaluate our proposed MALADE against various attacking strategies, and compare it with the state-of-the-art baseline defense strategies.

Concluding discussion

The threat of adversarial samples still remains an unresolved issue, even on a small toy dataset like MNIST. State-of-the-art robust methods do not scale well to larger data or models.

In this work, we have proposed MALA for DEfense (MALADE), which guides the Metropolis-adjusted Langevin algorithm (MALA) through a supervised DAE. This framework allows us to drive adversarial samples towards the underlying data manifold and thus towards the high density regions of the data generating distribution.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

KRM was supported in part by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea Government (No. 2019-0-00079, Artificial Intelligence Graduate School Program, Korea University), and was partly supported by the German Ministry for Education and Research (BMBF) under Grants 01IS14013A-E, 01GQ1115, 01GQ0850, 01IS18025A, 031L0207D and 01IS18037A; the German Research Foundation (DFG) under Grant Math+, EXC 2046/1, Project ID 390685689. WS

References (75)

  • Montavon, G., et al. Methods for interpreting and understanding deep neural networks. Digital Signal Processing (2018)
  • Abadi, M., et al. Tensorflow: A system for large-scale machine learning
  • Alain, G., et al. What regularized auto-encoders learn from the data-generating distribution. Journal of Machine Learning Research (2014)
  • Athalye, A., et al. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples
  • Athalye, A., et al. Synthesizing robust adversarial examples
  • Bengio, Y., et al. Generalized denoising auto-encoders as generative models
  • Brendel, W., et al. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models
  • Carlini, N., et al. On evaluating adversarial robustness (2019)
  • Carlini, N., et al. MagNet and "Efficient defenses against adversarial attacks" are not robust to adversarial examples (2017)
  • Carlini, N., et al. Towards evaluating the robustness of neural networks
  • Chen, P., Sharma, Y., Zhang, H., Yi, J., & Hsieh, C. (2018). EAD: Elastic-net attacks to deep neural networks via...
  • Deng, J., Dong, W., Socher, R., Li, L., Li, K., & Li, F. (2009). ImageNet: A large-scale hierarchical image database....
  • Dombrowski, A.-K., et al. Explanations can be manipulated and geometry is to blame
  • Dong, Y., Liao, F., Pang, T., Su, H., Zhu, J., & Hu, X., et al. (2018). Boosting adversarial attacks with momentum. In...
  • Dvijotham, K., Stanforth, R., Gowal, S., Mann, T. A., & Kohli, P. (2018). A dual approach to scalable verification of...
  • Elsayed, G., et al. Large margin deep networks for classification
  • Engstrom, L., et al. Evaluating and understanding the robustness of adversarial logit pairing (2018)
  • Evtimov, I., et al. Robust physical-world attacks on machine learning models
  • Fawzi, A., et al. Robustness of classifiers: From adversarial to random noise
  • Frosst, N., et al. DARCCC: Detecting adversaries by reconstruction from class conditional capsules (2018)
  • Girolami, M., et al. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society, Series B: Statistical Methodology (2011)
  • Goodfellow, I. J., et al. Explaining and harnessing adversarial examples
  • Gu, S., et al. Towards deep neural network architectures robust to adversarial examples
  • Guo, C., et al. Countering adversarial images using input transformations
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE...
  • Ilyas, A., et al. The robust manifold defense: Adversarial training using generative models (2017)
  • Jégou, S., Drozdzal, M., Vazquez, D., Romero, A., & Bengio, Y. (2017). The one hundred layers tiramisu: Fully...
  • Jin, G., et al. APE-GAN: Adversarial perturbation elimination with GAN
  • Kannan, H., et al. Adversarial logit pairing (2018)
  • Krizhevsky, A., et al. ImageNet classification with deep convolutional neural networks
  • Kurakin, A., et al. Adversarial machine learning at scale
  • Lamb, A., et al. Fortified networks: Improving the robustness of deep networks by modeling the manifold of hidden representations (2018)
  • Lapuschkin, S., et al. Unmasking Clever Hans predictors and assessing what machines really learn. Nature Communications (2019)
  • LeCun, Y., et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE (1998)
  • Lee, J., et al. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (2018)
  • Lehman, E. P., et al.
  • Liao, F., Liang, M., Dong, Y., Pang, T., Hu, X., & Zhu, J. (2018). Defense against adversarial attacks using high-level...