Perspective | Open access

Phases of learning dynamics in artificial neural networks in the absence or presence of mislabeled data

Yu Feng and Yuhai Tu

Published 19 July 2021 © 2021 The Author(s). Published by IOP Publishing Ltd
Citation: Yu Feng and Yuhai Tu 2021 Mach. Learn.: Sci. Technol. 2 043001. DOI: 10.1088/2632-2153/abf5b9


Abstract

Despite the tremendous success of deep neural networks in machine learning, the underlying reason for their superior learning capability remains unclear. Here, we present a framework based on statistical physics to study the dynamics of stochastic gradient descent (SGD), which drives learning in neural networks. Using the minibatch gradient ensemble, we construct order parameters to characterize the dynamics of weight updates in SGD. In the case without mislabeled data, we find that the SGD learning dynamics transitions from a fast learning phase to a slow exploration phase, which is associated with large changes in the order parameters that characterize the alignment of SGD gradients and their mean amplitude. In a more complex case, with randomly mislabeled samples, the SGD learning dynamics falls into four distinct phases. First, the system finds solutions for the correctly labeled samples in phase I; it then wanders around these solutions in phase II until it finds a direction that enables it to learn the mislabeled samples during phase III, after which, it finds solutions that satisfy all training samples during phase IV. Correspondingly, the test error decreases during phase I and remains low during phase II; however, it increases during phase III and reaches a high plateau during phase IV. The transitions between different phases can be understood by examining changes in the order parameters that characterize the alignment of the mean gradients for the two datasets (correctly and incorrectly labeled samples) and their (relative) strengths during learning. We find that individual sample losses for the two datasets are separated the most during phase II, leading to a data cleansing process that eliminates mislabeled samples and improves generalization. Overall, we believe that an approach based on statistical physics and stochastic dynamic systems theory provides a promising framework for describing and understanding learning dynamics in neural networks, which may also lead to more efficient learning algorithms.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction: learning as a stochastic dynamical system

Modern artificial neural network (ANN)-based algorithms, in particular, deep-learning neural networks (DLNNs) [1, 2] have enjoyed a long string of tremendous successes, achieving human-level performance in image recognition [3], machine translation [4], games [5], and even solving long-standing grand-challenge scientific problems, such as protein folding [6]. However, despite DLNNs' successes, the underlying mechanism of how they work remains unclear. For example, one key ingredient in powerful DLNNs is a relatively simple iterative method called stochastic gradient descent (SGD) [7, 8]. However, the reason why SGD is so effective at finding highly generalizable solutions in high-dimensional nonconvex loss-function landscapes remains unclear. Random elements due to subsampling in SGD seem to be key for learning, yet the inherent noise in SGD also makes it difficult to understand.

From thermodynamics and statistical physics, we know that physical systems with many degrees of freedom are subject to stochastic fluctuations, e.g., thermal noise that drives Brownian motion, and powerful tools have been developed to understand collective behaviors in stochastic processes [9]. In this paper, we propose to consider the SGD-based learning process as a stochastic dynamic system and to investigate SGD-based learning dynamics using concepts and methods from statistical physics.

In an ANN, the model is parameterized by its weights, represented as an Np -dimensional vector: $w = (w_1,w_2,...,w_{N_p})$, where Np is the number of parameters (weights). The dynamics of learning in an ANN can thus be described by the motion of a "learner" particle (with coordinates $w$) in the weight space. Supervised learning uses a set of N training samples, each with an input vector Xk and a correct output vector Zk for k = 1, 2, ..., N. For each input Xk , the learning system predicts an output vector $Y_k = G(X_k,w)$, where the output function G depends on the architecture of the NN as well as its weights, w. The goal of learning is to discover the weight parameters that minimize the difference between the predicted and correct outputs, characterized by an overall loss function (or energy function):

$L(w) = \frac{1}{N}\sum_{k = 1}^{N} l_k \qquad (1)$

where $l_k = d(Y_k,Z_k)$ is the loss for sample k that measures the distance between Yk and Zk . A popular choice for d is the cross-entropy loss, which is what we use in this paper.

One learning strategy is to update the weights by following the gradient of L directly. However, this direct gradient descent (GD) scheme is computationally prohibitive for large datasets and it also has the obvious shortfall of being trapped by local minima or saddle points. SGD was first introduced to circumvent the large dataset problem by updating the weights according to a subset (minibatch) of samples randomly chosen at each iteration [7]. Specifically, the change of weight wi (i = 1, 2,..., Np ) for iteration t in SGD is given by

$\Delta w_i(t) = -\alpha \frac{\partial L^{\mu(t)}(w)}{\partial w_i} \qquad (2)$

where α is the learning rate and µ(t) represents the random minibatch used for iteration t. The minibatch loss function for a minibatch µ of size B is defined as follows:

$L^{\mu}(w) = \frac{1}{B}\sum_{l = 1}^{B} l_{\mu_l} \qquad (3)$

where µl (l = 1, 2,..., B) labels the B randomly chosen training samples.
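To make equations (1)–(3) concrete, the short NumPy sketch below performs a single SGD iteration. It is only an illustration of the update rule, not the implementation used in our experiments; the helper `per_sample_grad` (returning the gradient of the per-sample loss $l_k$ with respect to w) is a hypothetical placeholder.

```python
import numpy as np

def sgd_step(w, X, Z, per_sample_grad, alpha=0.01, B=32, rng=None):
    """One SGD iteration (equation (2)) on a random minibatch (equation (3)).

    per_sample_grad(w, x, z) is a hypothetical helper that returns the gradient
    of the per-sample loss l(G(x, w), z) with respect to the weight vector w.
    """
    rng = rng or np.random.default_rng()
    mu = rng.choice(len(X), size=B, replace=False)   # random minibatch µ(t)
    # minibatch gradient ∇L^µ = (1/B) Σ_l ∇l_{µ_l}
    grad_mu = np.mean([per_sample_grad(w, X[k], Z[k]) for k in mu], axis=0)
    return w - alpha * grad_mu                       # Δw = -α ∇L^µ(w)
```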

In addition to the computational advantage of SGD, the inherent noise due to random subsampling in SGD allows the system to escape local traps. In SGD, noise originates from the difference between the minibatch loss function $L\,^\mu$ and the whole-batch loss function, L: $\delta L^{\mu} \equiv L^{\mu}-L$. Using the continuous time approximation of equation (2), the SGD learning dynamics can be described by a Langevin equation:

$\frac{\mathrm{d}w}{\mathrm{d}t} = -\alpha \nabla L(w) + \eta \qquad (4)$

where the first term on the right-hand side (RHS) of equation (4) is the usual deterministic GD term, and the second term corresponds to SGD noise, defined as: $\eta \equiv -\alpha \nabla \delta L\,^\mu$. The SGD noise has a zero mean $\langle \eta \rangle_\mu = 0$, and its strength is characterized by the noise matrix $ \Delta_{ij}\equiv \langle \eta_i \eta_{j} \rangle = \alpha^2 C_{ij}$, where the covariance matrix $\boldsymbol{\mathrm{C}}$ can be written as follows:

$C_{ij} = \Big\langle \frac{\partial\, \delta L^{\mu}}{\partial w_i}\, \frac{\partial\, \delta L^{\mu}}{\partial w_j} \Big\rangle_{\mu} \qquad (5)$

According to equation (4), the SGD-based learning dynamics can be considered as the stochastic motion of the learner particle in the high-dimensional weight space. The stochastic dynamics of physical systems that are in thermal equilibrium can also be described by Langevin equations with the same deterministic term as in equation (4), but with a much simpler noise term that describes the isotropic and homogeneous thermal fluctuations. Indeed, as first pointed out by Chaudhari and Soatto [10], SGD noise is neither isotropic nor homogeneous in the weight space. In this sense, SGD noise is highly nonequilibrium. As a result of nonequilibrium SGD noise, the steady-state distribution of weights is not the Boltzmann distribution seen in equilibrium systems, and the SGD dynamics exhibits much richer behavior than simply minimizing a global loss function (free energy).

How can we understand SGD-based learning in ANN? Here, we propose to bring useful concepts and tools from statistical physics [11] and stochastic processes [9] to bear on characterizing and investigating the SGD learning process/dynamics. In the rest of this paper, we describe a systematic way to characterize SGD dynamics based on order parameters that are defined over the minibatch gradient ensemble. We show how this approach allows us to identify and understand various phases of the learning process with and without labeling noise, which may lead to useful algorithms that improve generalization in the presence of mislabeled data. Throughout our study, we use realistic but simple datasets to demonstrate the principles of our approach, and pay less attention to absolute performance.

2. Characterizing SGD learning dynamics: the minibatch gradient ensemble and order parameters

To characterize the stochastic learning dynamics in SGD, we introduce the concept of a minibatch ensemble {µ}, where each member of the ensemble is a minibatch with B samples chosen randomly from the whole training dataset (of size N). Based on the minibatch ensemble, we can define an ensemble of minibatch loss functions $L\,^\mu$ or, equivalently, an ensemble of gradients $\{g\,^\mu (\equiv -\nabla L^{\mu}(w))\}$ at each weight vector w.

The SGD learning dynamics is fully characterized by the statistical properties of the gradient ensemble in weight space $\{g\,^\mu (w)\}$. At each point in the weight space, the ensemble average of the minibatch gradients is the gradient over the whole dataset: $g(w) \equiv \langle g\,^\mu (w) \rangle_\mu ( = \nabla L(w))$, and fluctuations of the gradients around their mean give rise to the noise matrix (equation (5)). To measure the alignment between the minibatch gradients, we define an alignment parameter R:

$R \equiv \langle \hat{g}\,^\mu \cdot \hat{g}\,^\nu \rangle_{\mu \neq \nu} \qquad (6)$

where $\hat{g}\,^\mu = g\,^\mu/\|g\,^\mu\|$ is the unit vector in the gradient direction $g\,^\mu$. The alignment parameter is the cosine of the relative angle between the two gradients averaged over all pairs of minibatches (µ, ν) in the ensemble.
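As a minimal sketch of how R can be estimated in practice, suppose the minibatch gradients at the current weights have been collected (e.g., by per-minibatch backpropagation) into an array `g_mu` of shape (M, Np), where M is the number of sampled minibatches; the function below and its names are ours, not part of any published code.

```python
import numpy as np

def alignment_R(g_mu):
    """Alignment order parameter R (equation (6)): the mean cosine of the angle
    between distinct minibatch gradients. g_mu has shape (M, Np)."""
    ghat = g_mu / np.linalg.norm(g_mu, axis=1, keepdims=True)  # unit vectors ĝ^µ
    cos = ghat @ ghat.T                                        # pairwise cosines
    M = len(g_mu)
    return cos[~np.eye(M, dtype=bool)].mean()                  # average over µ ≠ ν

# toy example: 100 minibatch gradients in a 900-dimensional weight space
R = alignment_R(np.random.randn(100, 900))
```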

To analyze the gradient fluctuations in different directions, we can project the minibatch gradient $g\,^\mu$ onto the mean, g, and write it as follows:

$g\,^\mu = \lambda_\mu g + g\,^\mu_{\bot} \qquad (7)$

where $\lambda_\mu = (g\,^\mu \cdot g)/\|g\|^2$ is the projection constant and $g\,^\mu_{\bot}$ is the residue gradient perpendicular to g: $g\,^\mu_{\bot} \cdot g = 0$. Analogously to kinetic energy, we use the square of the gradient to measure the learning activity. The ensemble averaged activity (A) can be split into two parts:

$A \equiv \langle \|g\,^\mu\|^2 \rangle_\mu = \langle \lambda^2_\mu \rangle_\mu \|g\|^2 + \langle \|g\,^\mu_{\bot}\|^2 \rangle_\mu \equiv A_{\|} + A_{\bot} \qquad (8)$

where $A_{\|}$ and $A_{\bot}$ represent activities along the mean gradient and orthogonal to it, respectively.

The total variance, D, of fluctuations in all directions is the trace of the covariance matrix $\boldsymbol{\mathrm{C}}$:

$D \equiv \mathrm{Tr}(\boldsymbol{\mathrm{C}}) = D_{\|} + A_{\bot} \qquad (9)$

where $D_{\|} = \sigma_{\lambda}^2\|g\|^2$ is the variance along the direction of the batch gradient g and $\sigma^2_\lambda\equiv \langle \lambda^2_\mu\rangle_\mu -1$ is the variance of $\lambda_\mu$ (Note that $\langle \lambda_\mu\rangle_\mu = 1$ by definition); $A_{\bot}$ is the total variance in the orthogonal directions. The mean learning activity can be written as: $A = A_0 +A_{\bot} + D_{\|}$, where $A_0\equiv \|g\|^2$ represents the directed activity in the direction of the mean gradient; $A_{\bot}$ and $D_{\|}$ represent the diffusive search activities in the directions orthogonal and parallel to the mean gradient, respectively.
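The remaining order parameters follow from the same gradient ensemble. The sketch below, again assuming a stacked array `g_mu` of minibatch gradients, evaluates A, A0, $\sigma^2_\lambda$, $D_{\|}$, $A_{\bot}$, and D as defined above; it is an illustrative estimator rather than the code used for the figures.

```python
import numpy as np

def activity_order_parameters(g_mu):
    """Activity and variance order parameters (equations (7)-(9)) from an
    ensemble of minibatch gradients g_mu with shape (M, Np)."""
    g = g_mu.mean(axis=0)                        # mean gradient g = <g^µ>
    A = np.mean(np.sum(g_mu**2, axis=1))         # total activity A = <||g^µ||^2>
    A0 = np.sum(g**2)                            # directed activity A0 = ||g||^2
    lam = g_mu @ g / A0                          # projection constants λ_µ (<λ_µ> = 1)
    sigma2_lam = np.mean(lam**2) - 1.0           # variance of λ_µ
    D_par = sigma2_lam * A0                      # variance along g
    g_perp = g_mu - np.outer(lam, g)             # residue gradients g^µ_⊥
    A_perp = np.mean(np.sum(g_perp**2, axis=1))  # activity orthogonal to g
    D = A - A0                                   # total variance D = Tr(C)
    return dict(A=A, A0=A0, sigma2_lam=sigma2_lam,
                D_par=D_par, A_perp=A_perp, D=D)  # note A = A0 + D_par + A_perp
```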

All these quantities (A, A0, R, $\sigma^2_\lambda$) depend on the weights (w). Along an SGD learning trajectory in weight space, we can evaluate these order parameters and their relative values at any given time t to characterize different phases of the SGD learning dynamics. For example, we use A and A0 to measure the total learning activity and the activity in the mean gradient direction, respectively. The alignment between different minibatch gradients is measured by R, which is related to the fractionally aligned activity A0/A. The fluctuations of the minibatch gradients projected onto the mean gradient are measured by $\sigma^2_\lambda$. In our previous work [12], we used time averaging to approximate some of these order parameters for computational convenience. However, the properties of the SGD dynamics at any given point in weight space are precisely defined by the ensemble-averaged order parameters, which are what we use hereafter.

As previously mentioned, SGD noise is anisotropic and varies in weight space. Each eigenvalue el of the symmetric, positive semi-definite covariance matrix $\boldsymbol{\mathrm{C}}$ gives the noise strength in the corresponding eigen-direction (l = 1, 2,..., Np, where Np is the number of weights or the dimension of the weight space). The overall noise strength $D = Tr(\boldsymbol{\mathrm{C}}) = \sum_{l = 1}^{N_p} e_l$ describes the total search activity, and the eigenvalue spectrum $\{e_l, \; l = 1,2,...,N_p\}$ tells us how much of the total search activity is spent in each eigen-direction. From the noise spectrum, we can define the effective dimension of the search activity Ds (w) as the number of eigen-directions that together account for a certain large percentage (e.g. 90%) of the total variance D.
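As a rough illustration, the noise spectrum and the effective dimension Ds can be estimated from a finite sample of minibatch gradients as sketched below; with M sampled minibatches at most min(M, Np) eigenvalues are nonzero, and this is not necessarily how the spectra in figures 2 and 5 were computed.

```python
import numpy as np

def noise_spectrum(g_mu, frac=0.9):
    """Rank-ordered eigenvalues {e_l} of the gradient covariance (equation (5))
    and the effective dimension D_s containing a fraction `frac` of the variance."""
    dg = g_mu - g_mu.mean(axis=0)              # gradient fluctuations g^µ - g
    C = dg.T @ dg / len(g_mu)                  # empirical covariance (Np x Np)
    e = np.sort(np.linalg.eigvalsh(C))[::-1]   # eigenvalues, largest first
    cum = np.cumsum(e) / e.sum()               # normalized accumulated variance
    Ds = int(np.searchsorted(cum, frac) + 1)   # smallest l with cum_l >= frac
    return e, Ds
```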

3. Phases of SGD learning dynamics in the absence of mislabeled data

We first study the learning dynamics without mislabeled data, e.g., the original MNIST dataset (details of all numerical experiments can be found in the supplemental material (available online at stacks.iop.org/MLST/2/043001/mmedia)). As shown in figure 1, the dynamics of the overall loss function L suggests that there are two phases in learning. There is an initial fast learning phase, where L decreases quickly, followed by an exploration phase where the training error εtr reaches zero (or nearly zero), while L still decreases, but much more slowly. These two learning phases exist independently of the hyperparameters (e.g. α and B) and network architectures (fully connected networks or CNNs) used for different datasets (e.g., MNIST and CIFAR). The weights reached in the exploration phase can be considered as solutions to the problem, given that the training error vanishes.


Figure 1. Two phases of learning without labeling noise. (A) Training loss L, training error εtr , and order parameters A, R, and $\sigma^2_\lambda$ versus (training) time. The fast learning phase corresponds to a directed (finite $R\gt0$, $\sigma^2_\lambda\sim 1$) and fast (large A) motion in weight space; the exploration phase corresponds to a diffusive (R ≈ 0, $\sigma^2_\lambda\gg 1$) and slow (small A) motion in weight space. The dotted line shows R = 0. The green bar highlights the transition region. The MNIST data and a fully connected network with two hidden layers (30 × 30) are used here. (B) Illustration of the normalized minibatch gradient ensemble (blue dotted arrows) and their means (black solid arrows) in the two learning phases.


The dynamics of the order parameters A(t), R(t), and $\sigma^2_\lambda$ along the trajectory can be used to characterize and understand the two phases. As shown in figure 1(A), at the beginning of the learning process, the learning activity A is relatively large, and the alignment parameter R is finite. In this initial phase of learning, the minibatch gradients have a high degree of alignment, resulting in a strongly directed motion of the learner particle and a rapid decrease of L toward a solution region in the weight space with low L and zero training error εtr . In the exploration phase, the average learning activity A becomes much smaller, while the average alignment parameter R approaches zero. This means that the motion of the learner particle becomes mostly diffusive (weakly directed) and the decrease of L slows. This diffusive motion of the weights allows the system to explore the solution space. The transition from directed to diffusive motion is also reflected in the large increase in the variance $\sigma^2_\lambda$ at the transition. Due to the finite size of the system, the transition is not infinitely sharp, unlike the phase transitions that occur in physical systems in the thermodynamic limit (infinite system limit). As shown in figure 1(A), the training error εtr becomes zero during the transition regime and it stays at zero during the exploration phase. These results confirm those of our previous study, which used time-averaged order parameters [12]. The key differences between the two phases in terms of the alignment of minibatch gradients and the mean gradient strength are illustrated in figure 1(B). These two phases are independent of the network size, and they also appear in other neural network architectures, such as convolutional neural networks and residual networks. See figure S1 in the supplementary material for details.

We have also studied the noise spectra in the two phases. As shown in figure 2, unlike isotropic thermal noise, SGD noise has a highly anisotropic structure, with most of its variance (strength) concentrated in a relatively small number of directions. The normalized noise spectra are similar in both phases, but the total noise strength (variance) D is much higher in the fast learning phase. The effective dimension, defined as the number of directions that contain 90% of the total variance, is Ds ∼ 110, which is much smaller than the number of weights (parameters), and it remains roughly constant as the number of parameters increases.


Figure 2. The noise spectra, i.e. rank-ordered eigenvalues $\{e_l,\; l = 1,2...,N_p\}$ in the fast learning phase (black) and the exploration phase (red). The inset shows the normalized accumulated variance $D^{-1}\sum_{i = 1}^{l} e_i$. The two spectra are similar, except for their total variance, D. The effective dimension Ds ∼ 110, which is much smaller than the number of parameters (Np  = 900), is roughly the same in both phases. The data and network used here are the same as in figure 1.


4. Phases of SGD learning dynamics in the presence of mislabeled data

There has been much interest in deep learning in the presence of mislabeled data. This was triggered by a recent study [13], in which the authors showed that random labels can easily be fitted by deep networks in the over-parameterized regime and that such overfitting destroys generalization. Here, we report some new results using the dynamic systems approach developed in the previous sections to study SGD learning dynamics with labeling noise.

In a dataset with Nc correctly labeled training samples and Nw incorrectly (randomly) labeled samples, the overall loss function L consists of two parts, Lc and Lw , which originate from the correctly labeled samples and the randomly labeled samples, respectively:

$L = (1-\rho) L_c + \rho L_w, \qquad L_c \equiv \frac{1}{N_c}\sum_{k = 1}^{N_c} l_k, \quad L_w \equiv \frac{1}{N_w}\sum_{k = 1}^{N_w} \tilde{l}_k \qquad (10)$

where $N = N_c+N_w$ is the total number of training samples and ρ = Nw /N is the fraction of mislabeled samples. The loss function for a correctly labeled sample is the cross-entropy l between the output $Y_k(X_k,w)$ of the network with weight vector w and the correct label vector Zk : $l_k = l(Y_k,Z_k)$, while the loss function for a mislabeled sample is: $\tilde{l}_k = l(Y_k,Z^r_k)$, where $Z^r_k$ is a random label vector.
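For readers who wish to reproduce this setting, a minimal label-corruption sketch is given below. It assumes integer class labels and allows a random label to coincide with the true one; these details are our assumptions rather than a specification of the experiments.

```python
import numpy as np

def corrupt_labels(Z, rho, num_classes=10, seed=0):
    """Randomly relabel a fraction rho of the training set; returns the noisy
    labels and a boolean mask marking which samples were relabeled."""
    rng = np.random.default_rng(seed)
    N = len(Z)
    Nw = int(rho * N)                                 # number of mislabeled samples
    wrong = rng.choice(N, size=Nw, replace=False)     # indices to corrupt
    Z_noisy = Z.copy()
    Z_noisy[wrong] = rng.integers(0, num_classes, size=Nw)  # random labels Z^r_k
    mask = np.zeros(N, dtype=bool)
    mask[wrong] = True
    return Z_noisy, mask
```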

We conducted experiments using MNIST and CIFAR10 with different fractions of mislabeled data (ρ). As shown in figure 3(A) for MNIST, the whole learning process can be divided into four phases (the study of the CIFAR10 dataset showed similar results):

  • Phase I: During this initial fast learning phase (0–10 epochs in figure 3(A)), the test error εte decreases quickly as the system learns the correctly labeled data. The error εc from the correctly labeled training data follows the exact same trend as εte, and the error εw from the mislabeled training data actually increases slightly, indicating that the learning in phase I is dominated by the correctly labeled training data.
  • Phase II: After the initial fast learning phase, the test error εte stays roughly the same during phase II (10–70 epochs in figure 3(A)). Both εw and εc remain flat, indicating that learning activities for the correct and incorrect samples are balanced during phase II. This can also be seen in the plateau in the total training error $\epsilon_{tr} = (1-\rho) \epsilon_c +\rho \epsilon_w$.
  • Phase III: At the end of phase II (∼70 epochs), the test error εte starts to increase quickly, while the training errors for both the correct and the incorrect training data (εc , εw ) decrease to zero during phase III (70–200 epochs). During phase III, the system finally manages to find (learn) a solution that satisfies both the correct and incorrect training data.
  • Phase IV: Phase IV corresponds to the slow exploration phase after the system reaches the solution space for the whole dataset. The test error reaches a high plateau in phase IV.


Figure 3. Learning dynamics in the presence of labeling noise. (A) The training error εtr , the test error εte , the training error for correctly labeled data εc , and the training error for mislabeled data εw are shown for a subset of MNIST data with 400 samples per digit and a fully connected network with two hidden layers (50 hidden units per layer). SGD hyper-parameters: B = 25, α = 0.01. (B) εte dynamics for different values of ρ. (C) The dependence of the time scales (tm and tf ) on ρ. (D) The dependence of the minimum and final test errors (εm and εf ) on ρ.


The four distinct phases in the presence of labeling noise, and the corresponding 'U'-shaped behavior of the test error, are general for a wide range of noise levels (ρ), see figure 3(B). Quantitatively, the dynamics of the test error εte (t) during these four phases can be characterized by two timescales, tm (the time when the test error reaches its minimum) and tf (the time when the training loss reaches its minimum), and the two corresponding test errors, εm and εf . All four parameters depend on ρ. As shown in figure 3(C), tm is almost independent of ρ, which means that the time needed to learn the correctly labeled data does not depend on the size of the correctly labeled subset, as long as that subset is large enough. However, tf increases with ρ, which means that the network needs more time to memorize the incorrectly labeled data as the number of mislabeled samples increases. As shown in figure 3(D), the final test error εf increases almost linearly with ρ, which is caused by the increased fraction of mislabeled data. The minimum error εm remains roughly the same when ρ is small, but increases sharply after a threshold and approaches εf when $\rho \gt0.85$. This also makes sense, because when ρ is large, learning is dominated by mislabeled data and the correctly labeled data no longer drives the learning dynamics.

Here, we try to understand the different phases and the transitions between them by using order parameters that are modified for the case with labeling noise. In particular, each minibatch µ now consists of two smaller minibatches, µc and µw, for correctly and incorrectly labeled data ($\mu = \mu_c +\mu_w$) with average sizes of Bc  = (1 − ρ)B and Bw  = ρB, respectively. The minibatch loss function can be decomposed into two minibatch loss functions, $L^{\mu_c}$ and $L^{\mu_w}$, defined separately for µc and µw : $L\,^\mu = L^{\mu_c}+L^{\mu_w}$. At a given point in weight space, the ensemble-averaged gradient and activity for the correctly and incorrectly labeled data can be defined separately:

$g_c \equiv -\nabla L_c = (1-\rho)^{-1} \langle g^{\mu_c} \rangle_\mu, \qquad g_w \equiv -\nabla L_w = \rho^{-1} \langle g^{\mu_w} \rangle_\mu \qquad (11)$

$A_c \equiv \langle \|g^{\mu_c}\|^2 \rangle_\mu, \qquad A_w \equiv \langle \|g^{\mu_w}\|^2 \rangle_\mu \qquad (12)$

The alignment of the two gradients gc and gw can be characterized by the cosine of their relative angle:

$C_{cw} \equiv \frac{g_c \cdot g_w}{\|g_c\|\, \|g_w\|} \qquad (13)$

from which we obtain the ensemble-averaged gradient and activity for the whole dataset:

$g \equiv \langle g\,^\mu \rangle_\mu = (1-\rho)\, g_c + \rho\, g_w \qquad (14)$

$A \equiv \langle \|g\,^\mu\|^2 \rangle_\mu = A_c + A_w + 2\rho(1-\rho)\, g_c \cdot g_w \qquad (15)$

From the basic order parameters defined above, we can define the directed activities $A_{0,c}\equiv (1-\rho)^2\|g_c\|^2$, $A_{0,w}\equiv \rho^2 \|g_w\|^2$, and $A_0 \equiv \|g\|^2 = A_{0,c} + A_{0,w} +2 [A_{0,w}A_{0,c}]^{\frac{1}{2}} C_{cw}$; the alignments between g and gc and between g and gw are $R_{ac}\equiv \frac{g \cdot g_c}{\|g\| \| g_c\|}$ and $R_{aw}\equiv \frac{g \cdot g_w}{\|g\| \| g_w\|}$, respectively. We can also define alignment order parameters among members within the different gradient ensembles ({µc }, {µw }, and {µ}).
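A sketch of how these decomposed order parameters can be computed is shown below. It assumes that per-minibatch gradients of $L^{\mu_c}$ and $L^{\mu_w}$ have been stacked into arrays `g_mu_c` and `g_mu_w`, and it normalizes the mean gradients per sample (dividing by 1 − ρ and ρ); this normalization convention is our assumption, chosen only to be consistent with the definitions of A0,c and A0,w above.

```python
import numpy as np

def split_order_parameters(g_mu_c, g_mu_w, rho):
    """Order parameters for the correct/mislabeled decomposition: mean gradients
    g_c, g_w, their alignment C_cw, directed activities A0_c, A0_w, A0, and the
    alignments R_ac, R_aw. Inputs have shape (M, Np); 0 < rho < 1 is assumed."""
    g_c = g_mu_c.mean(axis=0) / (1.0 - rho)   # per-sample-normalized mean gradients
    g_w = g_mu_w.mean(axis=0) / rho           # (normalization is our assumption)
    g = (1.0 - rho) * g_c + rho * g_w         # mean gradient of the full loss
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    A0_c = (1.0 - rho) ** 2 * np.sum(g_c**2)
    A0_w = rho ** 2 * np.sum(g_w**2)
    return dict(C_cw=cos(g_c, g_w), R_ac=cos(g, g_c), R_aw=cos(g, g_w),
                A0_c=A0_c, A0_w=A0_w, A0=np.sum(g**2))
```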

To understand the learning dynamics in the presence of labeling noise, we studied three groups of order parameters: the total activities (A, Ac , Aw ), the directed activities (A0, A0,c , A0,w ), and the alignments (Rcw , Raw , Rac ). In figure 4, we show how these order parameters change during training for the case with ρ = 50%. As shown in figures 4(A) and (B), all the learning activity order parameters (A's and A0's) show a consistent trend of increasing during phases I, II, and III before decreasing during phase IV. This is in contrast to the behavior of the learning activity A in the absence of labeling noise, which shows a relatively flat or slightly decreasing trend during the fast learning phase (see figure 1). This continuously elevated learning activity in phases I–III suggests an increasing frustration between the two separate learning tasks (of learning the correctly and incorrectly labeled datasets) before a consistent solution is found in phase IV.


Figure 4. Dynamics of the order parameters during phases of learning with mislabeled data. (A) Total activities (A, Aw , Ac ). (B) Directed activities (A0, A0,w , A0,c ), the inset shows the ratio $A_{0,c}/A_{0,w}$. (C) Alignment parameters (Rcw , $R_{ac}, R_{aw}$). The dotted line shows R = 0. (D) Illustration of the four different phases in terms of the relative strength and direction of the two mean gradients (gc and gw ). ρ = 0.5 is used for (A)–(C).


The difference between the learning phases I, II, and III can be understood by studying the relation between the two mean gradients gw and gc characterized by the alignment order parameter Rcw (see figure 4(C)) and the relative strength of the two directed activities A0,c and A0,w .

  • Phase I: $A_{0,c}\gg A_{0,w}$, $R_{cw}\lt0$. In phase I, the directed activity from the correctly labeled data is much larger than that from the incorrectly labeled data (see inset in figure 4(B)). This is because the samples in the correctly labeled dataset are consistent with each other in terms of their labels, which leads to a much larger mean gradient toward a solution for the correctly labeled data. In phase I, gc and gw are not aligned ($R_{cw}\lt0$). Since $A_{0,c}\gg A_{0,w} $, we have $R_{aw}\lt0$, which means that Lw increases during phase I, as observed in figure 3(A).
  • Phase II: $A_{0,w}\approx A_{0,c}$, $R_{cw}\lt 0$. As the system approaches a solution for the correctly labeled data during the late stage of phase I, the directed learning activity from the mislabeled data (A0,w ) increases sharply, and A0,w becomes comparable with A0,c in phase II (see the inset in figure 4(B)). In addition, the two mean gradients (gc and gw ) are opposite to each other, with Rcw ≈ − 1. As a result of the balanced gradients between the two datasets, the overall directed activity is small ($A_0 \ll A_{0,c(w)}$) and the loss functions (Lc , Lw , and L) remain relatively flat during phase II (see figure 3(A)).
  • Phase III: $A_{0,w}\approx A_{0,c}$, $R_{cw}\gt0$. The system enters phase III when it finally finds a direction that decreases both loss functions (Lw and Lc ) as evidenced by the alignment of gc and gw , which only happens during phase III. This alignment ($R_{cw}\gt0$) means that the system can finally learn a solution for all the training data.
  • Phase IV: $A_{0,w}\approx A_{0,c}$, $R_{cw}\lt0$. Once the system finds a solution for all data, learning slows down to explore other solutions nearby. Phase IV is similar to the exploration phase without mislabeled data, where learning activity is much reduced compared to that of phases I–III.

The key differences between the four phases, in terms of the strength and relative direction of the two mean gradients (gc and gw ), are illustrated in figure 4(D).

We have also analyzed the noise spectra in the different learning phases in the presence of labeling noise. As shown in figure 5, the normalized spectra remain roughly the same in the different learning phases, and the effective dimensions are $D_s \approx$ 43, 58, 140, and 95 in phases I–IV, respectively, which are much smaller than the number of parameters. We note that both the noise spectra and the effective noise dimensions are similar to those without labeling noise (figure 2).


Figure 5. The noise spectra, i.e. rank-ordered eigenvalues $\{e_l,\; l = 1,2...,N_p\}$ in different phases of learning with labeling noise (using the same settings as in figure 4). The inset shows the normalized accumulated variance $D^{-1}\sum_{i = 1}^{l} e_i$. The spectra are similar, except for their total variance D. In the different phases, the effective dimension Ds varies in the range of 50–150, which is much smaller than the number of parameters (Np  = 2500).


5. Identifying and cleansing the mislabeled samples in phase II

Our study so far has used various ensemble-averaged properties to demonstrate the different phases of learning dynamics. We now investigate the distribution of losses for individual samples and how the individual loss distribution evolves with time. In figure 6(A), we show the probability distribution functions (PDFs)—Pc (l, t) and Pw (l, t)—for the individual losses of the correctly and incorrectly labeled samples at different times during training. Starting with an identical distribution at time zero, the two distributions quickly separate during phase I as Pc (l, t) moves to smaller losses while Pw (l, t) moves to slightly higher losses. The separation between the two distributions increases during phase I and reaches its maximum during phase II. After the system enters phase III, the gap between the two distributions closes quickly as the system learns the mislabeled data and Pw (l, t) catches up with Pc (l, t) at small losses. In phase IV, these two distributions become indistinguishable again as they both become highly concentrated at near-zero losses.


Figure 6. The individual loss distribution and the cleansing method. (A) The loss distributions of correctly labeled samples (red) and mislabeled samples (blue) in different learning phases. (B) The bimodal distribution in phase II can be fitted by a Gaussian mixture model (red line), which is used to determine a threshold lc for cleansing. (C) The mean losses (symbols) predicted by the Gaussian mixture model agree with their true values from experiments (lines). A cleansing time tc can be determined when $\Delta L(\equiv m_w-m_c)$ reaches its maximum. (D) The test accuracy without cleansing (an ), with cleansing (ac ), and with only the correctly labeled training data (ap ) versus training time. The labeling noise level ρ = 50% for (A)–(D). (E) an , ac , and ap versus ρ. The slight decrease in ap as ρ increases is due to the decreasing size of the correctly labeled dataset. The MNIST dataset and network used here are the same as those in figure 3.


As a result of the different dynamics of the two distributions, the overall individual loss distribution $P(l) = (1-\rho)P_c(l)+\rho P_w(l)$ exhibits a bimodal behavior, which is most pronounced during phase II. We can fit the overall distribution using a Gaussian mixture model: $l\sim (1-r) \mathcal{N}(m_c,s_c^2) + r \mathcal{N}(m_w,s_w^2)$, with the following fitting parameters: fraction r, means mc,w , and variances $s^2_{c,w}$. As shown in figure 6(B), the Gaussian mixture model fits P(l) well; furthermore, the fitted means mc and mw agree with the mean losses (Lc and Lw ) obtained from experiments.
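A minimal sketch of this mixture fit, using scikit-learn's GaussianMixture, is given below; the function name and the choice of the midpoint between the two component means as the cleansing threshold lc are ours (as noted below, using the crossing point of the two PDFs gives similar results).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_loss_mixture(losses):
    """Fit a two-component Gaussian mixture to the individual sample losses and
    return the component means (m_c <= m_w) and a simple threshold l_c."""
    gmm = GaussianMixture(n_components=2, random_state=0).fit(losses.reshape(-1, 1))
    m_c, m_w = np.sort(gmm.means_.ravel())   # clean / mislabeled component means
    l_c = 0.5 * (m_c + m_w)                  # threshold: midpoint of the two means
    return m_c, m_w, l_c
```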

The separation of the individual loss distributions has recently been used to devise sophisticated methods to improve generalization, such as those reported in [14, 15]. Here, we demonstrate the basic idea by presenting a simple method to identify and remove the mislabeled samples based on our understanding of the different learning phases. In particular, according to our analysis, such a cleansing process is best done during phase II. For simplicity, we set the time tc for cleansing to be when the difference $\Delta L(\equiv m_w-m_c)$ reaches its maximum. At t = tc , we can set a threshold lc that best separates the two distributions. For example, we can set lc as the loss at which the two PDFs are equal or simply as the average of mc and mw (we do not observe significant differences between the two choices). We can then remove all the data that have a loss larger than lc and continue training with the cleansed dataset. Alternatively, we can stop the training altogether at t = tc , i.e. early stopping. In our experiments, we did not observe significant differences between these two choices. In figure 6(D), the test accuracies an (without cleansing), ac (with cleansing), and ap (with only the correctly labeled data) are shown for MNIST data with ρ = 50% labeling noise. The performance of the cleansing algorithm can be measured by $Q = \frac{a_c-a_n}{a_p-a_n}$, which depends on the noise level ρ. As shown in figure 6(E), the cleansing method achieves a significant improvement in generalization ($Q\gt50\%$) for noise levels up to ρ = 80%. The details of the data cleansing procedure are described in the supplementary materials.
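The cleansing step itself then reduces to a simple filter, sketched below together with the quality measure Q; the array names and interface are illustrative assumptions.

```python
import numpy as np

def cleanse_dataset(X, Z_noisy, losses, l_c):
    """At t = t_c, drop every sample whose individual loss exceeds the threshold
    l_c, then continue training on the retained data (or simply stop at t_c)."""
    keep = losses < l_c
    return X[keep], Z_noisy[keep]

def cleansing_quality(a_n, a_c, a_p):
    """Q = (a_c - a_n) / (a_p - a_n): Q = 0 means no benefit over training on the
    noisy data; Q = 1 matches training on only the correctly labeled data."""
    return (a_c - a_n) / (a_p - a_n)
```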

6. Summary

DLNNs have demonstrated tremendous capability for learning and problem solving in diverse domains. However, the mechanism underlying this seemingly magical learning ability is not well understood. For example, modern DNNs often contain more parameters than training samples, which allows them to interpolate (memorize) all the training samples, even if their labels are replaced by pure noise [16, 17]. Remarkably, despite their huge capacity, DNNs can achieve small generalization errors on real data (this phenomenon has been formalized as the so-called 'double descent' curve [18–23]). The learning system/model seems to be able to self-tune its complexity in accordance with the data to find the simplest possible solution in a highly over-parameterized weight space. However, the way in which the system adjusts its complexity dynamically, and how SGD seeks out simple and more generalizable solutions for realistic learning tasks, remain poorly understood.

In this paper, we demonstrate that our approach based on statistical physics and stochastic dynamical systems provides a useful theoretical framework (an alternative to the traditional theorem-proving approach) for studying SGD-based machine learning by applying it to the identification and characterization of the different phases of SGD-based learning, with and without labeling noise. In an earlier work [12], we used this approach to study the relation between SGD dynamics and the loss function landscape, and discovered an inverse relation between weight variance and the loss landscape flatness that is the opposite of the fluctuation–dissipation relation (akin to the Einstein relation) in equilibrium systems. We believe this framework may pave the way for a deeper understanding of deep learning by bringing powerful ideas (e.g., phase transitions in critical phenomena) and tools (e.g., renormalization group theory and replica methods) from statistical physics to bear on understanding ANNs. It would be interesting to use this general framework to address other fundamental questions in machine learning, such as generalization [24–26] (in particular, the mechanism for the double descent behavior in learning as described above), the relation between task complexity and network architecture, and information flow in DNNs [27, 28], as well as building a solid theoretical foundation for important applications, such as transfer learning [29], curriculum learning [30], and continuous learning [31–33].

Acknowledgments

We thank Mark Wegman, Haifeng Qian and Tom Theis for discussions. The work by Y.F. was done when he was an IBM intern.

Data availability statement

The data that support the findings of this study are available upon reasonable request from the authors.
