Journal of Hydrology

Volume 476, 7 January 2013, Pages 97-111
A comparison of methods to avoid overfitting in neural networks training in the case of catchment runoff modelling

https://doi.org/10.1016/j.jhydrol.2012.10.019

Summary

Artificial neural networks (ANNs) have become a very popular tool in hydrology, especially in rainfall–runoff modelling. However, a number of issues must be addressed to apply this technique to a particular problem efficiently, including the selection of the network type, its architecture, a proper optimization algorithm and a method to deal with overfitting of the data. The present paper addresses the last, rarely considered issue, namely the comparison of methods to prevent multi-layer perceptron neural networks from overfitting the training data in the case of daily catchment runoff modelling. Among the methods to avoid overfitting, early stopping, noise injection and weight decay have been known for about two decades; however, only the first is frequently applied in practice. Recently a new methodology called the optimized approximation algorithm has been proposed in the literature.

Overfitting of the training data deteriorates the generalization properties of the model and results in untrustworthy performance when the model is applied to novel measurements. Hence the purpose of methods to avoid overfitting is somewhat contradictory to the goal of optimization algorithms, which aim at finding the best possible solution in parameter space according to a pre-defined objective function and the available data. Moreover, different optimization algorithms may perform better for simpler or larger ANN architectures. This suggests the importance of properly coupling optimization algorithms, ANN architectures and methods to avoid overfitting of real-world data – an issue that is also studied in detail in the present paper.

The study is performed for the Annapolis River catchment, characterized by significant seasonal changes in runoff, rapid floods during winter and spring, moderately dry summers, and severe winters with snowfall, snow melting, frequent freeze and thaw, and the presence of river ice. The present paper shows that the elaborated noise injection method may prevent overfitting slightly better than the most popular early stopping approach. However, the implementation of noise injection in real-world problems is difficult, and the final model performance depends significantly on a number of very technical details, which somewhat limits its practical applicability. It is shown that the optimized approximation algorithm does not improve the results obtained by the older methods, possibly due to its over-simplified stopping criterion. Extensive calculations reveal that the Evolutionary Computation-based algorithm performs better for simpler ANN architectures, whereas the classical gradient-based Levenberg–Marquardt algorithm is able to benefit from additional input variables, representing precipitation and snow cover from one more previous day, and from more complicated ANN architectures. This confirms that the curse of dimensionality has a severe impact on the performance of Evolutionary Computing methods.

Highlights

► Comparison of different methods to avoid ANN overfitting to data.
► Noise injection outperforms the early stopping and optimized approximation algorithm methods.
► Different results were obtained using Evolutionary Computation-based and gradient-based algorithms for ANN training.
► Rainfall–runoff modelling by means of MLP ANNs provides reasonable results.

Introduction

During the last 20 years artificial neural networks (ANNs) (Haykin, 1999) have become very popular in various scientific disciplines (Paliwal and Kumar, 2009, Wen et al., 2009, Al-Garni, 2010). Within the field of hydrology, different Artificial Intelligence methods, including ANNs, have also gained much popularity (Maier and Dandy, 2000, Cheng et al., 2002, Cheng et al., 2008, Muttil and Chau, 2006, Lin et al., 2006, Piotrowski et al., 2007, Maier et al., 2010, Acharya et al., 2012, Huo et al., 2012, Nourani et al., 2012). ANN applications to rainfall–runoff modelling are plentiful and include ASCE Task Committee (2000), Solomatine (2003), Cherkassky et al. (2006), Dawson et al. (2006), Piotrowski et al. (2006), Solomatine and Ostfeld (2008), Wu et al. (2009), Siou et al. (2011) and Wu and Chau (2011). Among different ANN types, multi-layer perceptron neural networks (MLPs) are especially popular due to their simplicity, relatively low number of parameters, clear biological inspirations and the debate on whether or not they may be considered universal approximators (Hecht-Nielsen, 1987, Girosi and Poggio, 1989, Nakamura et al., 1993, Braun and Griebel, 2009). MLPs are also of special interest in the present paper.

Recently Wang et al. (2009) and Elshorbagy et al. (2010) presented comparisons of various Artificial Intelligence techniques for rainfall–runoff forecasting, encouraging the search for novel methods to improve ANN training and the selection of their different features. Apart from choosing the neural network type, the successful application of a neural network to a particular problem requires the determination of a model architecture (which defines the number of parameters), an optimization algorithm and a method to avoid overfitting. However, in practice ANNs are frequently used out of hand without discussing such details, which may have a significant impact on model performance.

The present paper is a continuation of the Piotrowski and Napiorkowski (2011) study, which aimed at choosing the best optimization method for training MLPs applied to daily catchment runoff forecasting in colder climate zones. The main objective of the current paper is the comparison of different methods to avoid overfitting when MLPs are applied to a similar task. We pay special attention to the noise injection approach, which is rarely considered in hydrological applications. The performance of a new method called the optimized approximation algorithm, and of early stopping, the most popular approach to deal with overfitting, is also studied in detail. However, the methods to avoid overfitting cannot be compared or discussed apart from the ANN architecture and the training algorithm. For instance, some optimization algorithms perform poorly in uncertain environments (Jin and Branke, 2005), of which neural networks trained with the noise injection method may be an example. This emphasizes the importance of properly coupling methods to avoid overfitting with training algorithms. On the other hand, some optimization methods may quickly converge to good solutions for simple ANN architectures with a small number of parameters, but perform poorly for more complicated ones with more parameters. One may note that different ANN architectures mean different numbers of parameters and different fitness landscapes, hence formally different problems. It is well known that the performance of optimization algorithms depends on the problem. This was verified empirically (e.g. Epitropakis et al., 2011); it was also proved (Wolpert and Macready, 1997, Wolpert and Macready, 2005) that under certain assumptions the performance of any two algorithms averaged over all possible problems (fitness landscapes) is equal. Of course, in practice few people may be interested in all problems, but the proof presented in Wolpert and Macready (1997) carries an important warning: an optimization algorithm applied to a novel task can fail even if it was successful in solving some other problems. In the present paper, first, two training methods are chosen based on previous findings in the literature, one gradient-based and one based on Evolutionary Computation (EC). Then the optimal set of input variables and the MLP architecture for each training algorithm are found experimentally. Finally, three different methods to avoid ANN overfitting are compared for the chosen ANN architectures and training algorithms. Below we very briefly introduce the main features of MLP training algorithms, architectures and methods to avoid overfitting.

A number of studies have addressed the application of different optimization algorithms to ANN training for various regression problems. The most popular optimization approaches are gradient-based methods, among which the Levenberg–Marquardt (LM) algorithm (Press et al., 2006) is considered one of the most efficient (see e.g. Adamowski and Karapataki, 2010). In a few studies EC methods were applied to the same problems – with various opinions on their performance. Some papers suggested an advantage of EC algorithms over gradient-based methods for different ANN training tasks (Sexton and Gupta, 2000, Jain and Srinivasulu, 2004, Martinez-Estudillo et al., 2006, Chau, 2006, Zhang et al., 2007, Huang et al., 2009), but a number of other studies showed that EC approaches are at least not better than gradient-based algorithms in terms of ANN model performance and are, of course, much slower (Mandischer, 2002, Ilonen et al., 2003, Socha and Blum, 2007). Motivated by this diversity of opinions, the authors of the present study conducted a detailed survey of recently developed EC methods from two “families” – Differential Evolution (DE) (Storn and Price, 1995) and Particle Swarm Optimization (Kennedy and Eberhart, 1995). Eight EC algorithms were compared with the LM method for MLP training with the early stopping approach for rainfall–runoff modelling of the Annapolis River, Nova Scotia, Canada (Piotrowski and Napiorkowski, 2011). The results are generally in agreement with the opinions of Mandischer (2002), Ilonen et al. (2003) and Socha and Blum (2007). Only one of the EC algorithms, namely Differential Evolution with Global and Local neighbors (DEGL, Das et al., 2009), showed performance similar to the LM method, with much slower convergence. It is worth noting that DEGL was also found to be among the best EC-based algorithms, along with Grouped Differential Evolution (GDE) (Piotrowski and Napiorkowski, 2010) and Self-Adaptive Differential Evolution (SADE) (Qin et al., 2009), in training MLPs applied to the estimation of the longitudinal dispersion coefficient in rivers (Piotrowski et al., 2012b). In that study the number of data points was very small (under 100 observations), imposing the use of a very simple MLP architecture; the objective function was non-differentiable and a kind of noise injection method was used to avoid overfitting. This suggests that DEGL may be well suited for MLP training in general. Based on the above findings, in the present paper the LM and DEGL methods are used as training algorithms.
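For reference, the following is a minimal Python sketch of the DEGL donor-vector construction, following the general scheme of Das et al. (2009). It is not code from the present study: the neighbourhood radius k, the scale factors alpha and beta, and the fixed combination weight w are illustrative assumptions (in the original algorithm the weight may also be self-adapted).

```python
import numpy as np

def degl_donor(pop, fitness, i, k=2, alpha=0.8, beta=0.8, w=0.5, rng=None):
    """Sketch of the DEGL donor vector (after Das et al., 2009).

    pop     : (NP, D) array of candidate solutions (e.g. MLP weight vectors).
    fitness : (NP,) objective values, lower is better (e.g. training MSE).
    i       : index of the target vector; k : ring-neighbourhood radius.
    """
    rng = rng or np.random.default_rng()
    NP = len(pop)

    # Ring neighbourhood of the target (indices wrap around the population).
    nbh = [(i + j) % NP for j in range(-k, k + 1)]
    n_best = min(nbh, key=lambda j: fitness[j])
    p, q = rng.choice([j for j in nbh if j != i], size=2, replace=False)

    # Local donor: attracted to the best vector in the neighbourhood.
    local = pop[i] + alpha * (pop[n_best] - pop[i]) + beta * (pop[p] - pop[q])

    # Global donor: attracted to the best vector of the whole population.
    g_best = int(np.argmin(fitness))
    r1, r2 = rng.choice([j for j in range(NP) if j != i], size=2, replace=False)
    glob = pop[i] + alpha * (pop[g_best] - pop[i]) + beta * (pop[r1] - pop[r2])

    # Weighted combination; the donor then undergoes the usual DE
    # crossover and greedy selection against the target vector.
    return w * glob + (1 - w) * local
```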

The architecture of an MLP defines the number of parameters to be optimized. This architecture should always be adapted to the problem (Zhang et al., 1998, Mahmoud and Ben-Nahki, 2003, Siou et al., 2011, De et al., 2011), as it depends on the number of input and output variables, the number and quality of the available data, the presence of noise in the data, etc. Over-parameterization may have a significant negative impact on the performance of neural networks, also in the case of rainfall–runoff modelling (Gaume and Gosset, 2003). A smaller architecture usually yields better generalization properties (Haykin, 1999) and is easier to train, especially by means of non-gradient-based methods. Some Evolutionary Computation algorithms were proposed to determine an optimal ANN architecture (Castillo et al., 2000, Huang et al., 2009) and were applied to hydrological problems (Chen and Chang, 2009); however, they are applicable rather to problems where neither expert knowledge is available nor a physically-based choice of input variables and model complexity is possible. Although a number of other methods to develop an ANN architecture exist (Sietsma and Dow, 1991, Wang et al., 1994, Islam et al., 2009, Ssegane et al., 2012, Nourani and Sayyah Frad, 2012), they usually rely on heuristic or subjective decisions and none is widely applied (Zhang et al., 1998).

The impact of different methods used to avoid overfitting on ANN performance, which is the main focus of the present paper, has rarely been studied in the literature, and such papers usually dealt with classification problems (Holmstrom and Koistinen, 1992, Hua et al., 2006, Zur et al., 2009) or used artificial functions for comparison (Holmstrom and Koistinen, 1992, Reed et al., 1995). Only Giustolisi and Laucelli (2005) studied the impact of a number of methods to avoid ANN overfitting for hydrological data, namely in the case of rainfall–runoff modelling for two very small catchments (up to 5 km²) in Italy. However, EC-based optimization methods were not used, and the popular noise injection method based on maximization of the cross-validated likelihood (Holmstrom and Koistinen, 1992) was not compared. The early stopping technique led to poor results in Giustolisi and Laucelli (2005), which may be surprising, as it is a very popular and usually successful approach to avoid overfitting. Moreover, recently a novel methodology called the optimized approximation algorithm (Liu et al., 2008) was proposed and gained much interest. The present paper tries to fill the gaps left by Giustolisi and Laucelli (2005), Hua et al. (2006) and Zur et al. (2009) and presents a comparison of catchment runoff modelling results obtained when the three techniques designed to avoid overfitting are coupled with MLPs of different architectures and gradient-based or EC-based optimization algorithms. The neural networks are applied to runoff forecasting for the Annapolis River, which is located in a moderate climate zone.
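To make the noise injection idea concrete, here is a minimal Python sketch (not taken from the original paper). Each training epoch the network sees a freshly jittered copy of the inputs with Gaussian noise of spread h; the helper `train_one_epoch` is a hypothetical placeholder, and in practice h would be chosen by the cross-validated likelihood procedure of Holmstrom and Koistinen (1992) rather than fixed by hand.

```python
import numpy as np

def jittered_inputs(X_train, h, rng):
    """Return a perturbed copy of the training inputs for one epoch.

    h is the noise spread (standard deviation of the added Gaussian
    noise); training on freshly jittered inputs each epoch smooths the
    fitted mapping and discourages overfitting of the raw data.
    """
    return X_train + rng.normal(0.0, h, size=X_train.shape)

# Hypothetical usage inside a training loop (train_one_epoch stands for
# one pass of any optimizer, e.g. LM or DEGL):
#
# rng = np.random.default_rng(0)
# for epoch in range(n_epochs):
#     X_noisy = jittered_inputs(X_train, h, rng)
#     train_one_epoch(net, X_noisy, y_train)
```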

Section snippets

Study area and hydro-meteorological data

The present paper is a continuation of the Piotrowski and Napiorkowski (2011) study for the same catchment, namely the upper part of the Annapolis River (Nova Scotia, Canada) up to the Wilmot settlement, with an area of 546 km². Hydrological and meteorological data are available from the Water Survey of Canada and Canada’s National Climate Data and Information Archive for the gauge station situated in the Wilmot settlement (44°56′57″N, 65°01′45″W) and the meteorological station at Greenwood Airfield (44°58′40″N, 64°55′33″W),

Multi-layer perceptron artificial neural networks and optimization algorithms

The MLP neural network (Haykin, 1999) is a nonlinear data-based model that approximates the values of output variables (y) dependent on a set of input variables (x). An MLP is formed by several nodes arranged in groups called layers (see Fig. 3). Usually three layers, an input layer, a hidden layer, and an output layer, are sufficient in practice (Haykin, 1999; see also the real-world data applications in De et al., 2011 and Siou et al., 2011). The number of nodes in the input and output layers is determined
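As a concrete illustration, a minimal numpy sketch of such a three-layer mapping follows (not taken from the paper; the tanh hidden activation, layer sizes and random initialization are illustrative assumptions).

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Three-layer MLP: input -> one tanh hidden layer -> linear output.

    x  : (D,) input vector (e.g. lagged runoff and meteorological variables).
    W1 : (H, D) input-to-hidden weights, b1 : (H,) hidden biases.
    W2 : (O, H) hidden-to-output weights, b2 : (O,) output biases.
    """
    h = np.tanh(W1 @ x + b1)    # hidden-layer activations
    return W2 @ h + b2          # linear output (e.g. predicted runoff)

# Example: 7 inputs, 5 hidden nodes, 1 output, small random initial weights.
rng = np.random.default_rng(0)
D, H, O = 7, 5, 1
params = (rng.normal(0, 0.1, (H, D)), np.zeros(H),
          rng.normal(0, 0.1, (O, H)), np.zeros(O))
y_hat = mlp_forward(rng.normal(size=D), *params)
```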

Methods to avoid neural network overfitting

To be successfully applied in practice, an ANN should have the ability to generalize the input–output mapping. In other words, the model should be able to correctly approximate observations not included in the training set (Geman et al., 1992). In the case of catchment runoff modelling this means the ability to make good runoff predictions for future hydro-meteorological conditions. To allow proper generalization capabilities one must avoid ANN overfitting of the training data, i.e. model should be fitted
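The following is a minimal, optimizer-agnostic sketch of the early stopping procedure, the most popular of the methods considered here (an illustration, not the paper's implementation; `train_step` and `val_error` are hypothetical callables supplied by the user).

```python
import copy

def train_with_early_stopping(net, train_step, val_error,
                              max_epochs=1000, patience=50):
    """Keep the weights that minimise the error on a held-out validation
    set, and stop once it has not improved for `patience` epochs.

    net        : any mutable model object (e.g. an MLP weight container).
    train_step : callable performing one training epoch on `net`.
    val_error  : callable returning the current validation-set MSE.
    """
    best_err, best_net, stall = float("inf"), copy.deepcopy(net), 0
    for epoch in range(max_epochs):
        train_step(net)
        err = val_error(net)
        if err < best_err:
            best_err, best_net, stall = err, copy.deepcopy(net), 0
        else:
            stall += 1
            if stall >= patience:
                break          # validation error stopped improving
    return best_net, best_err
```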

Selection of input variables and MLP architecture

To predict one-day-ahead runoff Q(t + 1), different variants of input variables are considered (see Table 2). The best MLP architecture is chosen according to the MSE criterion, Eq. (3). In the simplest combination of inputs, only the most recent measurements of meteorological variables are used, namely UT(t), LT(t), RF(t), SF(t), SC(t), together with the two last runoff measurements Q(t) and Q(t − 1), which gives seven input variables in total. Each of them is physically important for runoff forecasting – UT
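A short Python sketch of how this seven-input variant can be assembled from the daily series (illustrative only; the variable names follow the notation above).

```python
import numpy as np

def build_dataset(UT, LT, RF, SF, SC, Q):
    """Assemble the seven-input variant: UT(t), LT(t), RF(t), SF(t),
    SC(t), Q(t), Q(t-1) as predictors of the target Q(t+1).

    All arguments are 1-D daily series of equal length T; rows of X
    correspond to t = 1, ..., T-2, so that Q(t-1) and Q(t+1) exist.
    """
    UT, LT, RF, SF, SC, Q = map(np.asarray, (UT, LT, RF, SF, SC, Q))
    t = np.arange(1, len(Q) - 1)
    X = np.column_stack([UT[t], LT[t], RF[t], SF[t], SC[t], Q[t], Q[t - 1]])
    y = Q[t + 1]
    return X, y
```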

Results and discussion

Three criteria are used in the present paper to compare the results obtained by means of different methods to avoid ANN overfitting, namely mean square error (MSE), mean absolute error (MAE), and Nash–Sutcliffe coefficient (NSC).

During the optimization MSE is used as the objective function (Eq. (3)). MAE is defined as

$$\mathrm{MAE} = \frac{1}{N}\sum_{n=1}^{N}\left|y_n^P - y_n\right|$$

The NSC, very popular in river runoff forecasting, is computed according to the following equation:

$$\mathrm{NSC} = 1 - \frac{\frac{1}{N}\sum_{n=1}^{N}\left(y_n^P - y_n\right)^2}{\frac{1}{N}\sum_{n=1}^{N}\left(y_n - y^a\right)^2}, \qquad y^a = \frac{1}{N}\sum_{n=1}^{N} y_n$$

The maximum
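The three criteria are straightforward to compute; a short numpy sketch consistent with the definitions above (an illustration, not the paper's code):

```python
import numpy as np

def mse(y_pred, y):
    """Mean square error (the objective function during optimization)."""
    y_pred, y = np.asarray(y_pred), np.asarray(y)
    return np.mean((y_pred - y) ** 2)

def mae(y_pred, y):
    """Mean absolute error."""
    y_pred, y = np.asarray(y_pred), np.asarray(y)
    return np.mean(np.abs(y_pred - y))

def nsc(y_pred, y):
    """Nash-Sutcliffe coefficient: 1 for a perfect model, 0 for a model
    no better than the mean of the observations."""
    y_pred, y = np.asarray(y_pred), np.asarray(y)
    return 1.0 - np.sum((y_pred - y) ** 2) / np.sum((y - y.mean()) ** 2)
```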

Conclusions

The present paper aims at a comparison of a number of techniques to avoid ANN overfitting in the case of catchment runoff modelling in an area located in a moderately cold climate zone. Three methods were considered, namely noise injection with the spread factor h estimated by means of maximizing the cross-validation likelihood function (Holmstrom and Koistinen, 1992), the optimized approximation algorithm proposed by Liu et al. (2008) and the most popular early stopping (Prechelt, 1998,

Acknowledgments

This work has been supported by the Inner Grant of the Institute of Geophysics, Polish Academy of Sciences Nr. 1b/IGF PAN/2012/MŁ.

References (88)

  • H.R. Maier et al.

    Methods used for the development of neural networks for the prediction of water resource variables in river systems: current status and future directions

    Environ. Modell. Softw.

    (2010)
  • M. Mandischer

    A comparison of evolution strategies and backpropagation for neural network training

    Neurocomputing

    (2002)
  • A. Martinez-Estudillo et al.

    Evolutionary product unit based neural networks for regression

    Neural Networks

    (2006)
  • V. Nourani et al.

    Sensitivity analysis of the artificial neural network outputs in simulation of the evaporation process at different climatological regimes

    Adv. Eng. Softw.

    (2012)
  • A.P. Piotrowski et al.

    Optimizing neural networks for river flow forecasting – evolutionary computation methods versus the Levenberg–Marquardt approach

    J. Hydrol.

    (2011)
  • R.S. Sexton et al.

Comparative evaluation of genetic algorithm and backpropagation for training neural networks

    Inf. Sci.

    (2000)
  • J. Sietsma et al.

    Creating artificial neural networks that generalize

    Neural Networks

    (1991)
  • L.K.A. Siou et al.

    Complexity selection of a neural network model for karst flood forecasting: the case of the Lez Basin (southern France)

    J. Hydrol.

    (2011)
  • H. Ssegane et al.

    Advances in variable selection methods I: causal selection methods versus stepwise regression and principal component analysis on data of known and unknown functional relationships

    J. Hydrol.

    (2012)
  • A. Varhola et al.

    Forest canopy effect on snow accumulation and ablation: an integrative review of empirical results

    J. Hydrol.

    (2010)
  • W.C. Wang et al.

    A comparison of performance of several artificial intelligence methods for forecasting monthly discharge time series

    J. Hydrol.

    (2009)
  • Z. Wang et al.

    A procedure for determining the topology of multilayer feedforward neural networks

    Neural Networks

    (1994)
  • U.P. Wen et al.

    A review of Hopfield neural networks for solving mathematical programming problems

    Eur. J. Oper. Res.

    (2009)
  • C.L. Wu et al.

    Methods to improve neural network performance in daily flow prediction

    J. Hydrol.

    (2009)
  • C.L. Wu et al.

    Rainfall–runoff modeling using artificial neural network coupled with singular spectrum analysis

    J. Hydrol.

    (2011)
  • G. Zhang et al.

    Forecasting with artificial neural networks: the state of the art

    Int. J. Forecast.

    (1998)
  • J.R. Zhang et al.

    A hybrid particle swarm optimization – back-propagation algorithm for feedforward neural network training

    Appl. Math. Comput.

    (2007)
  • N. Acharya et al.

    A neurocomputing approach to predict monsoon rainfall in monthly scale using SST anomaly as a predictor

    Acta Geophys.

    (2012)
  • J. Adamowski et al.

    Comparison of multivariate regression and artificial neural networks for peak urban water-demand forecasting: evaluation of different ANN learning algorithms

    J. Hydrol. Eng.

    (2010)
  • M.A. Al-Garni

    Interpretation of spontaneous potential anomalies from some simple geometrically shaped bodies using neural network inversion

    Acta Geophys.

    (2010)
  • S. Amari et al.

    Asymptotic statistical theory of overfitting and cross-validation

    IEEE Trans. Neural Networks

    (1997)
  • G. An

    The effect of adding noise during backpropagation training on a generalization performance

    Neural Comput.

    (1996)
  • ASCE Task Committee, 2000. Artificial neural networks in hydrology. II: hydrologic applications. J. Hydrol. Eng. 5(2),...
  • J. Braun et al.

    On a constructive proof of the Kolmogorov’s superposition theorem

    Constr. Approx.

    (2009)
  • W.M. Brown et al.

    Use of noise to augment training data: a neural network method of mineral–potential mapping in regions of limited known deposit examples

    Nat. Resour. Res.

    (2003)
  • C.T. Cheng et al.

    Optimizing hydropower reservoir operation using hybrid genetic algorithm and chaos

    Water Resour. Manage.

    (2008)
  • V. Cherkassky et al.

    Computational intelligence in earth sciences and environmental applications: issues and challenges

    Neural Networks

    (2006)
  • S. Das et al.

    Differential evolution using a neighborhood-based mutation operator

    IEEE Trans. Evol. Comput.

    (2009)
  • S. Das et al.

    Differential evolution – a survey of the state-of-the-art

    IEEE Trans. Evol. Comput.

    (2011)
  • S.S. De et al.

    Identification of the best architecture of a multilayer perceptron in modelling daily total ozone concentration in Kolkata, India

    Acta Geophys.

    (2011)
  • B. Dorronsoro et al.

    Improving classical and decentralized differential evolution with new mutation operator and population topologies

    IEEE Trans. Evol. Comput.

    (2011)
  • A. Elshorbagy et al.

    Experimental investigation of the predictive capabilities of data driven modeling techniques in hydrology – part 2: application

    Hydrol. Earth Syst. Sci.

    (2010)
  • M.G. Epitropakis et al.

    Enhancing differential evolution utilizing proximity-based mutation operators

    IEEE Trans. Evol. Comput.

    (2011)
  • E. Gaume et al.

    Over-parameterisation, a major obstacle to the use of artificial neural networks in hydrology?

    Hydrol. Earth Syst. Sci.

    (2003)