Research papers
Comparison of four learning-based methods for predicting groundwater redox status
Introduction
The sustainability of New Zealand’s freshwater resource is facing increasing pressure from agricultural nitrate leaching (Ministry for the Environment, 2007), irrigation demand, and current and future climate change effects (Robertson et al., 2016). The integrity of groundwater quality is of increasing concern because about 40% of the population depends on groundwater for drinking water supply (Ministry for the Environment, 2007), and nutrient-rich baseflow contributions are impacting the health of lowland streams (Ministry for the Environment, 2007). To address the deterioration of freshwater quality, regulators are required to establish nitrate leaching and water-quality limits by 2025 (Ministry for the Environment, 2007). To sustain agricultural production, these limits need to account for attenuation that occurs along groundwater flow paths. Knowing the location where groundwater denitrification occurs, or by proxy the groundwater redox status, thus forms an important link between agricultural leaching sources and water-quality objectives.
The best tools currently available for mapping the redox status for groundwater denitrification are predictive models (Koch et al., 2019). As the complexity of real-world groundwater systems increases from catchment to regional or national scales, it becomes difficult and often impractical to make spatial predictions with process-based models. Learning-based modeling is an alternative approach that predicts the distribution of groundwater redox status based solely on the analysis of available measurements. This approach is possible because learning-based models build relationships between state variables (input, internal and output variables) using a limited number of assumptions about the physical behaviour of the system (Solomatine et al., 2009). That said, developing a learning-based model is often challenging because failure can occur at any one of several model-building steps: choice of response variables, choice of predictor variables, choice of model architecture, choice of model structure and complexity, choice of model parameters, model training, testing and validation, prediction, and uncertainty quantification. Ultimately, model performance under this approach is limited by the quality of the available data.
Learning-based groundwater-quality models are grouped by the type of problem being solved, such as regression or classification (Solomatine et al., 2009). Regression problems typically involve predicting a single response variable, such as nitrate concentration, by learning a function that maps inputs to outputs. Classification is a special form of learning-based modeling in which the problem involves identifying the sub-population to which a new observation belongs, such as redox status (oxic, mixed, or anoxic). Early groundwater studies mainly used linear models, such as logistic regression, to generate probabilistic maps of diffuse nitrate contamination (Nolan et al., 2002; Gurdak and Qi, 2012) and depths to the oxic-suboxic interface (Tesoriero et al., 2015). Spatial predictions of groundwater redox status using Linear Discriminant Analysis (LDA) have previously been made by Lee et al. (2008), Close et al. (2016), and Wilson et al. (2018). These methods tacitly assume that redox status can be modelled as a linear combination of characteristics whose class samples are continuous (not missing) and normally distributed (Martinez and Kak, 2001). These assumptions are limiting when attempting to build models for predicting 3D regional redox conditions with data that are biased (type, frequency, and spatial sampling), disparate (different physics), and sparse (missing samples). Although linear learning-based modeling is used to study various aspects of groundwater systems, the linkages and interactions among climate, hydrological and biogeochemical cycles across spatiotemporal scales favour nonlinear learning-based modeling.
Nonlinear learning-based modeling includes supervised, unsupervised and hybrid machine-learning (ML) algorithms (Green et al., 2016). The methods associated with these algorithms are known to fit nonlinear relationships while accommodating missing data and interactions among the different predictor variables. Supervised ML algorithms analyze training data and produce an inferred function that can be used for mapping new examples. Unsupervised ML algorithms build models by deducing structures present in the input data; this process may be used to extract general rules, reduce redundancy, or organize data by similarity.
Recent applications of supervised ML algorithms include testing the efficacy of the random forest regression method to model the depth of the redox interface across Denmark (Koch et al., 2019), and of the Artificial Neural Network (ANN), Bayesian Network (BN) and Boosted Regression Trees (BRT) methods to predict nitrate concentrations in groundwater of the Central Valley, California (Nolan et al., 2015). In the former and latter studies, the independent validation tests gave respective R² values of 0.48 and 0.25. These results suggest that the models did not generalize well to independent data, possibly because of overfitting given the relatively large number of predictor variables. Also noteworthy in the earlier study is that generalizing the models to holdout (independent) data resulted in a bias, with the model overpredicting low concentrations and underpredicting high concentrations. This phenomenon reveals one of the potential challenges of using learning-based models in the presence of sample frequency bias.
Another potential reason for the poor predictive performance of learning-based models is a high degree of correlation among pairs of predictor variables (Guyon and Elisseeff, 2003). In the case of perfectly correlated predictor variables, increases in one variable offset corresponding decreases in the second variable, with no effect on the training response but with an increase in the associated prediction uncertainty (Low et al., 2013). These findings underscore the need to identify an optimal set of predictor variables through some quantitative feature selection process (Singh et al., 2014). Feature selection algorithms fall into two categories: the filter model and the wrapper model (Das, 2001; Kohavi and John, 1997).
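The offsetting effect described above can be illustrated with a small sketch. The data, the linear model form, and the helper name `sse` are hypothetical and chosen only for illustration; they are not taken from the cited studies. With two identical predictors, every weight pair that sums to the same total fits the training data equally well, yet the pairs disagree as soon as a new location breaks the correlation.

```python
# Illustrative only: two perfectly correlated predictors (x2 equals x1).
x1 = [0.0, 1.0, 2.0, 3.0]
x2 = list(x1)                      # perfect correlation with x1
y = [2.0 * a for a in x1]          # hypothetical training response


def sse(w1, w2):
    """Sum of squared errors of the linear model y = w1*x1 + w2*x2."""
    return sum((w1 * a + w2 * b - t) ** 2 for a, b, t in zip(x1, x2, y))


# Any weight pair with w1 + w2 == 2 reproduces the training response.
fits = {(w1, 2.0 - w1): sse(w1, 2.0 - w1) for w1 in (-3.0, 0.0, 1.0, 2.0, 5.0)}

# At a new location where the two predictors differ, the same weight
# pairs give different predictions, i.e. the prediction is uncertain.
new_x1, new_x2 = 1.0, 2.0
predictions = {w: w[0] * new_x1 + w[1] * new_x2 for w in fits}

for weights, error in fits.items():
    print(weights, "training SSE:", error, "prediction:", predictions[weights])
```

The training error cannot distinguish among these weight pairs, which is one reason to remove redundant predictors before model building.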
The filter model relies on general characteristics of the training data to select features without involving any learning algorithm. For example, Povak et al. (2014) removed one variable from each pair of predictor variables with a strong linear correlation (Pearson coefficient ρ > 0.77). After reducing the number of collinear predictor variables (from 50 to 33), the cross-validation statistics for both BRT and Random Forest (RF) models increased (R² > 0.85), suggesting that the resulting models generalized well to independent data. Other investigators applied a form of backward stepwise regression to BRT and RF models that begins with a full model and at each step eliminates variables to find a reduced model that best explains the data (Ransom et al., 2017; Nolan et al., 2018). Removing unimportant variables is necessary for building robust models that generalize to independent data, but this stepwise process is unlikely to be reliable when it depends on the relative importance reported by the BRT and RF methods. The reason is that relative importance is reported as normalized values. In models with strong correlations among predictor variables, the correlated variables split importance among themselves, which reduces their apparent influence on the prediction and gives a false impression of their true ranked importance.
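A filter of the kind described above can be sketched in a few lines. This is a minimal illustration, not the procedure of Povak et al. (2014): the greedy rule (keep a predictor only if it is not strongly correlated with any predictor already kept), the function names, and the example values are all assumptions. The sketch also assumes that no predictor column is constant, since the Pearson coefficient is undefined for zero variance.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_filter(columns, threshold=0.77):
    """Greedy filter: keep a predictor only if its absolute correlation
    with every previously kept predictor is at or below the threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical predictor table (values invented for illustration).
predictors = {
    "elevation": [10, 20, 30, 40, 50],
    "rainfall": [105, 118, 133, 139, 155],   # tracks elevation closely
    "clay": [3, 1, 4, 1, 5],
}
kept = correlation_filter(predictors)
print(kept)
```

Which member of a correlated pair is retained depends here on column order; a real application would need a defensible rule for that choice.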
The wrapper model requires one predetermined learning algorithm for feature selection and uses its performance to evaluate and determine which features are selected (Yu and Liu, 2003; Calvet et al., 2017). The benefits of nonlinear feature selection in groundwater-quality modeling were demonstrated by Friedel and Buscema (2016) using Learn Heuristics (Metaheuristics in Machine Learning) and by Rodriguez-Galiano et al. (2018), who evaluated filter, embedded and wrapper methods. These studies demonstrated that overfitting can be reduced by identifying an optimal number of predictor variables in accordance with the principle of Occam’s razor: unnecessarily complex models should not be preferred to simpler ones (Encyclopaedia Britannica, 2010).
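The wrapper idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the learn-heuristic approach of the cited studies: it uses sequential forward selection as the search strategy, a nearest-centroid classifier as the predetermined learner, and leave-one-out accuracy as the performance measure. The function names and the six samples are hypothetical.

```python
def loo_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a nearest-centroid classifier that
    uses only the named features."""
    hits = 0
    for i in range(len(rows)):
        sums, counts = {}, {}
        # Build class centroids from every sample except sample i.
        for j, (row, label) in enumerate(zip(rows, labels)):
            if j == i:
                continue
            acc = sums.setdefault(label, [0.0] * len(features))
            for k, f in enumerate(features):
                acc[k] += row[f]
            counts[label] = counts.get(label, 0) + 1

        def dist(label):
            # Squared distance from sample i to the centroid of `label`.
            return sum((rows[i][f] - sums[label][k] / counts[label]) ** 2
                       for k, f in enumerate(features))

        hits += min(sums, key=dist) == labels[i]
    return hits / len(rows)


def forward_select(rows, labels, candidates):
    """Add one feature at a time, stopping when accuracy stops improving."""
    selected, best = [], 0.0
    while True:
        scored = [(loo_accuracy(rows, labels, selected + [f]), f)
                  for f in candidates if f not in selected]
        if not scored:
            break
        score, feature = max(scored)
        if score <= best:
            break
        selected.append(feature)
        best = score
    return selected, best


# Hypothetical samples: depth separates the classes, "noise" does not.
rows = [
    {"depth": 5, "noise": 0.3}, {"depth": 8, "noise": 0.9},
    {"depth": 12, "noise": 0.1}, {"depth": 40, "noise": 0.8},
    {"depth": 55, "noise": 0.2}, {"depth": 60, "noise": 0.7},
]
labels = ["oxic", "oxic", "oxic", "anoxic", "anoxic", "anoxic"]
selected, best = forward_select(rows, labels, ["depth", "noise"])
print(selected, best)
```

Because the stopping rule requires a strict improvement, a feature that adds nothing is left out, which is in the spirit of the Occam's razor principle mentioned above.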
The aim of this study is to predict groundwater redox status across selected agricultural regions of New Zealand. We hypothesize that the redox status classified from groundwater chemistry sampled across regional monitoring networks can be combined with well depths and nationally available climate, geology, hydrology, soils, and topography coverages to provide mutual information (a measure of entropy describing mutual dependence among random variables) suitable for learning-based model building and redox class prediction. To test this hypothesis, we evaluate supervised (LDA, BRT, and RF) and unsupervised (Modified Self-Organizing Map, MSOM) learning-based methods. The objective is to compare the performance of these methods for predicting the probability of groundwater redox status (oxic, mixed, anoxic) and associated depths in groundwater systems of the Tasman, Waikato, and Wellington regions. This study extends the work of Close et al. (2016) and Wilson et al. (2018), who, because LDA is restricted to a single dependent response variable, developed separate models for classifying redox status over shallow and deep zones. In addition to evaluating supervised learning-based methods, we develop an innovative unsupervised learning-based workflow to simultaneously predict four response functions: oxic, mixed, anoxic, and depth. In this approach, the relative benefit of mutual information content in the predictor variables can be evaluated by comparing the probability of predicted redox conditions. In information theory, this concept reflects the mutual pairwise dependence among random variables.
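For readers unfamiliar with the term, mutual information between two discrete variables can be estimated from paired samples as sketched below. This is a simple plug-in (count-based) estimate for illustration only; the function name and the sample values are hypothetical, and the sketch does not represent how the MSOM workflow uses mutual information.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples:
    the sum over observed pairs of p(x,y) * log2(p(x,y) / (p(x) * p(y)))."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
               for (x, y), c in count_xy.items())


# Hypothetical samples: redox class is fully determined by depth class
# but is unrelated to the (invented) soil label.
redox = ["oxic", "oxic", "anoxic", "anoxic"]
depth_class = ["shallow", "shallow", "deep", "deep"]
soil = ["a", "b", "a", "b"]

mi_depth = mutual_information(redox, depth_class)
mi_soil = mutual_information(redox, soil)
print(mi_depth, mi_soil)
```

A predictor that shares more information with the response (here the depth class) is a better candidate for model building than one that shares none.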
Sources of water-quality data
The Tasman, Waikato and Wellington regions of New Zealand were selected to test and compare learning-based methods for predicting groundwater redox status (Fig. 1). Groundwater resources in these regions are considered nationally significant, and the regional councils collect water-quality data from many monitoring wells. Tasman District is in the northern part of the South Island and covers an area of 9650 km². The main groundwater resources in the district lie within the alluvial terraces and coastal
Response variables
Regional groundwater sampling data are inherently biased for both the response variable (redox status) and associated predictor variables (spatial attributes). For example, most samples in the groundwater dataset are oxic (Table 2). This bias in redox status is attributed to the increased frequency of sampling in areas where greater water demand is concomitant with oxic conditions, and/or the increased frequency of shallow sampling across the New Zealand landscape dominated by oxic groundwater (
Conclusions
Results from the unsupervised learning-based method (MSOM) are statistically superior to the three commonly used supervised learning-based methods (LDA, BRT, and RF). The MSOM can simultaneously predict oxic, mixed, and anoxic redox status and their associated depths (four response functions) at unstructured grid locations associated with the predictor variables. The statistical robustness of redox predictions is attributed to using learn heuristics for identifying optimal sets of regional
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
We thank staff at the Waikato Regional Council, Greater Wellington Regional Council, and Tasman District Council for providing sample data, and the Institute of Geological and Nuclear Sciences Limited (Rogier Westerhoff), National Institute of Water and Atmospheric Research (Roddy Henderson), and Landcare Research (James Barringer) for providing spatial attribute data. Funding for this project was from the New Zealand Ministry of Business, Innovation and Employment as a component of the National Science
References (59)
- et al. Predicting groundwater redox status on a regional scale using linear discriminant analysis. J. Contam. Hydrol. (2016)
- et al. Review of the self-organizing map (SOM) approach in water resources: analysis, modeling and application. Environ. Model. Softw. (2008)
- et al. Wrappers for feature subset selection. Artif. Intell. (1997)
- et al. Soil and informatics science combine to develop S-map: a new generation soil information system for New Zealand. Geoderma (2012)
- et al. Impact of feature selection on the accuracy and spatial uncertainty of per-field crop classification using support vector machines. ISPRS J. Photogramm. Remote Sens. (2013)
- et al. A statistical learning framework for groundwater nitrate models. J. Hydrol. (2015)
- et al. Metamodeling and mapping of nitrate flux in the unsaturated zone and groundwater, Wisconsin, USA. J. Hydrol. (2018)
- et al. Neural virtual sensor for the inferential prediction of product quality from process variables. Comput. Chem. Eng. (2002)
- et al. A hybrid machine learning model to predict and visualize nitrate concentration throughout the Central Valley aquifer, California, USA. Sci. Total Environ. (2017)
- et al. Feature selection approaches for predictive modelling of groundwater nitrate pollution: an evaluation of filters, embedded and wrapper methods. Sci. Total Environ. (2018)
- Applying linear discriminant analysis to predict groundwater redox conditions conducive to denitrification. J. Hydrol.
- Spatial and temporal patterns in the frequency of events exceeding three times the median flow (FRE3) across New Zealand. J. Hydrol. (NZ)
- Random forests. Mach. Learn.
- Genetic doping algorithm (GenD): theory and applications. Expert Syst.
- Training with input selection and testing (TWIST) algorithm: a significant advance in pattern recognition performance of machine learning. J. Intell. Learn. Syst.
- Learnheuristics: hybridizing metaheuristics with machine learning for optimization with dynamic inputs. Open Mathemat.
- A coefficient of agreement for nominal scales. Educ. Psychol. Meas.
- Filters, wrappers and a boosting-based hybrid for feature selection
- Boosted trees for ecological modeling and prediction. Ecology
- Nitrate and phosphorous leaching in New Zealand: a national perspective. N. Z. J. Agric. Res.
- A working guide to boosted regression trees. J. Anim. Ecol.
- Mapping fractional soils and vegetation components from Hyperion satellite imagery using an unsupervised machine-learning workflow. Int. J. Digital Earth
- Big data bioinformatics. J. Cell Physiol.
1 Pacific Northwest National Laboratory, P.O. Box 999, Richland, WA 99352, United States.