Elsevier

Journal of Hydrology

Volume 580, January 2020, 124200
Research papers
Comparison of four learning-based methods for predicting groundwater redox status

https://doi.org/10.1016/j.jhydrol.2019.124200

Highlights

  • Supervised learning methods (LDA, BRT, RF) did not generalize well to independent data.

  • Unsupervised learning method (MSOM) generalized to independent data.

  • MSOM redox and depth predictions used to make 3D maps of anoxic probability.

  • Regional redox sequence (oxic-mixed-anoxic) indicates predominantly vertical recharge.

  • Local redox sequences diverse, indicating heterogeneity of flow path orientation and electron donors.

Abstract

Knowing the location where groundwater denitrification occurs, or by proxy the groundwater redox status (oxic, mixed, and anoxic), is valuable information for assessing and managing potential agricultural land-use impacts on freshwater quality. We compare the efficacy of supervised (Linear Discriminant Analysis, LDA; Boosted Regression Trees, BRT; and Random Forest, RF) and unsupervised (Modified Self-Organizing Map, MSOM) learning-based methods to predict groundwater redox status in the agriculturally dominated Tasman, Waikato, and Wellington regions of New Zealand. Thresholds applied to regional groundwater-quality samples provide redox status variables, and learn heuristics constrained by these variables and applied to spatial factors (climate, elevation, geology, hydrology, soils, and well depth) identify optimal sets of regional predictor variables. A split-sample approach is used to train and test the learning methods' ability to predict redox status using the optimal predictor variables. Overall, the supervised methods demonstrate a prediction bias toward oxic conditions and an inability to perform statistically well when using independent regional data; for example, consider kappa statistics for BRT (Tasman: 0.42, Waikato: 0.38, Wellington: 0.17), RF (Tasman: 0.42, Waikato: 0.47, Wellington: 0.17), and LDA (Tasman: 0.46, Waikato: 0.32, Wellington: 0.17). By contrast, the unsupervised method performs statistically well when predicting oxic, mixed, and anoxic conditions and corresponding depths when using independent regional data; for example, consider MSOM kappa statistics for Tasman: 0.78, Waikato: 0.80, Wellington: 0.76.
The unsupervised learning method provides the added benefits of being (1) able to combine predictions into 3D regional anoxic probability plots for interpreting the spatial influence of paleosols and groundwater flowpaths on redox status, and (2) readily extended to map 3D redox status across New Zealand and other countries despite data bias and sparsity.
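
The kappa statistics quoted above measure agreement between observed and predicted classes after correcting for agreement expected by chance (Cohen, 1960). The following minimal sketch shows how such a value could be computed and why a model biased toward the majority class scores poorly. The function name `cohen_kappa` and the ten sample labels are hypothetical illustrations, not data or code from this study.

```python
from collections import Counter


def cohen_kappa(observed, predicted):
    """Cohen's kappa: agreement between two label sequences, corrected for chance."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    # Observed agreement: fraction of samples whose labels match.
    p_o = sum(o == p for o, p in zip(observed, predicted)) / n
    # Chance agreement: product of the marginal class counts, summed over classes.
    obs_counts = Counter(observed)
    pred_counts = Counter(predicted)
    p_e = sum(obs_counts[c] * pred_counts[c] for c in obs_counts) / (n * n)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when only one class ever occurs
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical, oxic-dominated test set (illustrative labels only).
observed = ["oxic"] * 6 + ["mixed"] * 2 + ["anoxic"] * 2
all_oxic = ["oxic"] * 10  # a model that always predicts the majority class

print(cohen_kappa(observed, all_oxic))  # 60% of labels match, yet kappa is 0
print(cohen_kappa(observed, observed))  # perfect agreement gives kappa of 1
```

In this toy case the always-oxic predictor matches 60% of the labels but earns a kappa of zero, which is why kappa is a more informative score than raw accuracy when the samples are dominated by one redox class.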

Introduction

The sustainability of New Zealand’s freshwater resource is facing increasing pressure from agricultural nitrate leaching (Ministry for the Environment, 2007), irrigation demand, and current and future climate change effects (Robertson et al., 2016). The integrity of groundwater quality is of increasing concern because about 40% of the population depends on groundwater for drinking water supply (Ministry for the Environment, 2007), and nutrient-rich baseflow contributions are impacting the health of lowland streams (Ministry for the Environment, 2007). To address the deterioration of freshwater quality, regulators are required to establish nitrate leaching and water-quality limits by 2025 (Ministry for the Environment, 2007). To sustain agricultural production, these limits need to account for attenuation that occurs along groundwater flow paths. Knowing the location where groundwater denitrification occurs, or by proxy the groundwater redox status, thus forms an important link between agricultural leaching sources and water-quality objectives.

The best tools currently available for mapping the redox status for groundwater denitrification are predictive models (Koch et al., 2019). As the complexity of real-world groundwater systems increases from catchment to regional or national scales, it becomes difficult and often impractical to make spatial predictions based on process-based models. Learning-based modeling is an alternative approach to predict the distribution of groundwater redox status based solely on the analysis of available measurements. This approach is possible because learning-based models build relationships between state variables (input, internal and output variables) using a limited number of assumptions about the physical behaviour of the system (Solomatine et al., 2009). That said, the development of a learning-based model is often challenging because failure can occur at any one of several model-building steps: choice of response variables, choice of predictor variables, choice of model architecture, choice of model structure and complexity, choice of model parameters, model training, testing and validation, prediction, and uncertainty quantification. Ultimately, the model performance using this approach is limited by the quality of available data.

Learning-based groundwater-quality models are grouped based on the type of problem being solved, such as regression or classification (Solomatine et al., 2009). Regression problems typically involve predicting a single response variable, such as nitrate concentration, based on learning a function that maps inputs to outputs. Classification is a special form of learning-based modeling in which the problem involves identifying the sub-population to which a new observation belongs, such as redox status (oxic, mixed, or anoxic). Early groundwater studies mainly used linear models, such as logistic regression, to generate probabilistic maps of diffuse nitrate contamination (Nolan et al., 2002, Gurdak and Qi, 2012) and depths to the oxic-suboxic interface (Tesoriero et al., 2015). Spatial predictions of groundwater redox status using Linear Discriminant Analysis (LDA) have previously been made by Lee et al. (2008), Close et al. (2016), and Wilson et al. (2018). These methods tacitly assume that redox status can be modelled as a linear combination of characteristics whose class samples are continuous (not missing) and normally distributed (Martinez and Kak, 2001). These assumptions pose limitations when attempting to build models for predicting 3D regional redox conditions with data that are biased (type, frequency, and spatial sampling), disparate (different physics), and sparse (missing samples). Although linear learning-based modeling is used to study various aspects of groundwater systems, the linkages and interactions among climate, hydrological and biogeochemical cycles across spatiotemporal scales more appropriately favour nonlinear learning-based modeling.

Nonlinear learning-based modeling includes supervised, unsupervised and hybrid machine-learning (ML) algorithms (Green et al., 2016). The methods associated with these algorithms are known to fit nonlinear relationships while accommodating missing data and interactions among the different predictor variables. Supervised ML algorithms analyze training data and produce an inferred function which can be used for mapping new examples. Unsupervised ML algorithms build models by deducing structures present in the input data; this process may be used to extract general rules, reduce redundancy, or organize data by similarity.

Recent applications of supervised ML algorithms involve testing the efficacy of the random forest regression method (Koch et al., 2019) to model the depth of the redox interface across Denmark, and of the Artificial Neural Network (ANN), Bayesian Network (BN), and Boosted Regression Trees (BRT) methods to predict nitrate concentrations in groundwater of the Central Valley, California (Nolan et al., 2015). In the former and latter studies, the cross-validation results gave R² values for independent validation tests of 0.48 and 0.25, respectively. These results suggest that the models did not generalize well to independent data, possibly due to overfitting given the relatively large number of predictor variables. Also noteworthy in the earlier study is that generalizing the models to holdout (independent) data resulted in a bias, with the model overpredicting low concentrations and underpredicting high concentrations. This phenomenon reveals one of the potential challenges when using learning-based models in the presence of sample frequency bias.

Another potential reason for the poor predictive performance of learning-based models is a high degree of correlation among pairs of predictor variables (Guyon and Elisseeff, 2003). In the case of perfectly correlated predictor variables, increases in one variable are offset by corresponding decreases in the second variable, with no effect on the training response but with an increase in the associated prediction uncertainty (Low et al., 2013). These findings underscore the need for identifying an optimal set of predictor variables through some quantitative feature selection process (Singh et al., 2014). Feature selection algorithms fall into two categories: the filter model and the wrapper model (Das, 2001, Kohavi and John, 1997).
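
The perfectly correlated case described above can be illustrated with a short worked equation. As a simplifying assumption made only for this illustration, consider a linear model with two predictors, x_1 and x_2, that are identical; the symbols are hypothetical and do not come from the cited studies.

```latex
\begin{align}
\hat{y} &= \beta_0 + \beta_1 x_1 + \beta_2 x_2, \qquad x_2 = x_1 \\
        &= \beta_0 + (\beta_1 + \beta_2)\, x_1 \\
        &= \beta_0 + (\beta_1 + \delta)\, x_1 + (\beta_2 - \delta)\, x_2
           \quad \text{for any } \delta .
\end{align}
```

Every choice of δ reproduces the training response exactly, so the individual coefficients are not identifiable from the data, and predictions become uncertain wherever the two variables cease to move together.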

The filter model relies on general characteristics of the training data to select some features without involving any learning algorithm. For example, Povak et al. (2014) selectively removed one variable from each pair of linearly correlated predictor variables characterized by strong Pearson coefficients (ρ > 0.77). Reducing the number of collinear predictor variables (from 50 to 33) increased the cross-validation statistics for both BRT and Random Forest (RF) models (R² > 0.85), suggesting that the resulting models generalized well to independent data. Other investigators applied a form of backward stepwise regression to BRT and RF models that begins with a full model and at each step eliminates variables to find a reduced model that best explains the data (Ransom et al., 2017, Nolan et al., 2018). The removal of unimportant variables is necessary for building robust models that generalize to independent data, but this stepwise process is not likely to be reliable when it relies on the relative importance reported by the BRT and RF methods. The reason is that the relative importance is reported as normalized values. In models characterized by strong correlations among predictor variables, the correlated variables split importance, thereby reducing their apparent influence in the prediction process and giving a false impression of their true ranked importance.
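
A correlation-based filter of the kind described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the rule for choosing which member of a correlated pair to drop (here, the later-listed one), the function names `pearson` and `correlation_filter`, and the six-sample predictor table are all hypothetical, and only the 0.77 threshold is taken from the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: correlation undefined, treated as uncorrelated
    return sxy / sqrt(sxx * syy)


def correlation_filter(columns, threshold=0.77):
    """Keep a predictor only if |r| with every already-kept predictor is <= threshold.

    Assumption: when a pair is strongly correlated, the later-listed variable is dropped.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical predictor table (illustrative values only).
predictors = {
    "elevation": [10, 50, 120, 300, 450, 800],
    "temperature": [15.0, 14.6, 14.1, 12.9, 12.0, 9.8],  # falls almost linearly with elevation
    "well_depth": [30, 5, 60, 12, 45, 8],
}

print(correlation_filter(predictors))
```

No learning algorithm is involved at any point, which is what distinguishes this filter from the wrapper model discussed next.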

The wrapper model requires one predetermined learning algorithm in feature selection and uses its performance to evaluate and determine which features are selected (Yu and Liu, 2003, Calvet et al., 2017). The benefits of nonlinear feature selection in groundwater-quality modeling were demonstrated by Friedel and Buscema (2016) using Learn Heuristics (Metaheuristics in Machine Learning) and by Rodriguez-Galiano et al. (2018) using filter, embedded, and wrapper methods. These studies demonstrated that overfitting can be reduced by identifying an optimal number of predictor variables, in accordance with the principle of Occam’s razor: unnecessarily complex models should not be preferred to simpler ones (Encyclopaedia Britannica, 2010).
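
The wrapper idea described above can be sketched as a greedy forward search in which a learner's own test score decides which features to keep. The nearest-centroid learner, the leave-one-out scoring, and the eight-sample dataset below are stand-ins chosen for brevity; they are assumptions for illustration and are not the learn heuristics used in the cited studies.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Predict the class whose mean feature vector is closest to x."""
    sums, counts = {}, {}
    for row, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    best_label, best_dist = None, float("inf")
    for label in sorted(sums):
        centroid = [s / counts[label] for s in sums[label]]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of the learner using only the given feature indices."""
    hits = 0
    for i in range(len(X)):
        train_X = [[row[j] for j in features] for k, row in enumerate(X) if k != i]
        train_y = [lab for k, lab in enumerate(y) if k != i]
        x = [X[i][j] for j in features]
        hits += nearest_centroid_predict(train_X, train_y, x) == y[i]
    return hits / len(X)


def forward_select(X, y):
    """Greedy wrapper: add the feature that most improves the score; stop when none does."""
    selected, best = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        # Ties are broken in favour of the lowest feature index.
        score, neg_j = max((loo_accuracy(X, y, selected + [j]), -j) for j in remaining)
        if score <= best:
            break
        selected.append(-neg_j)
        remaining.remove(-neg_j)
        best = score
    return selected, best


# Hypothetical samples: feature 0 separates the classes, feature 1 is noise.
X = [[1.0, 5.0], [1.2, 0.0], [0.9, 9.0], [1.1, 3.0],
     [3.0, 4.0], [3.2, 8.0], [2.9, 1.0], [3.1, 6.0]]
y = ["oxic"] * 4 + ["anoxic"] * 4

print(forward_select(X, y))
```

Because the search stops as soon as an extra feature fails to improve the held-out score, the uninformative feature is never added, which is the Occam's razor behaviour the paragraph above describes.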

The aim of this study is to predict groundwater redox status across selected agricultural regions of New Zealand. We hypothesize that the redox status classified from groundwater chemistry sampled across regional monitoring networks can be combined with well depths and nationally available climate, geology, hydrology, soils, and topography coverages to provide mutual information (measure of entropy describing mutual dependence among random variables) suitable for learning-based model building and redox class prediction. To test this hypothesis, we evaluate supervised (LDA, BRT, and RF) and unsupervised (Modified Self-Organizing Map, MSOM) learning-based methods. The objectives are to compare the performance of these methods for predicting the probability of groundwater redox status (oxic, mixed, anoxic) and associated depths in groundwater systems of the Tasman, Waikato, and Wellington regions. This study extends the work of Close et al. (2016) and Wilson et al. (2018), who, because of the single-dependent model response variable restriction when using LDA, developed separate models for classifying redox status over shallow and deep zones. In addition to evaluating supervised learning-based methods, we develop an innovative unsupervised learning-based workflow to simultaneously predict four response functions: oxic, mixed, anoxic, and depth. In this approach, the relative benefit of mutual information content in the predictor variables can be evaluated by comparing the probability of predicted redox conditions. In information theory, this concept reflects the mutual pairwise dependence among random variables.
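
For readers unfamiliar with mutual information, the following minimal sketch computes it for two discrete variables from paired samples, using the standard definition I(X;Y) = Σ p(x,y) log2[p(x,y) / (p(x) p(y))]. The function name `mutual_information` and the four-sample depth and redox labels are hypothetical illustrations; the study's own workflow is not reproduced here.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information I(X;Y), in bits, between two paired discrete sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) / (p(x) p(y)) simplifies to count * n / (count_x * count_y).
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


# Hypothetical paired samples (illustrative labels only).
depth = ["shallow", "shallow", "deep", "deep"]
redox_dependent = ["oxic", "oxic", "anoxic", "anoxic"]    # fully determined by depth
redox_independent = ["oxic", "anoxic", "oxic", "anoxic"]  # unrelated to depth

print(mutual_information(depth, redox_dependent))    # one full bit is shared
print(mutual_information(depth, redox_independent))  # nothing is shared
```

A predictor that shares more information with the redox class is, in this sense, more useful for class prediction than one that shares none.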

Section snippets

Sources of water-quality data

The Tasman, Waikato and Wellington regions of New Zealand were selected to test and compare learning-based methods for predicting groundwater redox status (Fig. 1). Groundwater resources in these regions are considered nationally significant, and the regional councils collect water-quality data from many monitoring wells. Tasman District is in the northern part of the South Island and covers an area of 9650 km². The main groundwater resources in the district lie within the alluvial terraces and coastal

Response variables

Regional groundwater sampling data are inherently biased for both the response variable (redox status) and associated predictor variables (spatial attributes). For example, most samples in the groundwater dataset are oxic (Table 2). This bias in redox status is attributed to the increased frequency of sampling in areas where greater water demand is concomitant with oxic conditions, and/or the increased frequency of shallow sampling across the New Zealand landscape dominated by oxic groundwater (

Conclusions

Results from the unsupervised learning-based method (MSOM) are statistically superior to the three commonly used supervised learning-based methods (LDA, BRT, and RF). The MSOM can simultaneously predict oxic, mixed, and anoxic redox status and their associated depths (four response functions) at unstructured grid locations associated with the predictor variables. The statistical robustness of redox predictions is attributed to using learn heuristics for identifying optimal sets of regional

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We thank staff at the Waikato Regional Council, Greater Wellington Regional Council, and Tasman District Council for providing sample data, and the Institute of Geological and Nuclear Sciences Limited (Rogier Westerhoff), National Institute of Water and Atmospheric Research (Roddy Henderson), and Landcare Research (James Barringer) for providing spatial attribute data. Funding for this project was from the New Zealand Ministry of Business, Innovation and Employment as a component of the National Science

References (59)

  • S. Wilson et al. Applying linear discriminant analysis to predict groundwater redox conditions conducive to denitrification. J. Hydrol. (2018)
  • D.J. Booker. Spatial and temporal patterns in the frequency of events exceeding three times the median flow (FRE3) across New Zealand. J. Hydrol. (NZ) (2013)
  • Booker, D.J, 2015. Hydrological Indices for National Environmental Reporting. NIWA report prepared for Ministry for the...
  • L. Breiman. Random forests. Mach. Learn. (2001)
  • M. Buscema. Genetic doping algorithm (GenD): theory and applications. Expert Syst. (2004)
  • M. Buscema et al. Training with input selection and testing (TWIST) algorithm: a significant advance in pattern recognition performance of machine learning. J. Intell. Learn. Syst. (2013)
  • L. Calvet et al. Learnheuristics: hybridizing metaheuristics with machine learning for optimization with dynamic inputs. Open Mathemat. (2017)
  • J. Cohen. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. (1960)
  • S. Das. Filters, wrappers and a boosting-based hybrid for feature selection
  • G. De'ath. Boosted trees for ecological modeling and prediction. Ecology (2007)
  • Dietterich, T.G., 2000. Ensemble Methods in Machine Learning, Proceedings of the First International Workshop on...
  • J.R. Dymond et al. Nitrate and phosphorous leaching in New Zealand: a national perspective. N. Z. J. Agric. Res. (2013)
  • Efron B., Tibshirani, R.J., 1993. An introduction to the bootstrap. In: Monographs on statistics and applied...
  • J. Elith et al. A working guide to boosted regression trees. J. Anim. Ecol. (2008)
  • Encyclopædia Britannica. Encyclopædia Britannica Online. 2010. “Ockham's razor”. Archived from the original on 23...
  • Friedel, M.J., Buscema, M., 2016. Aquatic ecosystem modeling under natural and anthropogenic stresses: using an...
  • M.J. Friedel et al. Mapping fractional soils and vegetation components from Hyperion satellite imagery using an unsupervised machine-learning workflow. Int. J. Digital Earth (2017)
  • Geographx 2012. NZ 8m DEM. Available from...
  • C.S. Green et al. Big data bioinformatics. J. Cell Physiol. (2016)
    1

    Pacific Northwest National Laboratory, P.O. Box 999, Richland, WA 99352, United States.
