Constructing a spatiotemporally coherent long-term PM2.5 concentration dataset over China during 1980–2019 using a machine learning approach

https://doi.org/10.1016/j.scitotenv.2020.144263

Highlights

  • Long-term PM2.5 estimates are essential because surface observations have limited spatiotemporal coverage.

  • A machine learning model with visibility and many auxiliary data inputs is applied.

  • A 1-degree gridded daily PM2.5 dataset over China for 1980–2019 is constructed.

  • The model performs well with a high coefficient of determination and low bias.

  • The dataset is a promising tool for assessing aerosol-related impacts on the environment and climate.

Abstract

The lack of long-term observations and satellite retrievals of health-damaging fine particulate matter in China creates a need for estimates of historical PM2.5 (particulate matter less than 2.5 μm in diameter) concentrations. This study constructs a gridded near-surface PM2.5 concentration dataset across China covering 1980–2019 using the space-time random forest model with atmospheric visibility observations and other auxiliary data. The modeled daily PM2.5 concentrations are in excellent agreement with ground measurements, with a coefficient of determination of 0.95 and a mean relative error of 12%. Besides atmospheric visibility, which accounts for 30% of the total variable importance in the model, emissions and meteorological conditions are also key factors affecting PM2.5 predictions. From 1980 to 2014, the model-predicted PM2.5 concentrations increased continuously, with a maximum growth rate of 5–10 μg/m3/decade over eastern China. Owing to the clean air actions, PM2.5 concentrations decreased during 2014–2019 at a rate exceeding 50 μg/m3/decade in the North China Plain and 20–50 μg/m3/decade over many regions of China. The newly generated dataset of 1-degree gridded PM2.5 concentrations for the past 40 years across China provides a useful means for investigating interannual and decadal environmental and climate impacts related to aerosols.

Introduction

Particulate matter is one of the major health-damaging components of the atmosphere, especially particles with aerodynamic diameters smaller than 2.5 μm (PM2.5). Long-term exposure to PM2.5 can increase the risk of many adverse health outcomes, including respiratory and cardiovascular diseases, lung cancer and premature death (Crouse et al., 2012; Pope et al., 2002; Xing et al., 2016; Zhang et al., 2017). High concentrations of PM2.5 also reduce atmospheric visibility, disrupt public transportation, and thus adversely affect social and economic activities (Zhang et al., 2014). These aerosol particles also influence climate via aerosol-radiation and aerosol-cloud interactions (Boucher et al., 2013; Yang et al., 2020). Through long-range transport, the local environmental and climatic impacts of aerosols near major source regions can be extended globally (Ren et al., 2020; Wang et al., 2014).

Aerosol concentrations in China have changed greatly in recent decades. Rapid industrial development and urbanization were primarily responsible for the increasing tendency of PM2.5 concentrations before 2010 (Yang et al., 2016; Cohen et al., 2017). From 2013 to 2017, PM pollution was alleviated, with the national average concentration reduced by one third, primarily owing to the implementation of clean air actions in China (Huang et al., 2018). Following growing public health concern, many air quality monitoring stations have been established since 2013 to measure real-time PM2.5 concentrations. However, the measurements cover only a short period and are unevenly distributed in space (Wang et al., 2019; Zhao et al., 2020), so they are insufficient to describe the long-term characteristics of PM2.5 in China. Because the spatiotemporal variation of PM2.5 and its relationship with changes in emissions, weather and climate can improve the current understanding of pollution formation and give policy makers a scientific basis for air quality improvement, it is essential to produce a long-term dataset of gridded surface PM2.5 concentrations based on real observed data in China.

To overcome the spatiotemporal coverage deficiency of surface PM2.5 observations, satellite remote sensing data have recently been widely used to estimate surface PM2.5 concentrations (Fang et al., 2016; Wei et al., 2019). In general, aerosol optical depth (AOD) derived from satellites has a positive correlation with near-surface PM2.5 concentrations. Based on this, a variety of statistical models, including multiple linear regression (Chelani, 2019), geographically weighted regression (Ma et al., 2014; Guo et al., 2017), the linear mixed-effect model (Zheng et al., 2016), and the two-stage model (Ma et al., 2016; Yao et al., 2019), have been applied to assess PM2.5. In addition, machine learning has become a common tool for regression tasks owing to its computational efficiency and state-of-the-art performance (Stafoggia et al., 2019). Wei et al. (2019) produced PM2.5 concentrations at 1-km resolution in China for 2016 based on satellite AOD using the space-time random forest (STRF) model, with a cross-validation (CV) coefficient of determination (R2) of 0.85. Li et al. (2017) estimated PM2.5 in 2015 over China using a geo-intelligent deep learning model together with satellite AOD data, in which the CV R2 increases from 0.42 to 0.88 relative to the traditional neural network method. However, these estimated PM2.5 data still have limitations. First, the Moderate Resolution Imaging Spectroradiometer (MODIS) data were not available until 1999 and the Suomi National Polar-orbiting Partnership (S-NPP) satellite was launched in 2011. Most of the studies mentioned above used AOD derived from these two satellites to predict PM2.5, and consequently the PM2.5 data are not available before 2000 (van Donkelaar et al., 2015; Xue et al., 2019). Additionally, AOD represents aerosol loading in the entire atmospheric column, and its relationship with near-surface PM2.5 concentrations is strongly influenced by planetary boundary layer height, relative humidity, temperature, and other factors (Liu et al., 2009). Moreover, algorithm bias, signal uncertainty, and cloud contamination introduce biases into the PM2.5 estimation from AOD (Stafoggia et al., 2019; Xiao et al., 2017).

Atmospheric visibility measurements, which have been available for several decades in China, have been demonstrated to be a promising alternative for estimating near-surface PM2.5 concentrations (Shen et al., 2016). Li et al. (2020) derived PM2.5 concentrations over North China in 2014 using a combination of visibility observations and GEOS-Chem model simulations and reported that the estimated PM2.5 was highly correlated with surface observations in time and space, with correlation coefficients of 0.96 and 0.79, respectively. Liu et al. (2017) estimated historical (1957–1964 and 1973–2014) PM2.5 in China using visibility measurements and a statistical approach, and found that the model can accurately estimate PM2.5 concentrations with a CV R2 of 0.71. They also reported an increasing trend of 1.9 μg/m3/decade averaged over China during 1957–2014. Because machine learning methods handle non-linear and complex relationships between variables better than traditional statistical approaches, they can also be used for visibility-based PM2.5 prediction. Using a machine learning model (Extreme Gradient Boosting), Gui et al. (2020) constructed surface PM2.5 concentrations in 2018 over China based on visibility and meteorological data, which showed the potential for reconstructing long-term PM2.5 data in China with a machine learning method. Furthermore, in addition to visibility and meteorology, other factors such as emissions, topography, population and land use should be considered in the machine learning model to simulate PM2.5 concentrations and their spatiotemporal distributions.

In this study, we construct a gridded dataset of near-surface PM2.5 concentrations across China covering 1980–2019 using the STRF model along with atmospheric visibility and other auxiliary data (e.g., meteorology, anthropogenic emissions, land use, topography, population density and spatiotemporal information), which have a longer time coverage and are more representative of the near-surface aerosols than the data based on satellite AOD. The performance of the STRF model in estimating PM2.5 in China is evaluated and the long-term variations of PM2.5 are characterized.

Datasets

We utilize existing hourly observed surface PM2.5 concentrations during recent years (2014–2019), long-term atmospheric visibility and auxiliary data (e.g., meteorological variables, anthropogenic emissions, land use, national population, topography, and geographic and time variables of observations). The sources and preprocessing of data are elaborated below.

Model performance and importance of input variables

Fig. 2a presents the density scatterplot of the fitting performance of the STRF model. The validation set comprises 372,596 daily surface PM2.5 observations across China during 2014–2018. The STRF model slightly underestimates the PM2.5 concentrations, with a regression slope of 0.86. The coefficient of determination (R2), mean absolute error (MAE), root mean square error (RMSE) and mean relative error (MRE) are 0.95, 5.02 μg/m3, 8.92 μg/m3 and 12%, respectively, indicating good agreement between the estimated PM2.5 and surface observations.
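As an illustration of how such evaluation statistics can be computed from paired daily values, a minimal Python sketch follows. It is not the authors' code: the function name `evaluation_metrics` is hypothetical, R2 is assumed to be the squared Pearson correlation of a linear fit of estimated on observed values, and MRE is assumed to be the mean of |estimated − observed| / observed. The paper may define these quantities differently.

```python
import math


def evaluation_metrics(observed, estimated):
    """Return slope, R2, MAE, RMSE and MRE for paired observed/estimated values.

    Assumptions (not taken from the paper): the slope is from a least-squares
    fit of estimated on observed, R2 is the squared Pearson correlation, and
    MRE is the mean of |estimated - observed| / observed.
    """
    n = len(observed)
    if n != len(estimated) or n < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    if any(o <= 0 for o in observed):
        raise ValueError("observed values must be positive to compute MRE")

    mean_obs = sum(observed) / n
    mean_est = sum(estimated) / n

    # Sums of squares and cross-products around the sample means.
    s_oo = sum((o - mean_obs) ** 2 for o in observed)
    s_ee = sum((e - mean_est) ** 2 for e in estimated)
    s_oe = sum((o - mean_obs) * (e - mean_est)
               for o, e in zip(observed, estimated))
    if s_oo == 0 or s_ee == 0:
        raise ValueError("both samples must vary to compute slope and R2")

    errors = [e - o for o, e in zip(observed, estimated)]
    return {
        "slope": s_oe / s_oo,
        "r2": s_oe ** 2 / (s_oo * s_ee),
        "mae": sum(abs(d) for d in errors) / n,
        "rmse": math.sqrt(sum(d * d for d in errors) / n),
        "mre": sum(abs(d) / o for d, o in zip(errors, observed)) / n,
    }


if __name__ == "__main__":
    # Synthetic values in ug/m3, for illustration only.
    obs = [10.0, 20.0, 30.0, 40.0]
    est = [9.0, 18.0, 27.0, 36.0]
    print(evaluation_metrics(obs, est))
```

A slope below 1 together with a high R2, as in the synthetic example, corresponds to the situation described above, where estimates track observations closely but are systematically lower.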

The

Conclusion and discussions

In this study, the STRF machine learning model is trained with atmospheric visibility observations, meteorology, land use, topography, anthropogenic emissions, population, and relevant spatiotemporal information as inputs to construct a 1-degree gridded near-surface daily PM2.5 concentration dataset from 1980 to 2019. This spatiotemporally coherent historical PM2.5 dataset is useful for studying long-term aerosol variations over China.
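Long-term variations in a dataset of this kind are commonly summarized as a linear trend in μg/m3/decade. The sketch below shows one way to do this for a single grid cell, assuming an ordinary least-squares fit to annual mean concentrations. The function name `trend_per_decade` and the input values are hypothetical, and the paper may compute its trends differently.

```python
def trend_per_decade(years, annual_means):
    """Least-squares linear trend of annual means, in concentration units per decade.

    Assumption (not taken from the paper): the trend is the ordinary
    least-squares slope of annual mean concentration against year,
    multiplied by 10.
    """
    n = len(years)
    if n != len(annual_means) or n < 2:
        raise ValueError("need at least 2 paired (year, annual mean) values")

    mean_year = sum(years) / n
    mean_conc = sum(annual_means) / n

    # Slope = covariance(year, concentration) / variance(year).
    s_yy = sum((y - mean_year) ** 2 for y in years)
    if s_yy == 0:
        raise ValueError("years must not all be identical")
    s_yc = sum((y - mean_year) * (c - mean_conc)
               for y, c in zip(years, annual_means))

    return 10.0 * s_yc / s_yy


if __name__ == "__main__":
    # Synthetic annual means (ug/m3) for one grid cell, for illustration only.
    years = [2014, 2015, 2016, 2017, 2018, 2019]
    conc = [80.0, 75.0, 70.0, 65.0, 60.0, 55.0]
    print(trend_per_decade(years, conc))
```

The synthetic series falls by 5 μg/m3 per year, so the function reports a trend of about −50 μg/m3/decade, the same order as the post-2014 decline reported for the North China Plain.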

The PM2.5 estimates are well correlated with near-surface

CRediT authorship contribution statement

Huimin Li: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft. Yang Yang: Conceptualization, Data curation, Formal analysis, Project administration, Supervision, Writing – review & editing. Hailong Wang: Formal analysis, Writing – review & editing. Baojie Li: Data curation. Pinya Wang: Data curation, Formal analysis. Jiandong Li: Data curation. Hong Liao: Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (grant 41975159) and the National Key Research and Development Program of China (grants 2020YFA0607803 and 2019YFA0606800). HW acknowledges the support by the U.S. Department of Energy (DOE), Office of Science, Office of Biological and Environmental Research (BER), as part of the Earth and Environmental System Modeling program. The Pacific Northwest National Laboratory (PNNL) is operated for DOE by the Battelle Memorial

References (46)

  • M. Stafoggia et al., 2019. Estimation of daily PM10 and PM2.5 concentrations in Italy, 2013–2015, using a spatiotemporal land-use random-forest model. Environ. Int.
  • W. Wang et al., 2019. Estimation of PM2.5 concentrations in China using a spatial back propagation neural network. Sci. Rep.
  • J. Wei et al., 2019. Estimating 1-km-resolution PM2.5 concentrations across China using the space-time random forest approach. Remote Sens. Environ.
  • Q. Xiao et al., 2017. Full-coverage high-resolution daily PM2.5 estimation using MAIAC AOD in the Yangtze River Delta of China. Remote Sens. Environ.
  • T. Xue et al., 2019. Spatiotemporal continuous estimates of PM2.5 concentrations in China, 2000–2016: a machine learning method with inputs from satellites, chemical transport model, and ground observations. Environ. Int.
  • F. Yao et al., 2019. A spatially structured adaptive two-stage model for retrieving ground-level PM2.5 concentrations from VIIRS AOD in China. ISPRS J. Photogramm. Remote Sens.
  • C. Zhao et al., 2020. Estimating the daily PM2.5 concentration in the Beijing-Tianjin-Hebei region using a random forest model with a 0.01° × 0.01° spatial resolution. Environ. Int.
  • Y. Zheng et al., 2016. Estimating ground-level PM2.5 concentrations over three megalopolises in China using satellite-derived aerosol optical depth measurements. Atmos. Environ.
  • Boucher, O., Randall, D., Artaxo, P., Bretherton, C., Feingold, G., Forster, P., Kerminen, V.M., Kondo, Y., Liao, H.,...
  • Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32....
  • CMA, 2014. Forecasting and Networking Department of China Meteorological Administration released letter No.4: Notice on...
  • D.L. Crouse et al., 2012. Risk of nonaccidental and cardiovascular mortality in relation to long-term exposure to low concentrations of fine particulate matter: a Canadian national-level cohort study. Environ. Health Perspect.
  • A. van Donkelaar et al., 2015. Use of satellite observations for long-term exposure assessment of global concentrations of fine particulate matter. Environ. Health Perspect.