Constructing a spatiotemporally coherent long-term PM2.5 concentration dataset over China during 1980–2019 using a machine learning approach
Introduction
Particulate matter is one of the major health-damaging components of the atmosphere, especially particles with aerodynamic diameters smaller than 2.5 μm (PM2.5). Long-term exposure to PM2.5 increases the risk of many adverse health outcomes, including respiratory and cardiovascular diseases, lung cancer and premature death (Crouse et al., 2012; Pope et al., 2002; Xing et al., 2016; Zhang et al., 2017). High concentrations of PM2.5 also reduce atmospheric visibility, disrupt public transportation, and thus adversely affect social and economic activities (Zhang et al., 2014). These aerosol particles also influence climate via aerosol-radiation and aerosol-cloud interactions (Boucher et al., 2013; Yang et al., 2020). Through long-range transport, the environmental and climatic impacts of aerosols emitted near major source regions can extend globally (Ren et al., 2020; Wang et al., 2014).
Aerosol concentrations in China have changed substantially in recent decades. Rapid industrial development and urbanization were primarily responsible for the increase in PM2.5 concentrations before 2010 (Yang et al., 2016; Cohen et al., 2017). From 2013 to 2017, PM pollution was alleviated, with the national average concentration reduced by one third, primarily owing to the implementation of clean air actions in China (Huang et al., 2018). In response to growing public health concern, many air quality monitoring stations have been established since 2013 to measure real-time PM2.5 concentrations. However, these measurements cover only a short period and are unevenly distributed in space (Wang et al., 2019; Zhao et al., 2020), so they are insufficient to describe the long-term characteristics of PM2.5 in China. Because the spatiotemporal variation of PM2.5 and its relationship with changes in emissions, weather and climate can improve the current understanding of pollution formation and give policy makers a scientific basis for air quality improvement, it is essential to produce a long-term dataset of gridded surface PM2.5 concentrations based on real observations in China.
To overcome the limited spatiotemporal coverage of surface PM2.5 observations, satellite remote sensing data have been widely used in recent years to estimate surface PM2.5 concentrations (Fang et al., 2016; Wei et al., 2019). In general, satellite-derived aerosol optical depth (AOD) is positively correlated with near-surface PM2.5 concentrations. Based on this relationship, a variety of statistical models, including multiple linear regression (Chelani, 2019), geographically weighted regression (Ma et al., 2014; Guo et al., 2017), linear mixed-effect models (Zheng et al., 2016), and two-stage models (Ma et al., 2016; Yao et al., 2019), have been applied to estimate PM2.5. In addition, machine learning has become a common tool for regression tasks owing to its computational efficiency and state-of-the-art performance (Stafoggia et al., 2019). Wei et al. (2019) produced PM2.5 concentrations at 1-km resolution in China for 2016 based on satellite AOD using the space-time random forest (STRF) model, with a cross-validation (CV) coefficient of determination (R2) of 0.85. Li et al. (2017) estimated PM2.5 in 2015 over China using a geo-intelligent deep learning model together with satellite AOD data, which raised the CV R2 from 0.42 to 0.88 relative to the traditional neural network method. However, these estimated PM2.5 data still have limitations. First, the Moderate Resolution Imaging Spectroradiometer (MODIS) data were not available until 1999 and the Suomi National Polar-orbiting Partnership (S-NPP) satellite was launched in 2011. Most of the studies mentioned above used AOD derived from these two satellites to predict PM2.5, and consequently the PM2.5 data are not available before 2000 (van Donkelaar et al., 2015; Xue et al., 2019).
Additionally, AOD represents aerosol loading in the entire atmospheric column, and its relationship with near-surface PM2.5 concentrations is strongly influenced by planetary boundary layer height, relative humidity, temperature, and other factors (Liu et al., 2009). Moreover, algorithm bias, signal uncertainty, and cloud contamination introduce biases into PM2.5 estimates derived from AOD (Stafoggia et al., 2019; Xiao et al., 2017).
Atmospheric visibility measurements, which have been available for several decades in China, have been shown to be a promising alternative for estimating near-surface PM2.5 concentrations (Shen et al., 2016). Li et al. (2020) derived PM2.5 concentrations over North China in 2014 by combining visibility observations with GEOS-Chem model simulations and reported that the estimated PM2.5 was highly correlated with surface observations in time and space, with correlation coefficients of 0.96 and 0.79, respectively. Liu et al. (2017) estimated historical (1957–1964 and 1973–2014) PM2.5 in China using visibility measurements and a statistical approach, and found that the model can accurately estimate PM2.5 concentrations, with a CV R2 of 0.71. They also reported an increasing trend of 1.9 μg/m3/decade averaged over China during 1957–2014. Because machine learning methods handle non-linear and complex relationships between variables better than traditional statistical approaches, they can also be used for visibility-based PM2.5 prediction. Using a machine learning model (Extreme Gradient Boosting), Gui et al. (2020) constructed surface PM2.5 concentrations over China in 2018 based on visibility and meteorological data, which demonstrated the potential of machine learning for reconstructing long-term PM2.5 data in China. Furthermore, in addition to visibility and meteorology, other factors such as emissions, topography, population and land use should be considered in the machine learning model to simulate PM2.5 concentrations and their spatiotemporal distributions.
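A per-decade trend such as the 1.9 μg/m3/decade reported by Liu et al. (2017) is commonly obtained as the slope of a linear fit to annual mean concentrations. The sketch below shows one way to compute such a slope with ordinary least squares. The function name and the synthetic series are illustrative assumptions, not values or methods taken from the cited studies.

```python
def trend_per_decade(years, values):
    """Ordinary least-squares slope of values against years, scaled to units per decade."""
    n = len(years)
    if n != len(values) or n < 2:
        raise ValueError("need two or more paired (year, value) samples")
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    if sxx == 0:
        raise ValueError("years must not all be identical")
    slope_per_year = sxy / sxx
    return 10.0 * slope_per_year


# Synthetic annual means (assumed, for illustration only): 0.2 units added per year.
years = list(range(1980, 1990))
annual_mean = [30.0 + 0.2 * (y - 1980) for y in years]
decadal_trend = trend_per_decade(years, annual_mean)
print(f"trend: {decadal_trend:.2f} per decade")
```

A plain least-squares slope is sensitive to outlier years; whether the cited studies used this estimator or a more robust one is not stated in the text above.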
In this study, we construct a gridded dataset of near-surface PM2.5 concentrations across China covering 1980–2019 using the STRF model along with atmospheric visibility and other auxiliary data (e.g., meteorology, anthropogenic emissions, land use, topography, population density and spatiotemporal information). These inputs cover a longer period and are more representative of near-surface aerosols than data based on satellite AOD. The performance of the STRF model in estimating PM2.5 in China is evaluated and the long-term variations of PM2.5 are characterized.
Datasets
We utilize existing hourly observed surface PM2.5 concentrations during recent years (2014–2019), long-term atmospheric visibility and auxiliary data (e.g., meteorological variables, anthropogenic emissions, land use, national population, topography, and geographic and time variables of observations). The sources and preprocessing of data are elaborated below.
Model performance and importance of input variables
Fig. 2a presents a density scatterplot of the fitting performance of the STRF model. The validation set comprises 372,596 daily surface PM2.5 observations across China during 2014–2018. The STRF model slightly underestimates PM2.5 concentrations, with a regression slope of 0.86. The values of R2, mean absolute error (MAE), root mean square error (RMSE) and mean relative error (MRE) are 0.95, 5.02 μg/m3, 8.92 μg/m3 and 12%, respectively, indicating good agreement between the estimated PM2.5 and surface observations.
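The statistics quoted above can be computed from paired observed and estimated values. The following is a minimal sketch under stated assumptions: R2 is taken as the squared Pearson correlation of the linear fit of estimates on observations, and MRE as the mean of |estimate - observation| / observation. The exact definitions used in the study are not given in this excerpt, and the function name and sample values are hypothetical.

```python
import math


def evaluation_metrics(observed, estimated):
    """Return slope, R2, MAE, RMSE and MRE for paired observed/estimated values."""
    n = len(observed)
    if n != len(estimated) or n < 2:
        raise ValueError("need two or more paired samples")
    mean_obs = sum(observed) / n
    mean_est = sum(estimated) / n
    errors = [e - o for o, e in zip(observed, estimated)]

    mae = sum(abs(d) for d in errors) / n
    rmse = math.sqrt(sum(d * d for d in errors) / n)
    # Assumed definition: relative error per sample; observations must be positive.
    mre = sum(abs(d) / o for d, o in zip(errors, observed)) / n

    sxx = sum((o - mean_obs) ** 2 for o in observed)
    syy = sum((e - mean_est) ** 2 for e in estimated)
    sxy = sum((o - mean_obs) * (e - mean_est) for o, e in zip(observed, estimated))
    slope = sxy / sxx              # least-squares slope of estimated on observed
    r2 = sxy ** 2 / (sxx * syy)    # assumed definition: squared Pearson correlation
    return {"slope": slope, "r2": r2, "mae": mae, "rmse": rmse, "mre": mre}


# Hypothetical daily values in μg/m3, for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
estimated = [12.0, 19.0, 33.0, 38.0]
metrics = evaluation_metrics(observed, estimated)
print(metrics)
```

A slope below 1 in such a fit, as reported for the STRF model, means that estimates grow more slowly than observations, which is consistent with underestimation at high concentrations.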
The
Conclusion and discussions
In this study, the STRF machine learning model is trained with the input of atmospheric visibility observations, meteorology, land use, topography, anthropogenic emissions, population, and relevant spatiotemporal information to construct a 1-degree gridded near-surface daily PM2.5 concentration dataset from 1980 to 2019. This spatiotemporally coherent historical PM2.5 dataset is useful to study the long-term aerosol variations over China.
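To illustrate what a 1-degree grid means in practice, the sketch below bins point values by latitude and longitude into 1-degree cells and averages them. This is only an assumed, simplified illustration: how the study maps station data and model output onto its grid is not described in this excerpt, and the function names and sample records are hypothetical.

```python
import math
from collections import defaultdict


def grid_cell(lat, lon, resolution=1.0):
    """Return the south-west corner (lat, lon) of the grid cell containing a point."""
    return (math.floor(lat / resolution) * resolution,
            math.floor(lon / resolution) * resolution)


def grid_mean(records, resolution=1.0):
    """Average (lat, lon, value) records that fall in the same grid cell."""
    sums = defaultdict(lambda: [0.0, 0])  # cell -> [running total, count]
    for lat, lon, value in records:
        cell = grid_cell(lat, lon, resolution)
        sums[cell][0] += value
        sums[cell][1] += 1
    return {cell: total / count for cell, (total, count) in sums.items()}


# Hypothetical (lat, lon, PM2.5) records, for illustration only.
records = [
    (39.9, 116.4, 80.0),
    (39.5, 116.9, 60.0),
    (31.2, 121.5, 40.0),
]
gridded = grid_mean(records)
print(gridded)
```

Simple binning ignores differences in cell area with latitude and leaves cells without stations empty, which is one reason a model-based estimate is needed for full coverage.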
The PM2.5 estimates are well correlated with near-surface
CRediT authorship contribution statement
Huimin Li: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft. Yang Yang: Conceptualization, Data curation, Formal analysis, Project administration, Supervision, Writing – review & editing. Hailong Wang: Formal analysis, Writing – review & editing. Baojie Li: Data curation. Pinya Wang: Data curation, Formal analysis. Jiandong Li: Data curation. Hong Liao: Writing – review & editing.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This study was supported by the National Natural Science Foundation of China (grant 41975159) and the National Key Research and Development Program of China (grants 2020YFA0607803 and 2019YFA0606800). HW acknowledges the support of the U.S. Department of Energy (DOE), Office of Science, Office of Biological and Environmental Research (BER), as part of the Earth and Environmental System Modeling program. The Pacific Northwest National Laboratory (PNNL) is operated for DOE by the Battelle Memorial
References (46)
- et al. Effects of dust storm on public health in desert fringe area: case study of northeast edge of Taklimakan Desert, China. Atmos. Pollut. Res. (2015)
- et al. Categorisation of air quality monitoring stations by evaluation of PM10 variability. Sci. Total Environ. (2015)
- Estimating PM2.5 concentration from satellite derived aerosol optical depth and meteorological variables using a combination model. Atmos. Pollut. Res. (2019)
- et al. Estimates and 25-year trends of the global burden of disease attributable to ambient air pollution: an analysis of data from the global burden of diseases study 2015. Lancet (2017)
- et al. Satellite-based ground PM2.5 estimation using timely structure adaptive modeling. Remote Sens. Environ. (2016)
- et al. Construction of a virtual PM2.5 observation network in China based on high-density surface meteorological observations using the Extreme Gradient Boosting model. Environ. Int. (2020)
- et al. Estimating ground-level PM2.5 concentrations in Beijing using a satellite-based geographically and temporally weighted regression model. Remote Sens. Environ. (2017)
- et al. Retrieval of surface PM2.5 mass concentrations over North China using visibility measurements and GEOS-Chem simulations. Atmos. Environ. (2020)
- et al. Exposure to particulate matter in India: a synthesis of findings and future directions. Environ. Res. (2016)
- et al. Retrieving historical ambient PM2.5 concentrations using existing visibility measurements in Xi’an, Northwest China. Atmos. Environ. (2016)
- Estimation of daily PM10 and PM2.5 concentrations in Italy, 2013–2015, using a spatiotemporal land-use random-forest model. Environ. Int.
- Estimation of PM2.5 concentrations in China using a spatial back propagation neural network. Sci. Rep.
- Estimating 1-km-resolution PM2.5 concentrations across China using the space-time random forest approach. Remote Sens. Environ.
- Full-coverage high-resolution daily PM2.5 estimation using MAIAC AOD in the Yangtze River Delta of China. Remote Sens. Environ.
- Spatiotemporal continuous estimates of PM2.5 concentrations in China, 2000–2016: a machine learning method with inputs from satellites, chemical transport model, and ground observations. Environ. Int.
- A spatially structured adaptive two-stage model for retrieving ground-level PM2.5 concentrations from VIIRS AOD in China. ISPRS J. Photogramm. Remote Sens.
- Estimating the daily PM2.5 concentration in the Beijing-Tianjin-Hebei region using a random forest model with a 0.01° × 0.01° spatial resolution. Environ. Int.
- Estimating ground-level PM2.5 concentrations over three megalopolises in China using satellite-derived aerosol optical depth measurements. Atmos. Environ.
- Risk of nonaccidental and cardiovascular mortality in relation to long-term exposure to low concentrations of fine particulate matter: a Canadian national-level cohort study. Environ. Health Perspect.
- Use of satellite observations for long-term exposure assessment of global concentrations of fine particulate matter. Environ. Health Perspect.