Constructing a spatiotemporally coherent long-term PM_2.5 concentration dataset over China during 1980–2019 using a machine learning approach

doi:10.1016/j.scitotenv.2020.144263

Science of The Total Environment

Volume 765, 15 April 2021, 144263

https://doi.org/10.1016/j.scitotenv.2020.144263

Highlights

• A long-term PM_2.5 dataset is needed because surface observations have limited coverage.
• A machine learning model with visibility and many auxiliary data inputs is applied.
• A 1-degree gridded daily PM_2.5 dataset over China for 1980–2019 is constructed.
• The model performs well with a high coefficient of determination and low bias.
• The dataset is a promising tool for assessing related impacts on environment and climate.

Abstract

The lack of long-term observations and satellite retrievals of health-damaging fine particulate matter in China calls for estimates of historical PM_2.5 (particulate matter less than 2.5 μm in diameter) concentrations. This study constructs a gridded near-surface PM_2.5 concentration dataset across China covering 1980–2019 using the space-time random forest model with atmospheric visibility observations and other auxiliary data. The modeled daily PM_2.5 concentrations are in excellent agreement with ground measurements, with a coefficient of determination of 0.95 and a mean relative error of 12%. Atmospheric visibility accounts for 30% of the total variable importance in the model, and emissions and meteorological conditions are also key factors affecting the PM_2.5 predictions. From 1980 to 2014, the model-predicted PM_2.5 concentrations increased continuously, with a maximum growth rate of 5–10 μg/m³/decade over eastern China. Owing to the clean air actions, PM_2.5 concentrations decreased during 2014–2019 at a rate of more than 50 μg/m³/decade in the North China Plain and 20–50 μg/m³/decade over many other regions of China. The newly generated dataset of 1-degree gridded PM_2.5 concentrations for the past 40 years across China provides a useful means for investigating interannual and decadal environmental and climate impacts related to aerosols.

Introduction

Particulate matter is one of the major health-damaging components of the atmosphere, especially particles with aerodynamic diameters smaller than 2.5 μm (PM_2.5). Long-term exposure to PM_2.5 can increase the risk of many adverse health outcomes, including respiratory and cardiovascular diseases, lung cancer and premature death (Crouse et al., 2012; Pope et al., 2002; Xing et al., 2016; Zhang et al., 2017). High concentrations of PM_2.5 also reduce atmospheric visibility, disrupt public transportation, and thus adversely affect social and economic activities (Zhang et al., 2014). These aerosol particles also influence climate via aerosol-radiation and aerosol-cloud interactions (Boucher et al., 2013; Yang et al., 2020). Through long-range transport, the local environmental and climatic impacts of aerosols near major source regions can be extended globally (Ren et al., 2020; Wang et al., 2014).

Aerosol concentrations in China have changed greatly in recent decades. Rapid industrial development and urbanization were primarily responsible for the increasing tendency of PM_2.5 concentrations before 2010 (Yang et al., 2016; Cohen et al., 2017). From 2013 to 2017, PM pollution was alleviated, with the national average concentration reduced by one third, primarily owing to the implementation of clean air actions in China (Huang et al., 2018). Following growing public health concern, many air quality monitoring stations have been established since 2013 to measure real-time PM_2.5 concentrations. However, the measurements cover only a short period and are unevenly distributed in space (Wang et al., 2019; Zhao et al., 2020), so they are insufficient to describe the long-term characteristics of PM_2.5 in China. The spatiotemporal variation of PM_2.5 and its relationship with changes in emissions, weather and climate can improve the current understanding of pollution formation and give policy makers a scientific basis for air quality improvement. It is therefore essential to produce a long-term dataset of gridded surface PM_2.5 concentrations based on observed data in China.

To overcome the limited spatiotemporal coverage of surface PM_2.5 observations, satellite remote sensing data have recently been widely used to estimate surface PM_2.5 concentrations (Fang et al., 2016; Wei et al., 2019). In general, aerosol optical depth (AOD) derived from satellites is positively correlated with near-surface PM_2.5 concentrations. On this basis, a variety of statistical models, including multiple linear regression (Chelani, 2019), geographically weighted regression (Ma et al., 2014; Guo et al., 2017), the linear mixed-effect model (Zheng et al., 2016), and the two-stage model (Ma et al., 2016; Yao et al., 2019), have been applied to estimate PM_2.5. In addition, machine learning has become a common tool for regression tasks owing to its computational efficiency and state-of-the-art performance (Stafoggia et al., 2019). Wei et al. (2019) produced PM_2.5 concentrations at 1-km resolution in China for 2016 based on satellite AOD using the space-time random forest (STRF) model, with a cross-validation (CV) coefficient of determination (R²) of 0.85. Li et al. (2017) estimated PM_2.5 in 2015 over China using a geo-intelligent deep learning model together with satellite AOD data, in which the CV R² increased from 0.42 to 0.88 relative to the traditional neural network method. However, these PM_2.5 estimates still have limitations. First, the Moderate Resolution Imaging Spectroradiometer (MODIS) data were not available until 1999, and the Suomi National Polar-orbiting Partnership (S-NPP) satellite was launched in 2011. Most of the studies mentioned above used AOD derived from these two satellites to predict PM_2.5, and consequently the PM_2.5 data are not available before 2000 (van Donkelaar et al., 2015; Xue et al., 2019). Additionally, AOD represents aerosol loading in the entire atmospheric column, and its relationship with near-surface PM_2.5 concentrations is strongly influenced by planetary boundary layer height, relative humidity, temperature, and other factors (Liu et al., 2009). Moreover, algorithm bias, signal uncertainty, and cloud contamination introduce biases into PM_2.5 estimates derived from AOD (Stafoggia et al., 2019; Xiao et al., 2017).

Atmospheric visibility measurements, which have been available for several decades in China, have been shown to be a promising alternative for estimating near-surface PM_2.5 concentrations (Shen et al., 2016). Li et al. (2020) derived PM_2.5 concentrations over North China in 2014 by combining visibility observations with GEOS-Chem model simulations and reported that the estimated PM_2.5 was highly correlated with surface observations in time and space, with correlation coefficients of 0.96 and 0.79, respectively. Liu et al. (2017) estimated historical (1957–1964 and 1973–2014) PM_2.5 in China using visibility measurements and a statistical approach, and found that the model can accurately estimate PM_2.5 concentrations, with a CV R² of 0.71. They also reported an increasing trend of 1.9 μg/m³/decade averaged over China during 1957–2014. Because machine learning methods handle non-linear and complex relationships between variables better than traditional statistical approaches, they can also be used for visibility-based PM_2.5 prediction. Using a machine learning model (Extreme Gradient Boosting), Gui et al. (2020) constructed surface PM_2.5 concentrations in 2018 over China based on visibility and meteorological data, which demonstrated the potential of machine learning for reconstructing long-term PM_2.5 data in China. Furthermore, in addition to visibility and meteorology, other factors such as emissions, topography, population and land use should be considered in the machine learning model to simulate PM_2.5 concentrations and their spatiotemporal distributions.

In this study, we construct a gridded dataset of near-surface PM_2.5 concentrations across China covering 1980–2019 using the STRF model along with atmospheric visibility and other auxiliary data (e.g., meteorology, anthropogenic emissions, land use, topography, population density and spatiotemporal information), which have a longer time coverage and are more representative of the near-surface aerosols than the data based on satellite AOD. The performance of the STRF model in estimating PM_2.5 in China is evaluated and the long-term variations of PM_2.5 are characterized.
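Decadal rates such as those quoted above (in μg/m³/decade) are commonly obtained from a linear fit to a time series of annual means. The following is a minimal sketch of that calculation, assuming an ordinary least-squares fit; the function name `decadal_trend` and the synthetic values are illustrative only, and this excerpt does not state which trend estimator the authors used.

```python
def decadal_trend(years, values):
    """Ordinary least-squares slope of values against years, scaled to per decade."""
    n = len(years)
    if n < 2 or n != len(values):
        raise ValueError("need at least two (year, value) pairs of equal length")
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    # Slope is in units per year; multiply by 10 to express it per decade.
    return 10.0 * sxy / sxx


# Synthetic annual means (NOT data from the paper): a rise of 0.5 ug/m3 per year.
years = list(range(1980, 1990))
annual_pm25 = [40.0 + 0.5 * (y - 1980) for y in years]
trend = decadal_trend(years, annual_pm25)
print(f"trend: {trend:.2f} ug/m3 per decade")
```

Applied to each grid cell separately, a function of this kind would yield a map of decadal trends.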

Datasets

We use hourly surface PM_2.5 observations from recent years (2014–2019), long-term atmospheric visibility observations, and auxiliary data (e.g., meteorological variables, anthropogenic emissions, land use, national population, topography, and geographic and time variables of observations). The sources and preprocessing of data are elaborated below.
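Because the observations are hourly while the final product is daily and gridded at 1 degree, some aggregation step is implied. The sketch below shows one plausible way to do it: average each station's hourly values into a daily mean, then average the station means that fall in the same grid cell. The record layout, the `min_hours` completeness threshold, and the function names are assumptions made for illustration; the authors' actual quality control and gridding rules are not given in this excerpt.

```python
import math
from collections import defaultdict
from datetime import datetime


def grid_cell(lat, lon, size=1.0):
    """Return the south-west corner of the size-degree cell containing (lat, lon)."""
    return (math.floor(lat / size) * size, math.floor(lon / size) * size)


def daily_grid_means(records, size=1.0, min_hours=1):
    """Aggregate (timestamp, lat, lon, pm25) records to {(date, cell): mean PM2.5}."""
    # Step 1: collect valid hourly values per station and calendar day.
    per_station_day = defaultdict(list)
    for ts, lat, lon, pm25 in records:
        if pm25 is None or pm25 < 0:
            continue  # skip missing or physically impossible values
        per_station_day[(ts.date(), lat, lon)].append(pm25)

    # Step 2: average station daily means that fall inside the same grid cell.
    per_cell_day = defaultdict(list)
    for (day, lat, lon), hourly in per_station_day.items():
        if len(hourly) >= min_hours:  # hypothetical completeness threshold
            station_mean = sum(hourly) / len(hourly)
            per_cell_day[(day, grid_cell(lat, lon, size))].append(station_mean)

    return {key: sum(v) / len(v) for key, v in per_cell_day.items()}


# Synthetic records (NOT data from the paper): two stations in the same cell.
records = [
    (datetime(2015, 1, 1, 0), 39.9, 116.4, 30.0),
    (datetime(2015, 1, 1, 1), 39.9, 116.4, 50.0),
    (datetime(2015, 1, 1, 0), 39.2, 116.8, 60.0),
    (datetime(2015, 1, 1, 1), 39.2, 116.8, None),
]
gridded = daily_grid_means(records)
for (day, cell), value in sorted(gridded.items()):
    print(day, cell, round(value, 2))
```

Averaging station means rather than pooling all hourly values keeps a station with more valid hours from dominating its cell, which is one reasonable design choice among several.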

Model performance and importance of input variables

Fig. 2a presents the density scatterplot of the fitting performance of the STRF model. The validation set used for model evaluation comprises 372,596 daily surface PM_2.5 observations across China during 2014–2018. The STRF model slightly underestimates the PM_2.5 concentrations, with a regression slope of 0.86. The values of R², mean absolute error (MAE), root mean square error (RMSE) and mean relative error (MRE) are 0.95, 5.02 μg/m³, 8.92 μg/m³ and 12%, respectively, indicating good agreement between the estimated PM_2.5 and surface observations.
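For readers who want to reproduce this kind of evaluation, the sketch below computes the same five statistics from paired observations and estimates. It assumes commonly used definitions: R² as the squared Pearson correlation, the slope from regressing estimates on observations, and MRE as the mean of |estimate − observation| / observation expressed in percent. The excerpt does not spell out the exact formulas the authors used, so these definitions and the function name `evaluation_metrics` are assumptions.

```python
import math


def evaluation_metrics(obs, pred):
    """Return R2, slope, MAE, RMSE and MRE (%) for paired observations and estimates."""
    n = len(obs)
    if n < 2 or n != len(pred):
        raise ValueError("obs and pred must have equal length >= 2")
    if any(o <= 0 for o in obs):
        raise ValueError("MRE requires strictly positive observations")

    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    s_oo = sum((o - mean_o) ** 2 for o in obs)
    s_pp = sum((p - mean_p) ** 2 for p in pred)
    s_op = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))

    errors = [p - o for o, p in zip(obs, pred)]
    return {
        "r2": s_op ** 2 / (s_oo * s_pp),   # squared Pearson correlation
        "slope": s_op / s_oo,              # estimates regressed on observations
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        "mre_percent": 100.0 * sum(abs(e) / o for e, o in zip(errors, obs)) / n,
    }


# Synthetic values in ug/m3 (NOT data from the paper).
observed = [10.0, 20.0, 30.0, 40.0]
estimated = [12.0, 18.0, 33.0, 37.0]
metrics = evaluation_metrics(observed, estimated)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these definitions, a slope below 1 means the estimates span a narrower range than the observations, which is consistent with high concentrations being underestimated.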

The

Conclusion and discussions

In this study, the STRF machine learning model is trained with the input of atmospheric visibility observations, meteorology, land use, topography, anthropogenic emissions, population, and relevant spatiotemporal information to construct a 1-degree gridded near-surface daily PM_2.5 concentration dataset from 1980 to 2019. This spatiotemporally coherent historical PM_2.5 dataset is useful to study the long-term aerosol variations over China.

The PM_2.5 estimates are well correlated with near-surface

CRediT authorship contribution statement

Huimin Li: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft. Yang Yang: Conceptualization, Data curation, Formal analysis, Project administration, Supervision, Writing – review & editing. Hailong Wang: Formal analysis, Writing – review & editing. Baojie Li: Data curation. Pinya Wang: Data curation, Formal analysis. Jiandong Li: Data curation. Hong Liao: Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (grant 41975159) and the National Key Research and Development Program of China (grants 2020YFA0607803 and 2019YFA0606800). HW acknowledges the support by the U.S. Department of Energy (DOE), Office of Science, Office of Biological and Environmental Research (BER), as part of the Earth and Environmental System Modeling program. The Pacific Northwest National Laboratory (PNNL) is operated for DOE by the Battelle Memorial

References (46)

A. Aili et al., 2015. Effects of dust storm on public health in desert fringe area: case study of northeast edge of Taklimakan Desert, China. Atmos. Pollut. Res.

M.A. Barrero et al., 2015. Categorisation of air quality monitoring stations by evaluation of PM₁₀ variability. Sci. Total Environ.

A.B. Chelani, 2019. Estimating PM_2.5 concentration from satellite derived aerosol optical depth and meteorological variables using a combination model. Atmos. Pollut. Res.

A.J. Cohen et al., 2017. Estimates and 25-year trends of the global burden of disease attributable to ambient air pollution: an analysis of data from the global burden of diseases study 2015. Lancet

X. Fang et al., 2016. Satellite-based ground PM_2.5 estimation using timely structure adaptive modeling. Remote Sens. Environ.

K. Gui et al., 2020. Construction of a virtual PM_2.5 observation network in China based on high-density surface meteorological observations using the Extreme Gradient Boosting model. Environ. Int.

Y. Guo et al., 2017. Estimating ground-level PM_2.5 concentrations in Beijing using a satellite-based geographically and temporally weighted regression model. Remote Sens. Environ.

S. Li et al., 2020. Retrieval of surface PM_2.5 mass concentrations over North China using visibility measurements and GEOS-Chem simulations. Atmos. Environ.

P. Pant et al., 2016. Exposure to particulate matter in India: a synthesis of findings and future directions. Environ. Res.

Z. Shen et al., 2016. Retrieving historical ambient PM_2.5 concentrations using existing visibility measurements in Xi’an, Northwest China. Atmos. Environ.

M. Stafoggia et al., 2019. Estimation of daily PM₁₀ and PM_2.5 concentrations in Italy, 2013–2015, using a spatiotemporal land-use random-forest model. Environ. Int.

W. Wang et al., 2019. Estimation of PM_2.5 concentrations in China using a spatial back propagation neural network. Sci. Rep.

J. Wei et al., 2019. Estimating 1-km-resolution PM_2.5 concentrations across China using the space-time random forest approach. Remote Sens. Environ.

Q. Xiao et al., 2017. Full-coverage high-resolution daily PM_2.5 estimation using MAIAC AOD in the Yangtze River Delta of China. Remote Sens. Environ.

T. Xue et al., 2019. Spatiotemporal continuous estimates of PM_2.5 concentrations in China, 2000–2016: a machine learning method with inputs from satellites, chemical transport model, and ground observations. Environ. Int.

F. Yao et al., 2019. A spatially structured adaptive two-stage model for retrieving ground-level PM_2.5 concentrations from VIIRS AOD in China. ISPRS J. Photogramm. Remote Sens.

C. Zhao et al., 2020. Estimating the daily PM_2.5 concentration in the Beijing-Tianjin-Hebei region using a random forest model with a 0.01° × 0.01° spatial resolution. Environ. Int.

Y. Zheng et al., 2016. Estimating ground-level PM_2.5 concentrations over three megalopolises in China using satellite-derived aerosol optical depth measurements. Atmos. Environ.

Boucher, O., Randall, D., Artaxo, P., Bretherton, C., Feingold, G., Forster, P., Kerminen, V.M., Kondo, Y., Liao, H.,...

Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32....

CMA, 2014. Forecasting and Networking Department of China Meteorological Administration released letter No.4: Notice on...

D.L. Crouse et al., 2012. Risk of nonaccidental and cardiovascular mortality in relation to long-term exposure to low concentrations of fine particulate matter: a Canadian national-level cohort study. Environ. Health Perspect.

A. van Donkelaar et al., 2015. Use of satellite observations for long-term exposure assessment of global concentrations of fine particulate matter. Environ. Health Perspect.

