Introduction

Low Earth orbit (LEO) will see the addition of tens of thousands of satellites in the coming years as private companies are developing mega constellations for the new space economy1. This congestion of certain orbital regimes increases the likelihood of a future collision between two objects. Satellite collisions can create debris clouds consisting of thousands of objects large enough to pose significant threats to other space assets. The 2009 Iridium-Cosmos collision resulted in approximately 2300 observable debris objects, 65% of which remained in orbit 7 years later2. Debris objects created by collisions or weapons tests can catapult into highly elliptical orbits which pose a danger to satellites in multiple orbital regimes3.

In an effort to prevent these events from occurring, objects are continuously tracked, and their trajectories are predicted. However, uncertainties play a large role in the prediction of future satellite positions. In LEO, atmospheric drag is the largest single source of uncertainty, mainly due to an incomplete understanding of the thermosphere. Variations in the thermosphere are connected to temperature changes, as the atmosphere expands and contracts. Solar extreme ultraviolet (EUV) and far ultraviolet (FUV) irradiance are the primary heating sources4. The absorption of this solar irradiance sets the baseline thermospheric mass density5. The effects of solar emissions are well-represented by various solar indices and proxies6.

Solar irradiance drives mostly long-term variations, while the solar wind drives more rapid changes in the thermosphere. Mass and energy from the sun—manifested as the solar wind—travel through space and interact with the near-Earth geospace environment. Certain events (e.g. coronal mass ejections) deposit massive amounts of energy and mass that result in significant increases in thermospheric density. Energy, and therefore density, enhancements first appear in the auroral zone (high latitudes) and propagate towards the equator in the form of traveling atmospheric disturbances7. Geomagnetic storms are a particularly difficult phenomenon to model, and our current density models carry high uncertainty during these periods8,9.

Satellite accelerometers have provided a unique insight into the thermosphere with high fidelity in-situ measurements, particularly during storms10. Accelerations caused by non-drag sources (e.g. gravity and solar radiation pressure) are modeled out allowing the isolation of drag acceleration that is then used to estimate mass density11,12,13,14,15. Drag acceleration is given as

$$\begin{aligned} \vec{a}_{\text{drag}} = -\frac{1}{2} \rho \frac{c_D A}{m} v^2_{\text{rel}}\frac{\vec{v}_{\text{rel}}}{\left| \vec{v}_{\text{rel}}\right| } \end{aligned}$$
(1)

where \(\vec {a}_{\text {drag}}\) is the drag acceleration, \(\rho\) is the local mass density, \(c_D\) is the satellite drag coefficient, A is the cross-sectional area, m is the satellite mass, and \(v_{\text {rel}}\) is the velocity of the satellite relative to the rotating atmosphere. With an estimate of the drag acceleration, the density can be estimated, assuming adequate knowledge of the drag coefficient and of the cross-sectional area given the satellite orientation. Density estimates obtained through this method are considered ground truth and are often used for model validation.
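As a minimal illustration of this inversion, the sketch below rearranges Eq. (1) to solve for density from a measured drag acceleration; the numerical values are purely illustrative and are not taken from an actual CHAMP epoch.

```python
import numpy as np

def density_from_drag(a_drag, v_rel, c_d, area, mass):
    """Invert Eq. (1): rho = 2 m |a_drag| / (c_D A |v_rel|^2).

    a_drag : drag acceleration vector [m/s^2]
    v_rel  : velocity relative to the co-rotating atmosphere [m/s]
    c_d    : drag coefficient (from a gas-surface interaction model)
    area   : cross-sectional area normal to v_rel [m^2]
    mass   : satellite mass [kg]
    """
    return 2.0 * mass * np.linalg.norm(a_drag) / (c_d * area * np.linalg.norm(v_rel) ** 2)

# Illustrative values only
rho = density_from_drag(a_drag=np.array([-3.2e-7, 0.0, 0.0]),
                        v_rel=np.array([7600.0, 0.0, 0.0]),
                        c_d=2.6, area=0.77, mass=522.0)
print(f"estimated density: {rho:.3e} kg/m^3")  # roughly 3e-12 kg/m^3
```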

Accelerometer and orbit-derived densities have been used frequently in developing empirical models16,17,18. Furthermore, they have been used in data assimilation schemes to make corrections to background models, either through observed orbital drag data19 or two-line element data20. The most prominent integration of real-time data for neutral density modeling is the High Accuracy Satellite Drag Model’s (HASDM) Dynamic Calibration of the Atmosphere (DCA). This uses observed satellite data to make corrections to a background empirical density model21.

Even with these improvements, density models have high errors, and we generally use them without any knowledge of their confidence given the conditions. Until recently, no thermospheric density models—whether physics-based or empirical—provided estimates of uncertainty. Bruinsma et al. developed an uncertainty-based version of DTM2020 using polynomials to describe the 1\(\sigma\) uncertainties as a function of the inputs22. Licata et al. used MC dropout to obtain uncertainty estimates for a global density modeling application with good calibration, providing baseline performance23. In this work, we leverage machine learning (ML) to generate predictive density models for the thermosphere that also provide robust and reliable uncertainty estimates. This is done for both global and local datasets using two methods: Monte Carlo (MC) dropout and a direct prediction of the probability distribution (referred to primarily as direct probability).

We first outline the data and methods used for model development and analysis. Then, we use artificial data to demonstrate the techniques. We move to the results for modeling with the global density dataset using the two uncertainty techniques and perform a similar analysis for the models developed on local measurements. We also look at the global prediction capabilities of the model developed with in-situ data, and we compare the evaluation times of both uncertainty methods.

SET HASDM density database

The High Accuracy Satellite Drag Model (HASDM) is the operational thermospheric density framework used by the United States Space Force (USSF)21. By improving the density correction techniques presented by Marcos et al.24 and Nazarenko et al.25, HASDM modifies 13 global temperature coefficients to make real-time corrections to the Jacchia-Bowman 2008 Empirical Thermospheric Density Model (JB2008)17. Its Dynamic Calibration of the Atmosphere (DCA) algorithm ingests data from calibration satellites distributed in altitude between 190 and 900 km—most are between 300 and 600 km26. As the algorithm provides corrections to JB2008, HASDM provides global density outputs on a 24 × 19 × 27 grid. For additional information on HASDM, the reader is referred to Storz et al.21.

While HASDM is highly desired due to its real-time data assimilation, it is a proprietary model that is inaccessible to researchers and operators. Space Environment Technologies (SET) is the contractor responsible for validating HASDM outputs on a weekly basis, and they recently released HASDM validation archives from 2000 to 2020, covering close to two full solar cycles and providing good statistical coverage. This data release constitutes the SET HASDM density database27. With a 3-h cadence, the database contains 58,440 global HASDM outputs. Each output has a resolution of 15\(^\circ\) longitude, 10\(^\circ\) latitude, and 25 km altitude, ranging from 175 to 825 km. For further details on the SET HASDM density database and the validation process, the reader is referred to Tobiska et al.27.

Satellite accelerometer density estimates

CHAllenging Minisatellite Payload (CHAMP) was launched in mid-2000 to study Earth’s gravity and magnetic fields28. Its orbit is nearly polar with an inclination of 87.3\(^\circ\), providing adequate global coverage, and it began at an altitude of 460 km. CHAMP remained in orbit until 2010, when it stopped providing measurements at an altitude of approximately 300 km. The long mission lifetime covered nearly a solar cycle, providing measurements in solar maximum, across many strong geomagnetic storms, and through the following solar minimum. Mehta et al. used higher fidelity satellite geometry and improved gas-surface interaction models to scale the CHAMP density estimates of Sutton29,30,31,32. The density dataset starts on January 1, 2002 and ends on February 22, 2010 with a 10-s cadence. The CHAMP dataset is well-suited for demonstration, as spatiotemporally limited in-situ datasets are common in the space weather field. This type of model can be built upon with the addition of density estimates from other satellites, shown in Fig. 1.

Figure 1 Timeline of satellites with onboard accelerometers from 2000 to 2018, with bold lines representing the satellites’ mean altitude. e-POP does not contain an accelerometer. The bottom panel shows the corresponding F10 values indicating solar activity. The authors gratefully acknowledge Dr. Eelco Doornbos for providing this figure.

The addition of all satellites shown in Fig. 1 would significantly expand the altitude coverage of the in-situ density dataset. The CHAMP dataset used in this work has a 160 km altitude range and does not span a full solar cycle. Integrating the density datasets of Gravity Recovery and Climate Experiment (GRACE)33, Gravity Field and Steady-State Ocean Circulation Explorer (GOCE)34, and Swarm35 would provide a thorough altitude coverage of approximately 220–550 km and span from 2001 to present day. The Enhanced Polar Outflow Probe (e-POP)36 is a payload on Cascade, Smallsat and Ionospheric Polar Explorer (CASSIOPE) and its density estimates can be obtained through the processing of its Global Navigation Satellite System (GNSS) receivers37,38. However, there is still much active research related to the proper combination of different satellite density datasets39,40,41. Therefore, we proceed with the standalone CHAMP dataset for demonstration.

Density model drivers

JB2008 uses four solar indices/proxies as drivers for solar activity. F10—more completely referred to as F10.7—represents 10.7 cm solar radio flux and is a reliable proxy for solar EUV heating. S10 is an index for the integrated 26–34 nm solar EUV emission. The M10 proxy is a surrogate for FUV photospheric 160-nm Schumann-Runge Continuum emissions. Y10 is a hybrid index that measures solar coronal X-ray emissions during solar maximum and Lyman-\(\alpha\) emissions during solar minimum. The S10, M10, and Y10 indices and proxies are not related to the 10.7 cm wavelength, but they are converted to F10 units—solar flux units (sfu)—through linear regression. JB2008 also uses the 81-day centered averages for all four solar drivers. This is indicated by the “81c” subscript. Additional information on these solar drivers is provided by Tobiska et al.6.

To model geomagnetic activity, JB2008 uses a combination of ap and Dst. The ap index represents global geomagnetic activity with a 3-h cadence. While it is widely used in density models, it is limited by the low-latitude range of the measurements and its discrete range of 28 values. Dst is an index driven by the ring current strength in the inner magnetosphere42. When Dst_min is below − 75 nT, JB2008 shifts to using Dst as it improves storm-time performance17. The EXTEMPLAR (EXospheric TEMperatures on a PoLyhedrAl gRid) model and EXTEMPLAR-ML use Poynting flux totals in the northern and southern hemispheres—SN and SS, respectively43,44. Poynting flux represents electrodynamic energy flowing into the upper atmosphere. The ap and Dst indices have 3-h and 1-h cadences, respectively. Therefore, their use in a high-cadence model would not be advised. The geomagnetic index used to replace ap and Dst in the CHAMP model is SYM-H, the longitudinally symmetric component of the magnetic field disturbances45,46. SYM-H is available with a 1-min cadence.

The input sets for the HASDM and CHAMP models are shown in Table 1. The transformed time inputs t1–t4 are defined in Eq. (2), and the transformed local solar time (LST) inputs are defined in Eq. (3). These transformations are performed to make the time and location inputs continuous.

$$\begin{aligned} t_1=\sin \left( \frac{2\pi \, doy}{365.25}\right) , \;\;\;\; t_2=\cos \left( \frac{2\pi \, doy}{365.25}\right) , \;\;\;\; t_3=\sin \left( \frac{2\pi \, UT}{24}\right) , \;\;\;\; t_4=\cos \left( \frac{2\pi \, UT}{24}\right) . \end{aligned}$$
(2)
$$\begin{aligned} LST_1=\sin \left( \frac{2\pi \, LST}{24}\right) , \;\;\;\; LST_2=\cos \left( \frac{2\pi \, LST}{24}\right) \end{aligned}$$
(3)
Table 1 List of inputs for both models.

In Table 1, LST, LAT, and ALT are the local time, latitude, and altitude of the satellite, respectively. The “A” subscript for the geomagnetic indices refers to the daily average. A single numerical subscript on these indices refers to the value that many hours prior to the prediction epoch, while a combination of numbers refers to the average of that index over that range of hours prior to epoch. The HASDM input set originates from Licata et al.23.
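The cyclic transforms in Eqs. (2) and (3) amount to a few lines of code. The sketch below is a minimal example; the function and variable names (encode_time_inputs, doy, ut_hours, lst_hours) are our own and not part of the published model.

```python
import numpy as np

def encode_time_inputs(doy, ut_hours, lst_hours):
    """Cyclic transforms from Eqs. (2) and (3) so that, e.g., day 365 and day 1
    (or 23:59 and 00:01) map to neighboring points in the input space."""
    t1 = np.sin(2 * np.pi * doy / 365.25)
    t2 = np.cos(2 * np.pi * doy / 365.25)
    t3 = np.sin(2 * np.pi * ut_hours / 24.0)
    t4 = np.cos(2 * np.pi * ut_hours / 24.0)
    lst1 = np.sin(2 * np.pi * lst_hours / 24.0)
    lst2 = np.cos(2 * np.pi * lst_hours / 24.0)
    return t1, t2, t3, t4, lst1, lst2

# Example: day 355 at 06:00 UT with a 14.5 h local solar time
print(encode_time_inputs(doy=355, ut_hours=6.0, lst_hours=14.5))
```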

Methodology

Machine learning

Principal component analysis

Principal component analysis (PCA), also referred to as Empirical Orthogonal Function (EOF) analysis, has been widely used in thermospheric mass density applications. It has been applied to satellite accelerometer datasets (as described in “Satellite accelerometer density estimates”) to identify dominant modes of variability in the thermosphere15,47,48. PCA is also used in the field for dimensionality reduction as part of a reduced-order model (ROM)49,50,51. The HASDM dataset has 12,312 model outputs at each epoch, which makes uncertainty quantification (UQ) infeasible. Therefore, we apply PCA to the dataset for ROM development with the goal of UQ. PCA is an eigendecomposition technique that maximizes variance, determining uncorrelated linear combinations of data52,53. We first take the common logarithm (log10) of the density values to reduce the data’s variance, then remove the spatial mean. PCA decomposes the data, separating the spatial and temporal variations such that

$$\begin{aligned} {\mathbf {x}}\left( {\mathbf {s}},t\right) ={\bar{\mathbf{x}}} \left( {\mathbf {s}}\right) +\mathbf {{\widetilde{x}}}\left( {\mathbf {s}},t\right) \;\;\;\text {and}\;\;\; \mathbf {{\widetilde{x}}}\left( {\mathbf {s}},t\right) =\sum ^r_{i=1}\alpha _i\left( t\right) U_i\left( {\mathbf {s}}\right) \end{aligned}$$
(4)

In Eq. (4), \({\mathbf {x}}\left( {\mathbf {s}},t\right)\) is the log density from HASDM, \({\bar{\mathbf {x}}}\) is the spatial mean, and \(\mathbf {{\widetilde{x}}}\) is the variation about the mean. \(\alpha _i(t)\) are temporal PCA coefficients and \(U_i\) are orthogonal modes—also called basis functions. The order of truncation (r = 10) was chosen based on analyses in previous work, as it captures \(\sim\) 90% of the variance while introducing \(<3\)% truncation error23,54. The orthogonal modes are derived through

$$\begin{aligned} {\mathbf {X}} = U \Sigma V^T \end{aligned}$$
(5)

U consists of orthogonal vectors representing the modes of variability. The \(\Sigma\) matrix contains the singular values—corresponding to the columns in U—along the diagonal, and \(V^T\) is composed of the right singular vectors of \({\mathbf {X}}\). The data is encoded (projected onto the truncated basis) by performing matrix multiplication with U.
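A minimal sketch of the resulting encode/decode pipeline is given below. The snapshot array and file name are hypothetical placeholders, and for a matrix of this size a truncated or randomized SVD would likely be preferable in practice.

```python
import numpy as np

r = 10  # order of truncation

# Hypothetical HASDM snapshots flattened to (n_epochs, 24*19*27); placeholder file name
X_grid = np.load("hasdm_density_snapshots.npy")
X_log = np.log10(X_grid)               # common logarithm to reduce variance
x_bar = X_log.mean(axis=0)             # spatial mean
X_tilde = X_log - x_bar                # variation about the mean

# Singular value decomposition of the mean-removed data (Eq. 5);
# the columns of U are the spatial modes U_i
U, S, Vt = np.linalg.svd(X_tilde.T, full_matrices=False)
U_r = U[:, :r]

# Encode: temporal PCA coefficients alpha_i(t), used as the NN targets
alpha = X_tilde @ U_r                  # shape (n_epochs, r)

# Decode: reconstruct log10 density from (predicted) coefficients via Eq. (4)
X_recon = alpha @ U_r.T + x_bar
print("truncation RMS error (log10 density):", np.sqrt(np.mean((X_recon - X_log) ** 2)))
```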

Neural network modeling

In this work, we leverage neural networks (NNs) for nonlinear regression modeling due to their applicability as universal function approximators and flexibility in development. A neural network is a collection of computational cells (or neurons) connected in some form through multiplicative connections (or weights). Artificial neural networks (ANNs) have been used to directly predict thermospheric mass density using space weather indices and proxies as model drivers in order to study long-term trends55,56. These types of models can also be used as an exercise in understanding the effect of the drivers on non-machine learning (ML) models57. Chen et al.58 developed ANNs with different combinations of geomagnetic indices to fit to CHAMP and GRACE density estimates during storms, and Choury et al.59 developed an ANN to predict exospheric temperature for use in the Drag Temperature Model (DTM).

Loss functions

Loss functions are used to inform the NN of the objective during the training phase, or weight adjustment period. Loss functions can be minimized or maximized depending on the modeling objective. In this work, we minimize the negative logarithm of predictive density (NLPD) given as,

$$\begin{aligned} NLPD(y,\mu ,\sigma ) = \frac{1}{n}\sum ^n_{i=1}\left( \frac{\left( y_i-\mu _i\right) ^2}{2\sigma _i^2} + \frac{\ln (\sigma _i^2)}{2} + \frac{\ln (2\pi )}{2}\right) \end{aligned}$$
(6)

where y is the observed value, \(\mu\) is the predicted mean, \(\sigma\) is the standard deviation of the output corresponding to each unique input, and n is the batch size. The batch size is the number of samples the model passes through before updating the weights, averaging the loss over the batch. Losses are computed for every output, and backpropagation determines how much each weight should change60,61. NLPD is derived from \(-\ln (f(x))\), where ln is the natural logarithm and f(x) is the probability density function of the normal distribution.
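A minimal TensorFlow sketch of Eq. (6) is shown below. In a Keras workflow the predicted mean and standard deviation typically arrive packed in a single y_pred tensor (see the direct probability implementation later), but the three-argument form keeps the correspondence with Eq. (6) explicit.

```python
import numpy as np
import tensorflow as tf

def nlpd_loss(y_true, mu, sigma):
    """Gaussian negative log predictive density (Eq. 6), averaged over the batch.
    mu and sigma are the predicted mean and standard deviation for each sample."""
    var = tf.square(sigma) + 1e-12          # small epsilon for numerical stability
    return tf.reduce_mean(
        tf.square(y_true - mu) / (2.0 * var)
        + 0.5 * tf.math.log(var)
        + 0.5 * np.log(2.0 * np.pi)
    )
```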

Hyperparameter optimization

Tools like Keras Tuner have drastically reduced model development time62. The user provides ranges of hyperparameters, and Keras Tuner explores the search space. We use the Bayesian optimization scheme, allowing the tuner to perform a random search for the first 25 trials, or architectures, and then use a Gaussian process (GP) model to choose the architectures for the final 75 trials, exploiting the high-performing areas of the space. The objective of the tuner is to minimize the validation loss. The model optimizer and number of layers are chosen first by the tuner. For each layer, the model can have a unique number of neurons, activation function, and dropout rate. For each model developed (two datasets and two UQ techniques), the architecture is selected using Keras Tuner. It is important to note that the tuner can identify multiple architectures that provide similar results.
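The sketch below illustrates such a search with Keras Tuner's BayesianOptimization; the parameter ranges, layer sizes, and placeholder loss are illustrative assumptions rather than the exact search space used here (the NLPD loss and probabilistic output layers are described in the following sections).

```python
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    """Hypermodel sketch: the tuner chooses the optimizer, the number of layers,
    and, per layer, the number of neurons, activation function, and dropout rate."""
    model = tf.keras.Sequential()
    for i in range(hp.Int("num_layers", 2, 8)):
        model.add(tf.keras.layers.Dense(
            units=hp.Int(f"units_{i}", 32, 512, step=32),
            activation=hp.Choice(f"activation_{i}", ["relu", "tanh", "sigmoid"])))
        model.add(tf.keras.layers.Dropout(hp.Float(f"dropout_{i}", 0.0, 0.5, step=0.05)))
    model.add(tf.keras.layers.Dense(10))    # e.g., the 10 HASDM PCA coefficients
    model.compile(optimizer=hp.Choice("optimizer", ["adam", "rmsprop", "nadam"]),
                  loss="mse")               # placeholder; this work minimizes NLPD
    return model

tuner = kt.BayesianOptimization(
    build_model,
    objective="val_loss",
    max_trials=100,          # 25 random trials followed by 75 GP-guided trials
    num_initial_points=25,
    overwrite=True,
    directory="tuner_logs",
    project_name="density_uq")

# tuner.search(x_train, y_train, validation_data=(x_val, y_val), epochs=50)
# best_model = tuner.get_best_models(num_models=1)[0]
```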

The 58,440 samples in the HASDM dataset are split into 60% training, 20% validation, and 20% test data. This is displayed in Fig. 2. As the number of training and validation samples is manageable, the full sets are used in tuning. We obtain the HASDM models directly from the tuner without a need to train further.

Figure 2 First 10 HASDM PCA coefficients with F10 and ap. The shading represents the training, validation, and test sets.

The CHAMP dataset is significantly larger with over 25 million total samples. Unlike the HASDM dataset, location is now an input. CHAMP only covers the local solar time domain once every 3 months, and the dataset does not span an entire solar cycle. We originally tried using a 10-month segment of 2003 for validation and a 10-month segment from 2005 to 2006 for testing. This resulted in poor model generalization due to the lack of solar cycle coverage in the remaining training set. Therefore, we adopted the following repeating data split scheme: eight weeks are used for training (483,840 samples), the following week is used for validation (60,480 samples), and the next week is used for the test set (60,480 samples). This results in similar input and output distributions while keeping the sets temporally disjoint, as there are 2 weeks, or 120,960 samples, between the training segments. For the tuner, 1 million random samples are chosen from the training data and 500,000 random samples are chosen from the validation data. Once the tuner is complete, the best models are retrained on the full training set and evaluated on the other two sets.
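A sketch of this repeating 8/1/1-week split is given below, assuming the nominal 10-s cadence; the exact handling of the dataset boundaries in our implementation may differ.

```python
import numpy as np

SAMPLES_PER_WEEK = 7 * 24 * 360   # 10-s cadence -> 60,480 samples per week

def split_8_1_1(n_samples):
    """Assign sample indices to train/validation/test with the repeating
    8-week / 1-week / 1-week scheme described above."""
    train_idx, val_idx, test_idx = [], [], []
    start = 0
    while start < n_samples:
        t_end = min(start + 8 * SAMPLES_PER_WEEK, n_samples)   # 8 weeks of training
        v_end = min(t_end + SAMPLES_PER_WEEK, n_samples)       # 1 week of validation
        s_end = min(v_end + SAMPLES_PER_WEEK, n_samples)       # 1 week of testing
        train_idx.extend(range(start, t_end))
        val_idx.extend(range(t_end, v_end))
        test_idx.extend(range(v_end, s_end))
        start = s_end
    return np.array(train_idx), np.array(val_idx), np.array(test_idx)

train_idx, val_idx, test_idx = split_8_1_1(n_samples=25_000_000)
print(len(train_idx), len(val_idx), len(test_idx))
```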

Uncertainty quantification

We use two ML techniques: MC dropout and direct probability distribution prediction, as UQ with machine-learned models is fairly unexplored in the space weather domain. Dropout is a generalization technique that applies Bernoulli distributions in each layer to change the flow of information through the model63,64. Dropout is traditionally only active during training to maintain a deterministic form in prediction. By forcing dropout to remain active in prediction, the model becomes probabilistic. MC dropout has been shown to be an approximation of a GP65. For both methods, we use the negative logarithm of predictive density (NLPD) loss function (Eq. 6). Licata et al. found that the mean square error loss function resulted in underestimated uncertainty estimates in surrogate modeling for the HASDM dataset23.

Monte Carlo dropout implementation

The typical input and output shapes are n \(\times\) ninp and n \(\times\) nout, respectively, where n is the number of samples, ninp is the number of inputs, and nout is the number of outputs. In training, the mean and standard deviation need to be unique to each input sample, so the model has to be provided each input k times, where k must be large enough to allow for adequate representation of the predicted distribution. The inputs and outputs for training are therefore stacked along a repeated intermediate axis: the training samples are identical along k but unique along n. The new input and output shapes—necessary for proper training—are n \(\times\) k \(\times\) ninp and n \(\times\) k \(\times\) nout, respectively. In each training batch, the mean and standard deviation are taken with respect to the intermediate axis, and the NLPD loss can be computed.
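The sketch below illustrates the stacking and the loss computation over the intermediate axis for a small synthetic problem; the layer sizes, dropout rates, and random data are illustrative, not the tuned architecture.

```python
import numpy as np
import tensorflow as tf

n_inp, n_out, k = 21, 10, 100   # illustrative sizes; k = repeated copies per sample

inputs = tf.keras.Input(shape=(k, n_inp))
x = tf.keras.layers.Dense(128, activation="relu")(inputs)
x = tf.keras.layers.Dropout(0.1)(x)       # independent masks per copy along the k axis
x = tf.keras.layers.Dense(128, activation="relu")(x)
x = tf.keras.layers.Dropout(0.1)(x)
outputs = tf.keras.layers.Dense(n_out)(x)
model = tf.keras.Model(inputs, outputs)

def mc_dropout_nlpd(y_true, y_pred):
    """y_true and y_pred have shape (batch, k, n_out); the mean and standard
    deviation are taken over the repeated intermediate axis (axis=1)."""
    mu = tf.reduce_mean(y_pred, axis=1, keepdims=True)
    sigma = tf.math.reduce_std(y_pred, axis=1, keepdims=True) + 1e-6
    var = tf.square(sigma)
    return tf.reduce_mean(tf.square(y_true - mu) / (2.0 * var)
                          + 0.5 * tf.math.log(var) + 0.5 * np.log(2.0 * np.pi))

# Stack each sample k times along the intermediate axis: (n, n_inp) -> (n, k, n_inp)
X = np.random.rand(256, n_inp).astype("float32")   # synthetic data for illustration
Y = np.random.rand(256, n_out).astype("float32")
X_stacked = np.repeat(X[:, None, :], k, axis=1)
Y_stacked = np.repeat(Y[:, None, :], k, axis=1)

model.compile(optimizer="adam", loss=mc_dropout_nlpd)
model.fit(X_stacked, Y_stacked, batch_size=32, epochs=2, verbose=0)

# At prediction time, dropout is forced to stay active with training=True,
# so the k copies of each input yield k different draws
draws = model(X_stacked, training=True).numpy()
mu, sigma = draws.mean(axis=1), draws.std(axis=1)
```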

Direct probability distribution prediction

Another way to represent uncertainty is to directly predict the mean and standard deviation. The mean square error loss function cannot be used here as there are no labels for the standard deviation. However, Nix and Weigend used a neural network to directly predict the mean and variance of a toy dataset using the NLPD loss function66. We implement this technique for the datasets presented. To accomplish this, we create a custom output layer with 2nout neurons. The first nout neurons represent the mean prediction and have a linear activation function. The last nout neurons represent the standard deviation and use the softplus activation function. The softplus function and its derivative—the sigmoid function—are shown in Eq. (7).

$$\begin{aligned} f(x) = \ln (1+e^x), \;\;\;\;\;\;\;\;\;\;\; f'(x) = \frac{e^x}{1+e^x} \end{aligned}$$
(7)

The desired qualities of the standard deviation output are that it is (1) always positive and (2) unbounded above. The initial choice of activation was the absolute value function; however, the resulting models had erratic loss values, and it was difficult to obtain a good model. The softplus function is (1) always positive, (2) unbounded above, (3) monotonically increasing, and (4) differentiable across all inputs. This resulted in stable training losses and better models.
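One way to realize this output structure is sketched below, using two Dense heads (linear mean, softplus standard deviation) concatenated into a single 2nout output rather than a single custom layer; the hidden-layer sizes are illustrative.

```python
import numpy as np
import tensorflow as tf

n_inp, n_out = 21, 10   # illustrative sizes

inputs = tf.keras.Input(shape=(n_inp,))
x = tf.keras.layers.Dense(256, activation="relu")(inputs)
x = tf.keras.layers.Dense(256, activation="relu")(x)
mean = tf.keras.layers.Dense(n_out, activation="linear", name="mean")(x)
std = tf.keras.layers.Dense(n_out, activation="softplus", name="std")(x)  # Eq. (7)
outputs = tf.keras.layers.Concatenate()([mean, std])   # 2*n_out outputs in total
model = tf.keras.Model(inputs, outputs)

def direct_nlpd(y_true, y_pred):
    """Split the 2*n_out prediction into its mean and standard deviation halves
    and evaluate the Gaussian NLPD of Eq. (6)."""
    mu, sigma = tf.split(y_pred, 2, axis=-1)
    var = tf.square(sigma) + 1e-12
    return tf.reduce_mean(tf.square(y_true - mu) / (2.0 * var)
                          + 0.5 * tf.math.log(var) + 0.5 * np.log(2.0 * np.pi))

model.compile(optimizer="adam", loss=direct_nlpd)
# After training: mu, sigma = np.split(model.predict(X), 2, axis=-1)
```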

Metrics

To compare the predictive capability of the models developed, we look at the mean absolute error across the training, validation, and test sets. The errors across different space weather conditions will be investigated as well. We also test the reliability of the uncertainty estimates both qualitatively and quantitatively. The calibration error score is given as

$$\begin{aligned} \text {Calibration Error} = \frac{100\%}{m \cdot n_{out}}\sum ^{n_{out}}_{i=1} \sum ^m_{j=1} \Big |p(z_{i,j})-p({\hat{z}}_{i,j})\Big | \end{aligned}$$
(8)

where m is the number of prediction intervals (PIs) of interest. Here the PIs range from 5 to 95% in 5% increments, in addition to 99%—[0.05, 0.10, 0.15, ... , 0.90, 0.95, 0.99]. \(p(z_{i,j})\) is the expected cumulative probability (the nominal PI level), and \(p({\hat{z}}_{i,j})\) is the observed cumulative probability obtained by dividing the number of true samples within the prediction interval by the total number of samples. Equation (8) is the miscalibration of prediction intervals averaged over each output and prediction interval tested. For this work, it provides the average deviation across all 20 PIs for each model output. We can visualize the reliability of the uncertainty estimates by plotting the calibration curve—\(p({\hat{z}})\) vs p(z).
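A minimal implementation of this score is sketched below, assuming Gaussian predictive distributions so that central prediction intervals can be formed from the predicted mean and standard deviation; the synthetic check at the end is illustrative only.

```python
import numpy as np
from scipy.stats import norm

def calibration_error(y_true, mu, sigma,
                      intervals=np.append(np.arange(0.05, 1.0, 0.05), 0.99)):
    """Average |expected - observed| coverage over the 20 prediction intervals
    and all outputs (Eq. 8). Inputs have shape (n_samples, n_out)."""
    errors = []
    for p in intervals:
        z = norm.ppf(0.5 + p / 2.0)        # half-width of the central interval in sigmas
        inside = np.abs(y_true - mu) <= z * sigma
        observed = inside.mean(axis=0)      # observed coverage for each output
        errors.append(np.abs(observed - p))
    return 100.0 * np.mean(errors)          # percentage

# Synthetic check: perfectly specified Gaussian predictions should score near zero
rng = np.random.default_rng(0)
mu = rng.normal(size=(50_000, 10))
sigma = np.full_like(mu, 0.5)
y = mu + sigma * rng.normal(size=mu.shape)
print(f"calibration error: {calibration_error(y, mu, sigma):.2f}%")
```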

Toy problems

To visualize the way the NLPD loss function influences training, we train models for two toy problems. Each problem is a function, y(x), with additive zero-mean Gaussian noise whose standard deviation has a functional form. These functions are displayed in Table 2. The results for Problem 1 are shown in Fig. 3.

Table 2 Functions for the two toy problems with the right column being the functional form of the Gaussian noise.
Figure 3 Mean prediction with \(2\sigma\) bounds plotted on data (a), clean function plotted with mean prediction (b), calibration curve (c), and predicted standard deviation on true standard deviation function (d) for Problem 1.

Figure 3 shows that the model is able to adequately predict the function and the overall probability distribution. The interesting aspect of the figure is panel (d): the model is able to predict the standard deviation without a label. However, this is fairly trivial data. Figure 4 shows the predictions and calibration curve for the more complex Problem 2.

Figure 4 Mean prediction with \(2\sigma\) bounds plotted on data (a), clean function plotted with mean prediction (b), calibration curve (c), and predicted standard deviation on true standard deviation function (d) for Problem 2.

For the more complex data, the model is not as accurate over all x. When x \(<6\), the model can accurately predict the mean and standard deviation. When x > 6, the standard deviation prediction no longer represents the uncertainty in the data but rather the model’s uncertainty in its own prediction. For this portion of panel (b), the mean prediction deviates from the true mean of the data, and the standard deviation in panel (d) consequently increases. Panel (c) shows that the model is still well-calibrated, representing both uncertainty in the data and uncertainty in the model’s predictions.

The NLPD loss function does not ensure model calibration. However, we show that it can be used—if properly tested—in model development to represent uncertainty in the data and uncertainty in the model’s predictions. Note: these models were trained on the entire dataset, and this is purely for demonstration. The thermospheric density models are developed with separate validation and independent test sets.

HASDM model

Using the best tuner models for MC dropout and direct probability distribution prediction, we assess the error and calibration statistics. Table 3 shows the mean absolute error and calibration error score for both techniques across the training, validation, and test sets.

Table 3 HASDM modeling results using MC dropout and direct probability prediction.

It is evident that the performance using both methods is very similar. Across all three sets, the mean absolute error and calibration error score do not deviate by more than 0.8% and 1.4%, respectively. The MC dropout model has better performance on the independent test set in terms of calibration. This is a desired quality, as the test data is not used for model development in any way. As the calibration error scores are composites of the scores for each output, the calibration curves are shown in Fig. 5 for a qualitative assessment.

Figure 5 The left and right columns show the MC dropout and direct probability calibration curves, respectively. The top, middle, and bottom rows are the calibration curves for the training, validation, and test sets, respectively.

Both techniques lead to slightly overestimated uncertainties on the training set for multiple outputs. Meanwhile, the remaining outputs are almost perfectly calibrated. On the validation set, each model has outputs with overestimated and underestimated uncertainties. Again, most of the outputs are very well-calibrated which is affirmed by the calibration error scores. For the test set, the direct probability prediction model tends to marginally underestimate the uncertainty while the MC dropout model provides reliable uncertainty estimates on virtually all model outputs. Table 4 shows the mean absolute error for both models across an array of solar and geomagnetic conditions. The entire dataset is used for this analysis as there are not enough samples in each bin using only the test set.

Table 4 Mean absolute error across global grid for HASDM-ML as a function of space weather conditions.

These errors tend to reiterate the results from Table 3. The direct probability model was more accurate on all three sets, and Table 4 shows that it is also more accurate across all 20 conditions considered. For a majority of the conditions, the difference is small (\(<1\)%). However, the high ap conditions show that the direct probability model makes considerable improvements. These error reductions from MC dropout range from 1.6 to 4.1%.

To further assess the uncertainty capabilities of the models, we attempt to visualize the calibration in the full-state (global density grids) to identify any spatial dependence in the reliability of the uncertainty estimates. First, the models are evaluated on the entire test set and the density mean and standard deviations are extracted. Using these statistics, the observed cumulative probability with a 90% prediction interval is computed for each spatial location. The resulting \(24\times 19\times 27\) array is used to determine how well calibrated the model is on independent data as a function of location. We show seven maps for each model (200, 300, ... , 800 km) in Fig. 6. Even though HASDM has a lateral spatial resolution of 24 longitude and 19 latitude segments, we interpolate the results to the polyhedral grid used in the EXTEMPLAR model for visualization purposes. This is done in the remainder of the manuscript.
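The per-location coverage computation is sketched below, assuming the test-set truth and the predicted means and standard deviations have been reshaped to the HASDM grid; the array and file names are placeholders.

```python
import numpy as np
from scipy.stats import norm

# Placeholder files: test-set truth and predictions in log10 density,
# each with shape (n_test, 24, 19, 27)
rho_true = np.load("test_density_true.npy")
rho_mu = np.load("test_density_mean.npy")
rho_sig = np.load("test_density_std.npy")

pi = 0.90
z = norm.ppf(0.5 + pi / 2.0)   # ~1.645 sigma half-width for a central 90% interval

# Fraction of test epochs whose true density falls inside the predicted interval,
# evaluated independently at every (longitude, latitude, altitude) cell
inside = np.abs(rho_true - rho_mu) <= z * rho_sig
observed_coverage = inside.mean(axis=0)   # shape (24, 19, 27); values near 0.90 are well-calibrated
```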

For reference, perfect calibration in Fig. 6 would be uniform green maps at all altitudes. This would convey that with a 90% prediction interval, the model’s predictions/uncertainty estimates contain 90% of true samples at all locations. While this is not the case, the results are still insightful. At 200 km, both models are underestimating the uncertainty by 10–15%. This could be a result of the relative variability as a function of altitude in the SET HASDM density database. The general trend of relative variability is that it increases with altitude, so the models may underpredict the standard deviation at low altitudes as a result, which indicates that the model has a false sense of confidence in that region. Both models have an average cumulative probability within 5% of the expected value at most of the altitudes shown in Fig. 6 with the best results at 600 km. At 700 and 800 km, both models begin to overestimate uncertainty, likely because they have the lowest confidence at those altitudes. An interesting outcome of this study is the lateral variability of the cumulative probability between the models. The MC dropout model (left) has more lateral variability, meaning the cumulative probability changes more as a function of longitude and latitude.

Figure 6 Observed cumulative probability maps for a 90% prediction interval using the MC dropout (left) and direct probability (right) models. The average observed cumulative probability is shown for each altitude in parentheses.

CHAMP model

After running tuners for both uncertainty techniques, we trained the best models on the entire training set. The models were chosen based on the lowest prediction error and best calibration scores on the validation set. Table 5 shows the mean absolute error and calibration error scores on the three sets.

Table 5 CHAMP modeling results using MC dropout and direct probability prediction.

Both models are well-generalized in terms of prediction accuracy. The range in error between sets for the MC dropout and direct probability models is 0.54% and 0.23%, respectively. Both models have higher calibration error scores on the training set but have similar scores on the validation and test sets. The two techniques provide similar results, with the only notable difference being the 1.91% higher calibration error score for the direct probability model on the training set. The calibration curves for both models are shown in Fig. 7.

Figure 7 Calibration curves for the training, validation, and test sets using MC dropout (a) and direct probability prediction (b). (c) and (d) show the difference between the observed and expected cumulative probability using MC dropout and direct probability prediction, respectively.

Both models are well-calibrated on all three sets. There is a tendency for both models to slightly overestimate uncertainty on the training set, which is more evident for the direct probability model. The differences between the calibration curves and the perfectly calibrated reference line (in black) are shown in panels (c) and (d). Panel (d) highlights the overestimation of uncertainty for the direct probability model on the training set; however, it never deviates by more than 9%. Both models tend to underestimate uncertainty on the validation and test sets for the larger prediction intervals, but the deviation from perfect calibration is no more than 2% for any PI. Due to the intrinsic differences between the datasets from which the CHAMP and HASDM models are developed, the following analyses differ from those for the HASDM model.

Global modeling with local measurements

The CHAMP models were developed with in-situ measurements, but we hypothesize that they should be able to learn the functional relationship of the combined inputs. Therefore, the model should be able to provide global outputs at any point in time. As a qualitative assessment, we show global maps at 400 km for the winter and summer solstices in Fig. 8 using the direct probability model. All subsequent global analyses are performed using this model. For this test, the solar drivers are all set to 120 sfu, SYM-H is set to 0 nT, both Poynting flux totals are set to 27 GW, and the time is set to 00:00 UTC.

Figure 8 Global density map with moderate solar activity, low geomagnetic activity, the altitude fixed to 400 km, and the time of day being 00:00 UTC for the winter solstice (a) and the summer solstice (b).

The diurnal structure is present in both panels, with the peak density in the southern hemisphere during the winter solstice and in the northern hemisphere during the summer solstice. This shows the model’s understanding of annual trends (Earth’s tilt). The general density level is higher during the winter solstice, but the relative variation between day and night is very similar. This is reaffirmed by the exospheric temperature distribution shown by Weimer et al.43 during the solstices. Additional global density maps at different altitudes can be found in Fig. S1 using baseline conditions. Furthermore, a global storm example is shown in Figs. S2 and S3.

Next, we look at the uncertainty levels for eight unique conditions of activity and time. These are all displayed in Table 6. Using these space weather and temporal inputs, the CHAMP model is evaluated at all 1620 polyhedral grid locations from 300 to 450 km in 1 km increments. The metric we use here is a normalized measure of model uncertainty: \(100 \cdot \sigma / \mu\), essentially providing the 1\(\sigma\) uncertainty as a percentage of the mean prediction. The resulting maps are averaged across each altitude to evaluate the model’s uncertainty for each condition as a function of altitude. Three aspects of model drivers are investigated: solar activity, geomagnetic activity, and temporal dependence. In Table 6, there are three solar activity levels, with all other drivers kept constant. There are also three geomagnetic cases: low and high geomagnetic activity with moderate solar activity, and high geomagnetic activity with high solar activity. We only look at two daily cases—00:00 and 12:00 UTC. We also look at the fall equinox, summer solstice, and winter solstice with moderate solar and low geomagnetic activity. The resulting altitude profiles are shown in Fig. 9.
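The altitude-profile computation can be sketched as follows; build_inputs is a hypothetical helper (passed in as a callable) that assembles the location, altitude, driver, and time inputs from Table 6, and the model is assumed to return a mean and standard deviation for each input.

```python
import numpy as np

def uncertainty_altitude_profile(model, build_inputs, grid_lon_lat, drivers, time_inputs,
                                 altitudes=np.arange(300, 451, 1)):
    """Evaluate one Table 6 condition over the 1620 polyhedral grid locations at each
    altitude and return the mean normalized uncertainty, 100*sigma/mu, per altitude.

    model        : callable returning (mu, sigma) arrays for a batch of inputs
    build_inputs : callable assembling location, altitude, driver, and time inputs
                   into the model's input format (implementation-specific)
    """
    profile = []
    for alt in altitudes:
        inputs = build_inputs(grid_lon_lat, alt, drivers, time_inputs)
        mu, sigma = model(inputs)
        profile.append(np.mean(100.0 * sigma / mu))
    return altitudes, np.array(profile)
```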

Table 6 CHAMP model inputs to study various conditions as a function of altitude.
Figure 9 Normalized uncertainty variations as a function of altitude for solar (a), geomagnetic (b), daily (c), and annual (d) cases. The drivers for each curve can be found in Table 6.

Panel (a) in Fig. 9 shows that the CHAMP model has low uncertainty in its lower altitude predictions for solar minimum (or low solar activity), which drastically increases with altitude. The opposite can be said for solar maximum. The moderate solar activity case results in the lowest uncertainties between 350 and 375 km and higher uncertainties above and below that range. This is all a result of CHAMP’s altitude history from 2002 to 2010: it started around 460 km during solar maximum and ended at 300 km during solar minimum. Therefore, the model has confident predictions in the altitude range where the satellite was located during the various phases of the solar cycle. If there were additional data from satellites at different altitudes over a longer time period, the model would likely be more confident over a larger altitude range.

In panel (b), we see the same general trends for Geo 1 and Geo 2 because they are evaluated using moderate solar activity. However, it is evident that the increase in geomagnetic activity results in up to 5% more uncertainty. The Geo 3 case is similar to Solar 3 (high solar activity) but again has increased uncertainty due to the storm conditions it represents. Panel (c) indicates that universal time has a low impact on the model uncertainty. In panel (d), the black line indicates the fall equinox, which is similar to the winter solstice; the winter solstice uncertainties deviate from the equinox uncertainties only at the highest altitudes. While the overall shape remains consistent, the uncertainties are highest for the summer solstice at all altitudes. The overall takeaway from Fig. 9 is that the shape of the model uncertainty altitude profile is most strongly affected by the solar activity level, while the day of year and geomagnetic activity tend to uniformly increase or decrease the uncertainty. These profiles would all likely change if the model were developed using additional satellite data.

Evaluation time comparison

We attempt to provide an equal comparison of the two methods in terms of computational complexity. To do so, each CHAMP model is evaluated on either 8640 samples (1 week) or 86,400 samples (10 weeks). The direct probability prediction model sees each input once and provides the mean and standard deviation. These are used to sample a Gaussian distribution 1000 times to obtain probabilistic predictions for density over the given window.

For MC dropout, we cannot pass 1 week of inputs to the model stacked 1000 times (as is done for HASDM); there is not enough memory on an NVIDIA GeForce RTX 2080 Ti graphics processing unit (GPU)—11 GB—to perform this evaluation. Therefore, we pass the inputs repeated 100 times in 10 chunks to obtain the 1000 predictions. When evaluating over 10 weeks, we must reduce to 10 repeated inputs in 100 chunks. In Table 7, we show the evaluation times on both GPU and CPU for both methods over the two durations. Note: when running MC dropout on CPU, we use 100 repeated inputs for both durations. The batch size for all predictions is 2\(^{17}\), or 131,072. The sizes of the MC dropout and direct probability models are 233.3 kB and 21.9 MB, respectively.
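A sketch of the chunked MC dropout evaluation is shown below; it assumes a Keras model whose dropout layers are reactivated by calling it with training=True, and the chunk size would be reduced (e.g., from 100 to 10) for the longer 10-week window.

```python
import numpy as np

def mc_dropout_predict(model, X, n_draws=1000, chunk=100):
    """Obtain n_draws stochastic predictions per input without exhausting GPU
    memory by splitting the repeated copies into chunks (e.g., 10 passes of 100)."""
    draws = []
    for _ in range(n_draws // chunk):
        X_rep = np.repeat(X[:, None, :], chunk, axis=1)     # (n, chunk, n_inp)
        draws.append(model(X_rep, training=True).numpy())   # dropout stays active
    draws = np.concatenate(draws, axis=1)                    # (n, n_draws, n_out)
    return draws.mean(axis=1), draws.std(axis=1)
```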

Table 7 Run time to obtain 1000 probabilistic predictions from each model using GPU and CPU in seconds.

The run times are unique to these specific models. The size of each model plays a role in its run time, and the sizes of these models are a result of the tuner. The MC dropout model is approximately 100 times smaller, but the increase in required model prediction calls results in significantly longer run times. The direct probability method, for this particular problem, is anywhere from 3 to 30 times faster depending on the number of samples and whether the GPU or CPU is used.

Discussion

In this work, we leverage the NLPD loss function to develop thermospheric density models using (1) MC dropout and (2) direct probability prediction. These two uncertainty techniques were used to create both a model based on the PCA coefficients of the SET HASDM density database and a model based on localized accelerometer-derived density estimates from CHAMP. Using two toy problems, we showed that the NLPD loss function can be used to create an ML model with calibrated uncertainty estimates relative to uncertainty in the model, uncertainty in the data, or both. For the HASDM database, the MC dropout and direct probability distribution prediction models had similar metrics in terms of error and calibration. Furthermore, the calibration curves for the PCA coefficients were nearly identical. By looking at the density calibration of both HASDM models, we found that they were well-calibrated at mid-altitudes and that there was more lateral variability in the calibration of the MC dropout model.

The CHAMP models also had similar performance and were both well-generalized. We tested the CHAMP model’s global prediction capabilities by generating baseline maps during the winter and summer solstices to ensure that physical global trends are captured. This showed that the model was able to emulate the effect of Earth’s tilt. We also performed global evaluations for eight unique conditions to determine the altitude dependence of the model uncertainty. The altitude profiles showed that the minimum and maximum 1\(\sigma\) uncertainties were 10% and 28% of the mean predictions, respectively. Solar activity was most influential in determining the profiles’ shapes, while geomagnetic activity and day of year tended to provide uniform changes in the uncertainty. In general, the MC dropout and direct probability methods were shown to have similar performance for thermospheric density modeling applications. However, there are pros and cons to both methods, and careful consideration is required when deciding on a UQ method for space weather models. These are highlighted in Table 8.

Table 8 Pros and cons for MC dropout and direct probability distribution prediction.

The main disadvantage for the direct probability method is the requirement to sample from a Gaussian distribution to get probabilistic density variations. The drawback to MC dropout is its higher computational cost. In terms of density modeling, both techniques have prompt evaluation times. Relative to one another, we show that the direct probability models can be evaluated much faster. The size of the training data (in both number of samples and dimensionality) is also important to consider. With MC dropout, GPU memory can constrain modeling efforts if the dataset is too large. It can also require additional steps for prediction. In this work, MC dropout did not inhibit model development for the smaller HASDM PCA data. However, it did add numerous considerations when developing and evaluating the CHAMP model. In general, the uncertainty estimation capabilities may be improved through modifications to the loss function to either: (a) add higher order moments or (b) obtain non-Gaussian estimates.

All of the preceding results show that, for thermospheric density applications, these two techniques can be used to obtain an accurate model with reliable uncertainty estimates. There are other methods that can be used in space weather applications, such as GP regression and ensemble modeling, but this work provides a sufficient starting point. Other final considerations concern orthogonality and applicability. For a multi-output regression model (e.g. the HASDM models), the outputs must be orthogonal; this both prevents collinearity and satisfies the NLPD assumption of uncorrelated outputs. The CHAMP data only span an altitude range of 300–460 km, so any predictions outside of this range may be unreliable. To combat this, density estimates from other satellites can be added to increase the altitude coverage and provide the model with more data to learn from, as discussed in “Satellite accelerometer density estimates”.