Introduction

Corrosion of industrial machinery, bridges, engineering equipment, vehicles, and structures under different corrosive environments is a major concern for food processing and chemical industries, power plants, maritime and air transport. It is reported that in the United States 6.2% of the gross domestic product (GDP) is spent to replace engineering equipment in industries due to corrosion1. The development of corrosion-resistant materials and coatings capable of extending the lifespan of equipment and structures can yield substantial economic benefits. Birbilis et al. and Yeh et al. suggested that an emerging class of materials, high entropy alloys (HEAs) and multi-principal element alloys (MPEAs), are being considered as preferred candidates for corrosion-resistant applications compared to conventional alloys, such as stainless steels, Ni-based alloys, and Ti-based alloys2,3. MPEAs consist of 4 or more base elements that typically stabilize a single-phase (random) solid-solution crystallographic phase. Certain MPEA compositions have exhibited excellent hardness, exceptional wear properties, as well as high-temperature strength relative to traditional alloys2,4,5,6,7,8.

In theory, MPEAs composed of 4 or more elements allow for extensive compositional freedom with perhaps greater property control. Due to the high entropy of mixing2, which increases roughly as natural log N with number (N) of elements, the tendency to order or segregate is thus greatly reduced in MPEAs. Such alloys are then less likely to form ordered structures (unless a large enthalpy reduction occurs from a new elemental pair interaction, a pair probability that increases as N2). Hence, MPEAs tend to form disordered solid solutions predominantly of face-centered cubic (FCC), body centered cubic (BCC), or hexagonal close-packed (HCP) structures9,10,11, instead of more complex intermetallic phases. The locally disordered chemical environment within these solid solutions has been theorized to lead to the excellent corrosion properties of MPEAs5. For instance, Qiu et al.12 reported that MPEAs containing Co, Cr, and Ni offer better corrosion resistance in the presence of NaCl solution compared to austenitic stainless steel.

Corrosion resistance of MPEAs depends on the elemental composition of the alloys and the corresponding corrosive environments9,13. The number of possible elemental compositions is much higher in MPEAs than that of traditional metallic alloys and the search can involve about a trillion combinations as we move away from the vertices of the multi-dimensional composition space toward the center14. Moreover, different corrosive environments, such as NaCl, HCl, and H2SO4 have different influences on MPEAs and it is extremely challenging to select the appropriate composition to optimize corrosion resistance for different environments by trial-and-error experiment or intuition15. Additionally, long-term atmospheric exposure testing for MPEAs is not commonly practiced in industries and hence such data are not readily available thereby limiting the applications for MPEAs in harsh aqueous environments3. Simulations and data informatics related to metallic alloys can be a potential tool to predict the corrosion resistance of MPEAs under different environments. Techniques like density functional theory (DFT)16, molecular dynamics (MD), and thermodynamic modeling have been devoted to study phase stability, solidification, and crystallization kinetics of MPEAs17,18,19,20,21,22,23. These methods are computationally expensive and time-consuming even for a single composition and hence less readily usable on a large scale to narrow down the search space.

Materials informatics offers a faster and less expensive path to develop new materials by identifying correlations and trends existing in the dataset, incorporating the underlying physics, predicting the properties of new materials, and guiding the next set of experiments24. Some examples of applications include predictions of crystal structures25,26, physical properties27, and corrosion rates28. The approach uses ML to construct a surrogate function (f) that relates the target property (Y) to the material features or descriptors (X). The depth up to which the descriptors capture the physical and chemical signatures of the materials combined with the mathematical accuracy of the models determine the robustness of the model. Therefore, for any given target property, the optimized combination of f and X needs to be selected. For a system with M models and N materials descriptors, there are M × (2N − 1) potential combinations possible out of which only one would ideally have the desired accuracy. Material descriptors can be based on the composition, microstructure, processing conditions, and other thermodynamic parameters which necessitates the construction of a large pool of descriptors29. For instance, 10 descriptors and 5 models can produce 5115 different combinations out of which one will be selected based on minimum prediction error. Hence, the primary goal becomes choosing the best combination of model and descriptors to predict the target property of desired alloys.

Domain knowledge plays a crucial role in down selecting the descriptors30,31. For example, Rickman et al.7 used Young’s modulus asymmetry (representing stiffness) and mean melting temperature (proportional to bond strengths) as descriptors in conjunction with thermodynamic descriptors to generate compositions with high hardness. Kim et al.27 used a DFT dataset (from the Materials Project database32) to construct a model using the average values of physical parameters (e.g., cohesive energy, density) in conjunction with thermodynamic parameters to predict the bulk and shear modulus. When the descriptor space becomes too large, principal component analysis can be used to reduce the descriptor space by using certain mathematical transformational laws but this operation may lack interpretability33.

In the present study, we have utilized a gradient boost ML model coupled with a 2-stage feature down selection process to build a robust model for corrosion-rate prediction in MPEAs. This optimization process is based on the best performing combination of descriptors to predict the corrosion rate of MPEAs under different corrosive environments. The methodology and results will be explained further in the subsequent sections. We conclude by reasoning why certain features are the most impactful in determining corrosion rates and also carry out detailed data analysis to show the dependence of corrosion rate on those descriptors.

Results and discussion

According to the “no-free-lunch” theorem34, there does not exist a perfect model for a given problem. There always exists room for improvement in accuracy by adopting different combinations of models and features. In the present work, we have chosen the best model for our problem as the Gradient Boosting Regressor (GBR) model. To choose the best features out of the pool of 30 features, we adopt a 2-stage down selection process as shown in Fig. 1. The first stage down selects top 13 out of the 30 features as obtained from the feature importance function. The second stage of down selection takes all possible combinations of 1, 2, 3, …, 13 features out of the 13 features from stage 1 and evaluates the mean squared error (MSE). The best number and combination of features are thus selected.

Fig. 1: Schematic diagram of the 2-stage feature down-selection process used in this work to select 5 features from a pool of 30.
figure 1

The 2-stage features down selection process adopted in this work to down select the features from a pool of 30 to a final 5.

Features down selection stage 1

To reduce the complexity of the model, the number of descriptors can be limited to a certain value without compromising the performance of the model or sometimes even improving it. In fact, some reports claim that the model degrades the results when there are too many variables because the redundant feature variables interfere with the ML model35. Thus, it is essential to down select the most significant and relevant features to construct a robust model. Feature importance is determined by assigning a value to input features based on how relevant they are in predicting a target property. The feature importance was extracted from the model using the “feature_importances” function built into the Scikit-learn python module36. Gradient boosting models work by iteratively fitting several base models such as decision trees to data. The GBR fits one tree, then fits another tree to the residual error of the first tree. The GBR then fits a new tree to the residual error of the most recent tree and so on. Feature importance is computed as the mean and standard deviation of accumulation of the decrease in error with each progressing tree. Such an approach has been used by several other works in the past to eliminate redundant features from a larger feature pool37,38. The feature importance of all 30 features has been shown in Supplementary Fig. 1. With the objective of downsizing the feature pool to reduce the complexity of the model, the top 13 most important features out of the total 30 have been down selected. These top 13 descriptors with the relative importance ranging from 2 to 37% and corrosion-rate prediction by GBR model are provided in Fig. 2 and Table 1, respectively.

Fig. 2: The relative importance of the top 13 features to predict the corrosion rate.
figure 2

This down selected pool will be subsequently used for stage 2 feature down selection. The second stage may result in a reshuffling of the rankings of the top features.

Table 1 Validation dataset with details of the corrosive environment, corresponding observed experimental corrosion rate8,9 and predicted corrosion rate by considering the top 13 descriptors and the optimized descriptors.

From feature importance, it is clear that the chemical environments during the corrosion reaction, such as pH of the reaction medium and presence of halide ion, have the maximum impact on the corrosion process. The pH of the environment plays a decisive role in the corrosion rate. To understand the effect of pH in greater detail, we take the example of a pure metal in an acidic environment. When a metal is submerged in a deaerated acid solution, the electrochemical reaction can be divided into two or more partial oxidation and reduction reactions as follows:

(i) M → Mn++ne−1, which is the oxidation(anodic) reaction and

(ii) nH++ne−1 → (n/2)H2 (gas) which is the reduction (cathodic) reaction.

To find the propensity of corrosion in the given solution, the potentials of the anodic (\(E_{M^{ + + }/M}\)) and cathodic reaction (\(E_{H^{ + + }/\frac{1}{2}H_2}\)) should be known. The cell potential is formulated39 as

$$\begin{array}{ll}{{\Delta }}E = \left( {E_{H^{ + + }/\frac{1}{2}H_2} - E_{M^{ + + }/M}} \right) = \left( {E_{H^{ + + }/\frac{1}{2}H_2}^0 - E_{M^{ + + }/M}^0} \right) \\- \frac{{2.303RT}}{{nF}}\left( {2 \,\times\, pH + \log \left[ {M^{ + + }} \right]} \right),\end{array}$$
(1)

where \({{\Delta }}E\) should be positive to make \({{\Delta }}G\) negative so that the reaction is spontaneous according to the equation \({{\Delta }}G = - nFE\). According to the above formulation under acidic conditions, a lower pH (acidic environment) will increase the \({{\Delta }}E\) and thereby enhance the corrosion rate. Therefore, the pH of the aqueous environment plays a crucial role in determining the corrosion rate.

Notably, this equation does not consider the formation of the passive film due to the presence of certain metals that form the protective oxide layer. Hence, this situation is considered by including the composition of metals like Cr, Al, Cu, Ti, Ni, Sn, and Mo13 as separate features that will indirectly incorporate the effects of passive layer formation into the model. The addition of nickel to MPEAs has been investigated before. Ni was added to Al2CrFeCoCuTiNix (x = 0, 0.5, 1, 1.5, and 2) alloy system in 3.5 wt. % NaCl solution and 1 M NaOH solution. The composition with x = 1 was found to have the least corrosion current density in both salt and alkaline solution, implying that there is an optimum Ni content below and above which Ni has a detrimental impact on corrosion40. Ni has also been shown to be effective in resisting corrosion in nickel-aluminum bronze alloys, where increasing the concentration of Ni improves the corrosion resistance of the alloy41.

Another important descriptor is the difference in electronegativity between the highest and lowest reduction potential of a specific alloy. A larger difference in reduction potential enhances the driving force for galvanic corrosion. Corrosion rate further increases with the presence of halide ion, such as chloride ion (Cl−1) in the reaction medium. Cl−1 ions can penetrate the passive film on the alloy surface and hence increase the pitting corrosion rate. Chou et al. reported that pitting potential was proportional to the logarithm of the chloride concentrations at a constant temperature42 and the potential will increase with increasing temperature and hence increase the pitting corrosion rate. It is important to mention that corrosion rate in presence of Cl−1 is smaller for MPEAs compared to that for stainless steel. For instance, Chen et al. compared CuNiAlCoCrFeSi MPEA and compositionally similar 304SS with near-equal Cr content in deaerated 1 M NaCl solution at room temperature43. The MPEA was shown to be more noble and have a lower corrosion current density than 304SS in NaCl solution. Due to the presence of nearly amorphous structure, which had negligible or no grain boundaries and hence acted as passive film, the MPEA showed significant resistance to uniform corrosion. The passive region of the MPEA is smaller than that of 304SS and when the MPEA passive film is ruptured the alloy is more susceptible to nucleation and growth than 304SS.

We emphasize that the feature importance obtained from the ML model represents the ranking of the contribution of each feature to the mathematical expression of the target property when all the features are used. Hence, feature importance is basically the weight of each feature when that feature is used as a variable in the mathematical expression of corrosion rate formulated by the ML model. The top 13 features represent the features that have the highest weight. Since the formulation of this expression is based on statistics, it is only helpful in eliminating the redundant features. The order of the topmost important features may vary if the training dataset is changed. Hence, at the current stage with the limited dataset, we may not be able to conclusively determine the reasoning of the order of the top important features. But the approach may be employed to downsize the feature pool by eliminating the redundant features35,37.

The choice of alloying elements is also an important descriptor. For instance, aluminum (Al) is used to synthesize lower density MPEAs because of light weight. In addition to that, Al helps to improve the mechanical strength of the MPEAs44,45,46,47,48,49,50. However, Al present in the Al based MPEAs is less noble (lower in the galvanic series) compared to the other alloying elements, such as Fe, Ni, Cr, Co, and Ti that are used to synthesize MPEAs51. Therefore, it is reasonable that in the presence of aqueous corrosive environments, Al atoms may be preferentially released from MPEA solid solution compared to other alloying elements and hence introduce dealloying. It is reported that depassivated Al-containing MPEAs have a higher corrosion rate and the rate increases with increasing Al content in the presence of a corrosion environment, which promotes dissolution. Interestingly, Al is beneficial when used as thin film for surface passivation as reported in the case of Ni-based corrosion-resistant alloys52. It is also reported that MPEAs containing Cr, Fe, and Ni showed good corrosion resistance in both NaCl and H2SO4 solution as compared to steel. Hsu et al. reported that due to the presence of Cr and Ni, CoCrFeNi MPEA showed better corrosion resistance as compared to 304LSS53.

From Fig. 2, it is observed that thermodynamic parameters, such as mixing entropy \((\Delta S_{{\mathrm{mix}}})\), can be important for corrosion-rate prediction. The equation for \(\Delta S_{{\mathrm{mix}}}\) is given below:

$$\Delta S_{{\mathrm{mix}}} = - R\mathop {\sum}\limits_{i = 1}^n {(C_ilnC_i)}$$
(2)

Where, \(C_i\) is the composition of element i in a specific MPEA, n is the total number of individual elements present on the MPEA and R = 8.314 J·mol−1·K−1 is the molar gas constant, However, mixing entropy does not take into consideration the selection of elements; it only considers the number and fraction of elements. For example, CoCrFeNi and CoCuFeNi have the same mixing entropy, but the former exhibits better corrosion resistance due to the formation of the stable Cr2O3 film54. Hence, mixing entropy should not be used in isolation to make predictions about the corrosion rate.

Similarly, the difference in atomic radii (or \(\delta\)) also appears as one of the top 13 most important features. \(\delta\) is the representative of the strain in the lattice. Higher strain in the alloy lattice can introduce larger lattice distortion and promote the formation of secondary phases55 which may be detrimental to the mechanical properties. But in a given set of single-phase alloys, the ones with a larger \(\delta\) and hence larger strain are posited to have retarded diffusion thus making it difficult for Cl or O atoms to migrate inside the alloy matrix. Hence the alloy is protected from corrosion. Similar to \(\delta\), the Δa has an impact on lattice distortion and results in uniform lattice distortion56. Hence a higher difference in lattice constant might retard the diffusion and protect the alloy from corrosion.

As we are not able to draw concrete conclusions from our current list of top features, we adopt another ingenious technique to further downsize our feature list. This technique is explained in the following section and is used to build the final model for corrosion-rate prediction.

Features down selection stage 2

There is a tradeoff between trying to improve the accuracy of a ML model with a large number of descriptors and having a model that is easier to interpret with a smaller set of descriptors. Thus, it is necessary to track the model performance as a function of the number and combination of descriptors. To reduce the complexity of the model, we have performed descriptor set optimization by finding the optimum number and combination of descriptors that produce the minimum MSE.

In Fig. 3 (d), all combinations (13C1, 13C2, 13C3, …, 13C13) of descriptors from the top 13 descriptors obtained in stage 1 down selection have been fitted to the GBR model to check dependency of performance on the number of descriptors. Such an approach has been successfully used in ref. 57 to increase the accuracy of the ML model from the case where a larger number of features is used. We plot the MSE of the GBR model as a function of the number and combination of descriptors shown in Fig. 4.

Fig. 3: Flowchart of the strategy used to obtain the best combination of ML model and descriptors for predicting the corrosion rates in MPEAs.
figure 3

a Composition and corrosion-rate dataset of 142 alloys was collected from the literature. b The descriptor space contained a total of 30 descriptors but down selected to 13 descriptors in stage 1 based on feature importance. c A ML model was chosen and hyperparameter optimization done to identify the best parameters with the lowest MSE. d Stage 2: an optimization technique down selected the best number and combination of descriptors from a total of 13 listed after stage 1. e The most accurate combination of the ML model and the descriptor set is then identified.

Fig. 4: MSE of the GBR model versus number of descriptors.
figure 4

Data are from numerous descriptor combinations from 1 to 13 for the absolute difference of predictions and experimental values, at least 19 out of 25 test data are within 2.5 times that of the experimental values. The lowest points represent the best regressor for a given number of descriptors. y-axis is limited for clarity to only those data that are <0.09. In actuality, there are 13Cn values for any given number of descriptors (n), as shown for n = 3 a total of 286. MSE does not show a trend with the number of descriptors and is minimum for a set of 5 descriptors (listed in figure).

For a given number of descriptors (n), the best combination (lowest MSE out of 13Cn) is shown in Fig. 4 (solid-black frontier). The MSE does not show any trend with the number of descriptors and is minimum for a set of 5 descriptors, namely, pH and halide molarity of the medium, composition of element with minimum reduction potential, Δa, and average reduction potential.

At minimum marked in Fig. 4, the model makes the best predictions closest to the experimental values. The MSE at that point is a minimum (i.e., 0.055 mm/year) out of all possible cases. The predictions at this point (by optimization and by top 13 descriptors) and experimental corrosion rate are listed in Table 1. We can see in Table 1 that our ML model predicts the corrosion rate of different MPEAs in a variety of corrosive environments. However, the accuracy of the predictions increased when we applied optimized descriptors corresponding to the point of minimum error in Fig. 4. From this result, we can say that if the severity of the corrosion environment (pH of medium and halide molarity) is fixed then only two atomic descriptors (the Δa and average reduction potential), and one compositional descriptor (the composition of the element with minimum reduction potential) will be needed to predict the corrosion rate, saving considerable computational and experimental time for alloy down selection.

Furthermore, we have analyzed the effect of individual descriptors on corrosion rate to assign some physical significance to the descriptor. We have plotted in Fig. 5 (panel 1 and 2, respectively) the corrosion rate of 12 MPEA datasets from our training set versus Δa and average reduction potential. (Note: for each dataset, the corrosion environment is identical.) Details of the 12 datasets are presented in Supplementary Table 1. From Fig. 5 (panel 1), we see that only 6 datasets (marked with a star) out of 12 alloy datasets show a trend where the corrosion rate tends to increase with increasing Δa. A larger Δa may introduce higher strain energy in the system and enhance the formation of secondary phases leading to a higher corrosion rate. However, this trend is not definitive for all the data, so a clear conclusion cannot be drawn, especially given the limited dataset.

Fig. 5: Trends of corrosion rate with respect to 2 of the 5 down selected features.
figure 5

The corrosion rate of 12 MPEA datasets as a function of the difference in lattice constant and average reduction potential (eV) are presented in panel 1, (a–l) and panel 2, (m–x), respectively. The star mark in each panel indicates the datasets which are in agreement with our reasoning. Since the prediction of the ML model is based on the combined effect of the 5 down select features, the corrosion rate does not show a clear trend in some plots when plotted individually with the selected features. Details of the datasets are provided in Supplementary Table 1.

Yet, from a material science point of view, only certain elements are reactive on surfaces of MPEAs, i.e., that segregate to a surface and form protective layers (oxide). Hence, within the datasets we find a trend of increasing corrosion rate with increasing Δa driven by the presence of a few specific elements. In Fig. 5 (a-l) (panel 1), the dataset 1, dataset 5 and dataset 7 show a trend in line with our hypothesis. As suspected, these datasets and their compositions (refer to Supplementary Table 2) reveal increasing Al in dataset 1 increases Δa and also increases the corrosion rate; this follows as Al may form a passive layer but the oxide film formed is porous9 and fails to resist corrosion13. Similarly, in datasets 5 and 7, increasing Ti decreases the Δa and also the corrosion rate, which follows because Ti is known to form a protective TiO2 film anticipated to improve the corrosion resistance of the alloy58. Therefore, we reiterate that it is difficult to draw a definitive conclusion regarding the role of Δa based on the limited dataset. Rather, it is the presence of certain protective reactive metals that decides corrosion resistance but directly affect a change in Δa of the MPEA.

From Fig. 5 (m-x) (panel 2), no clear trend is observed where lowering the average reduction potential has a negative impact on the corrosion rate of the MPEAs. Only 5 datasets (marked with a star) out of 12 show such a trend. Datasets 4, 5 and 7 show the opposite trend. Datasets 8 and 9 show a lot of scattering. The prediction of the ML model appears to be based on the synergistic effect of multiple descriptors, but the corrosion rate does not show a clear trend when considered individually with this selected descriptor.

To assess the improvement in model performance, we compare the predictions based on the top 13 descriptors and from the descriptor optimization (5 descriptors), as given in Fig. 6. We have observed 13 fairly accurate predictions (same order of magnitude as that of experiments) out of a validation set of 25 when we considered the top 13 descriptors. However, when we consider 5 optimized descriptors, namely, pH and halide molarity of the medium, composition of the element with minimum reduction potential, Δa, and average reduction potential, we achieve better predictions (19 out of 25 for the validation set were fairly accurate).

Fig. 6: Comparison between experimental and ML predicted corrosion rate.
figure 6

a Using top 13 features from stage 1 down selection, and b using 5 optimized features from stage 2 down selection. The bad prediction for each case is indicated by blue arrows.

This study shows both the promise and perils of ML when applied to a complex phenomenon like corrosion of alloys. ML can help select descriptors and build a model that can reasonably predict trends in the corrosion rate of MPEAs in different environments to help narrow down the alloy search space. However, there remain large quantitative differences between the experimental corrosion rate and the rate predicted by this state-of-the-art ML model, which is not surprising given the orders of magnitude difference in corrosion rates in our training dataset. The data utilized in this study were collected from various sources in the literature. They have some inherent irreducible uncertainty and noise that makes it difficult to utilize this dataset to build models for a chemical space that is not part of the original training dataset. Sparse data and poor data quality are well-known limitations in materials science59. There is a need to collect high-quality corrosion data with metadata to understand the effect of descriptors, such as pH, halide ions, mixing entropy, and strain energy, on the corrosion behavior of MPEAs.

Our goal was to demonstrate the use and limitations of ML as a rapid-screening tool for corrosion resistance in the vast space of MPEAs. We developed a model that predicts corrosion rate (mm/year) within a relaxed threshold and significantly accelerates the identification of MPEAs with improved corrosion resistance. From our modeling exercise, we found that for most of the materials our predicted corrosion rates (mm/year) were within 2.5 times of experimental values. A larger dataset with higher quality data is needed to find more significant physical relationships between individual descriptors and the corrosion rate.

In summary, we have developed a domain-science-informed ML-based approach to predict the corrosion rate of MPEAs. We used the GBR model to predict the corrosion rate under different corrosive environments. We modified the predictive model by developing a count-based selection model to get better accuracy compared to conventional mean-squared error measurement. Our descriptor-optimization strategy predicts corrosion rates reasonably well by down selecting best descriptors—here five, namely, two environmental descriptors (pH of the medium and halide molarity), two atomic descriptors (Δa and average reduction potential), and one elemental descriptor (composition of the element with minimum reduction potential). Our main thrust was to examine the utility of ML models based on limited and noisy published data to predict corrosion rates for MPEAs. The model developed can search alloy space without requiring synthesis and testing of many alloys by trial-and-error in the laboratory. As such, it can dramatically reduce the time and cost involved in narrowing down the vast composition space. Our model offers a good prediction of the corrosion rate (19 out of 25 MPEAs in the validation dataset). However, it is challenging to rationalize the effect of individual descriptors in terms of materials science due to the limited training dataset of variable quality. The accuracy of the predictions could be further improved by increasing the training dataset size and data quality. Data sharing among researchers with standardized metadata is needed to realize this goal. The methods described in this work could also be applied to select degradation-resistant alloys based on the prediction of properties, such as Young’s modulus, yield strength, and hardness, for a variety of engineering applications.

Methods

Data collection

We have collected corrosion data of 142 MPEAs available in the literature from9,13. The results are from potentio-dynamic-polarization tests in acid, base, and salt solutions at room or higher temperatures. As the data includes corrosion current densities obtained from samples at the experimental scale, we assume that our model will predict localized corrosion at the point of active reactivity60. Here, we only screen single-phase disordered MPEAs, with the assumption that at the length scale of the corrosion process, the alloy surface will be corroded homogeneously at statistically distributed points on the surface with no segregation of a particular element in the microstructure. To quantify the extent of the corrosion, we converted the experimental current density to corrosion rate in mm/year. In a real application, the prediction of corrosion is made extremely challenging by the formation of a passivating film, the presence of impurities in the environment that can enhance or inhibit corrosion, and preferential dissolution of an element in the alloy61. Our aim is to offer a tool for the quick screening of MPEAs based on susceptibility to corrosion, just as a Pourbaix diagram62 is used to assess the corrosion susceptibility of a metal. We show that despite the complexity of the corrosion process, variable quality of data drawn from different sources, and the sparseness of the dataset, actionable insights can be extracted about corrosion. The caveat is that this approach is based on uniform corrosion and may not adequately cover situations where the alloy corrodes incongruently. From an engineering perspective, it is anticipated that this machine learning aided corrosion-rate prediction can give a fair idea about the suitability of a set of alloys for a given application. The corrosion rate was obtained by converting the corrosion current density into corrosion rate (rcorr) by using the relation:

$${{{\mathrm{Corrosion}}}}\,{{{\mathrm{rate}}}}\,\left( {{{{\mathrm{mm}}}}/{{{\mathrm{year}}}}} \right) = 3.27 \times 10^{ - 3} \times \frac{{i_{{\mathrm{corr}}}}}{\rho } \times EW,$$
(3)

where ρ is the density in g/cc, icorr is the corrosion current density (in µA/cm2) and EW is the equivalent weight of the alloy given as:

$$EW = \left( {{\sum} {\frac{{n_if_i}}{{W_i}}} } \right)^{ - 1}$$
(4)

where ni is the valency of the i-th element of the alloy, fi is the mass fraction of the i-th element, and Wi is the atomic weight of the i-th element in the alloy. To minimize the skew of 4–5 orders of magnitude in the target property i.e., corrosion rate (mm/year), the natural log of corrosion rates have been taken and set as target property. Out of the 142 MPEAs, we have included 117 in the training set, and the remaining 25 in the validation set. Once the model makes a prediction, the result is calculated as the antilog of the Ln (rcorr), which is in mm/year and can be directly compared with the test data rcorr values.

Materials descriptors space construction

One of the significant pillars of a ML algorithm is the set of descriptors used in constructing the algorithm. Since the degree of relevance of any given descriptor is unknown in the beginning, all potentially relevant descriptors even mildly affecting the target property should be listed while constructing the dataset. All descriptors based on thermodynamic, atomic, physical and chemical parameters should be considered. The thermodynamic parameters like mixing entropy (Ω)63, a geometric parameter (λ) based on mixing entropy and atomic size mismatch64, and mixing enthalpy control the process of MPEA phase formation. These parameters have been widely used for phase classification in MPEAs. Atomic parameters like electronegativity mismatch (χ) and atomic size mismatch (δr) have previously been used for classifying phases and predicting mechanical properties7. Physical parameters including melting point, compression modulus, and density have been used to predict stable phases57 and hardness7. A rationale behind using melting point as a descriptor is that it represents the bond strength65. Additionally, previous work has also used chemical compositions to characterize the effect of addition of certain elements on specific target properties like oxidation behavior and corrosion rates66. The present work aims to predict the corrosion rate of general MPEAs (mm/year). For that purpose, we have included previously used parameters for corrosion-rate prediction and additional relevant parameters that have not been used to predict corrosion rate previously in the literature. A complete list of the 30 descriptors is given in Supplementary Table 2. The additional parameters include the maximum and minimum of the reduction potentials of all elements in the MPEA, and atomic percent of Cr, Al, Cu, Ti, Ni, Sn, and Mo. The elements Cr, Al, and Cu can form a protective oxide layer during corrosion and Mo triggers the Cr to form a protective layer67, which makes them significant for consideration as descriptors in the ML algorithm. The presence of halide ion breaks the passive layer and reduces corrosion resistance due to halide attack68. Hence, concentration of halide in the corrosion test solution has been included as a descriptor. The severity of the environment characterized by the pH has been previously shown to have a significant influence on the pitting morphology68. The corrosion rate is higher in acidic or basic solutions relative to the rate in solutions with neutral pH69.

We calculated the 30 descriptor values for all 142 alloys in our dataset and we set the rcorr as the target property. Since our descriptors vary in magnitude, we scaled down those descriptors that have a larger magnitude relative to other descriptors. More specifically, we scaled down \(\Delta T_m\) by 100 and both Ω and Tm by 1000, and we scaled up \({{\Delta }}\chi\) and δ by 100. All descriptor values then ranged from 0 ~|10|. This approach is slightly different from regular normalization in the sense that only biasing is performed on the descriptors by multiplying with an appropriate bias value so scaled data are the same order of values for each descriptor. This minimizes the bias from one descriptor to another as seen by the ML model. In our dataset, some descriptors like Ω have values that range from 960 to 45800, with most values lying towards the 103 order of magnitude. In such a case a Min-Max normalization or a Z-score normalization would reduce majority of the values to the order of 10−2 for the descriptor Ω, while keep other descriptors in the order of 10−1. Hence to avoid a wide variation in the values of the descriptors, such a scaling technique has been adopted. This ensures reducing the training time by starting the training process for each descriptor within the same timescale70.

Machine-learning model

In our initial trials, we employed the least absolute shrinkage and selection operator (LASSO), random forest (RF), support vector machine (SVM) and the GBR to predict the corrosion rate. At this stage, we selected the algorithm with the maximum R2 value and hence we present the R2 values obtained from all models in Table 2. Ten-fold cross-validation was employed during the training. The GBR shows the highest training R2 value of 0.93 and a testing R2 value of 0.61 and is therefore selected as the final model for the prediction of corrosion rate.

Table 2 The training and testing R2 values obtained from the various models.

GBR is an ensemble type algorithm. Ensemble methods are meta-algorithms that combine several base models to produce a better predictive model. To decrease variance, a bagging ensemble method can be used. To decrease bias, a boosting ensemble method can be used. Boosting was first introduced by Robert Schapire71. Gradient Boosting, Ada Boost and XGBoost all fall under boosting ensemble methods. A boosting method converts weak learners to strong ones. Usually, decision stumps are used as the base weak learners, but this is not always the case. Most boosting methods build models in stage-wise fashion and they generalize the model by optimizing an arbitrary differentiable loss function.

GBR is a machine-learning technique based on boosting method which can be used for both regression and classification31,38. The initial idea of gradient boosting was observed by Leo Breiman72. Friedman developed the explicit regression gradient boosting algorithm73. The algorithm for GBR is given below.

GBR Algorithm

Input: Training Data \(\{ \left( {x_i,y_i} \right.\} _{i = 1}^n\),

Differential Loss Function \(L\left( {y,f\left( x \right)} \right),\)

Number of Iteration M

  1. 1.

    Initialize \(f_0\left( x \right) = argmin_\gamma \mathop {\sum}\nolimits_{i = 1}^N {L\left( {y_i,\gamma } \right).}\)

  2. 2.

    For m = 1 to M:

  1. (a)

    For i = 1, 2, ……., N compute\(r_{im} = - \left[ {\frac{{\delta L\left( {y_i,f\left( {x_i} \right)} \right)}}{{\delta f\left( {x_i} \right)}}} \right]_{f = f_{m - 1}}\)

  2. (b)

    Fit a regression tree to the targets \(r_{im}\) giving terminal regions \(R_{jm}\), j = 1, 2, …, \(J_m\)

  3. (c)

    For j = 1, 2 …, \(J_m\) compute \(\gamma _{jm} = argmin_\gamma \mathop {\sum}\nolimits_{x_i \in R_{jm}} {L\left( {y_i,f_{m - 1}\left( {x_i} \right) + \gamma } \right)}\).

  4. (d)

    Update \(f_m\left( x \right) = f_{m - 1}\left( x \right) + \mathop {\sum}\nolimits_{j = 1}^{J_m} {\gamma _{jm}I\left( {x \in R_{jm}} \right).}\)

  1. 3.

    Output \(\hat f(x) = f_M(x)\)

In this algorithm, we have used a training set \(\left\{ {\left( {x_1,y_1} \right), \ldots ,\left( {x_n,y_n} \right)} \right\}\) of known values of x and y. Our objective is to find an approximation \(\hat f(x)\) to a function f(x) which minimizes the expected value of a loss function represented by \(L\left( {y,f\left( x \right)} \right)\). It does so by starting the model (step 1) with a constant function \(f_0(x)\), given in the initialization step 1 above. In step 2 (a), the pseudo-residuals given by \(r_{im}\) are computed. In 2(b), a weak base learner was fitted to rm, i.e., the model was trained using \(\left\{ {\left( {x,r_{im}} \right)} \right\}_{i = 1}^n\). Next, in step 2(c) the minimizer, \(\gamma _{jm}\) was found for \(\mathop {\sum}\nolimits_{x_i \in R_{jm}} {L\left( {y_i,f_{m - 1}\left( {x_i} \right) + \gamma } \right)}\). In 2(d), \(\gamma _{jm}\) obtained in 2(c), was used to update the model using expression \(f_m\left( x \right) = f_{m - 1}\left( x \right) + \mathop {\sum}\nolimits_{j = 1}^{J_m} {\gamma _{jm}I\left( {x \in R_{jm}} \right)}\), where \(I\left( {x \in R_{jm}} \right)\) is 1 if \(x \in R_{jm}\) and 0 if \(x \notin R_{jm}\). Step 2 is repeated M times to obtain the final model. For our study, we have used the GBR model through the sci-kit learn package36.

Construction of regressor

Our problem demands the construction of a regressor to predict the Ln (rcorr) with the 30 descriptors. For this purpose, we have created a model using the GBR as it has previously been shown to be a robust algorithm for materials properties prediction38. To assess the accuracy of a model, we have devised a two-stage approach. To understand this, we present a sample dataset to explain the approach. Consider the dataset A and dataset B shown in Fig. 7. Predictions of model A has a smaller MSE than that of model B. But the predictions of model B have a low absolute error (AE) ≤ 0.4 for all data as compared to model A that has AE > 0.4 for 2 data points. The above difference is mathematically of lower magnitude but has an important consequence when the magnitudes of test data are of an extremely small order of magnitude such as those of corrosion rates reported in the literature (10−3 to 10−1 mm/year). Hence, we adopt this criterion of count (of predictions in the designated AE range) based selection of the model. This technique of model selection is advantageous in our case as it is more important to have a model that may have a higher MSE but can make fairly accurate predictions for all data points that are of nearly the same order of magnitude as that of the experimental dataset, rather than a model that has lower MSE but makes few very accurate predictions and a larger number of highly inaccurate predictions.

Fig. 7: A simplified illustration of the count (of predictions in the designated AE range) based selection of model: predictions of model A have smaller MSE than that of model B.
figure 7

However, the predictions of model B have a low AE ≤ 0.4 for all data as compared to model A that has AE > 0.4 for 2 data points. Hence, model B is more advantageous in making a greater number of good predictions than fewer more accurate predictions as done by model A. We, therefore, adopt this criterion for selecting the best model here. Data shown here for illustrative purpose only.

The model has been optimized for the best performance via hyperparameter optimization. The hyperparameters of the GBR model namely the learning rate and n-estimators, were varied from 0.005 to 0.20 and from 50 to 500, respectively, to find the point at which the minimum MSE is produced using the count-based model selection technique described above. It was found that a minimum MSE of 0.016 mm/year was obtained at a learning rate of 0.01 and 150 n-estimators.