Ensemble approach based on bagging, boosting and stacking for short-term prediction in agribusiness time series

doi:10.1016/j.asoc.2019.105837

Applied Soft Computing

Volume 86, January 2020, 105837

https://doi.org/10.1016/j.asoc.2019.105837

Highlights

• Ensembles and single models are compared for short-term forecasting in agribusiness.
• Soybean and wheat commodities are adopted as case studies.
• Boosting approaches showed lower prediction errors.
• XGB or STACK and RF models are adopted in the soybean and wheat cases, respectively.
• Ensemble performance is better than SVR, KNN and MLP performance.

Abstract

Investigating the accuracy of methods used to forecast agricultural commodity prices is an important area of study, and effective models are needed for this purpose. Regression ensembles can be used to this end. An ensemble is a set of combined models that act together to forecast a response variable with lower error. The general contribution of this work is to explore the predictive capability of regression ensembles by comparing ensembles among themselves, as well as with approaches that consider a single model (reference models), to forecast agribusiness prices one month ahead. Monthly time series of the price paid to producers in the state of Parana, Brazil, for a 60 kg bag of soybean (case study 1) and wheat (case study 2) are used. The ensemble approaches adopted are bagging (random forests, RF), boosting (gradient boosting machine, GBM, and extreme gradient boosting machine, XGB), and stacking (STACK). The support vector machine for regression (SVR), multilayer perceptron neural network (MLP) and K-nearest neighbors (KNN) are adopted as reference models. Performance measures such as mean absolute percentage error (MAPE), root mean squared error (RMSE), mean absolute error (MAE), and mean squared error (MSE) are used to compare the models. Friedman and Wilcoxon signed rank tests are applied to evaluate the models' absolute percentage errors (APE). On the test set, MAPE lower than 1% is observed for the best ensemble approaches. The XGB/STACK (Least Absolute Shrinkage and Selection Operator-KNN-XGB-SVR) and RF models showed the best performance for short-term forecasting in case studies 1 and 2, respectively. The APE of XGB/STACK and RF is statistically smaller than that of the reference models. In addition, boosting-based approaches are consistent, providing good results in both case studies. Ranked by performance, the models are ordered XGB, GBM, RF, STACK, MLP, SVR and KNN. It can be concluded that the ensemble approach presents statistically significant gains, reducing prediction errors for the price series studied. The use of ensembles is recommended to forecast agricultural commodity prices one month ahead, since they increase the accuracy of the constructed model and reduce decision-making risk.

Introduction

The operations inherent to agribusiness constitute an important network of activities that have a strong influence on the local, regional, and national economy. Generally, in times of crisis, agribusiness has been one of the areas that have contributed to the growth of different indicators related to the economy, such as the Gross Domestic Product (GDP). In view of the current Brazilian economic scenario, the GDP volume of agribusiness increased by 1.9%, facilitating national GDP growth and inflation control [1]. At the state level, according to recent reports from the Parana Institute for Economic and Social Development [2], the Parana state ended 2017 with a GDP of R$ (BRL) 415.8 billion, equivalent to 6.35% of the Brazilian economy. Regarding production, according to the Ministry of Agriculture, Livestock and Food Supply (MAPA) [3], the area planted with grains is expected to increase by 14.9% in the next ten years. Indeed, this is an important area for the development of the regions, and it also contributes to family income. Faced with this, it is important to know future price quotations, since this may affect the economic planning of small, medium, and large producers.

In this aspect, the forecast for time series of agricultural commodities prices plays an important role in the economic scenario. Also, the precision of the forecasts is as important as the expected results, and it is directly linked to the accuracy of the adopted model. Its importance permeates all areas of knowledge, as forecasts can be used as the basis for revision/implementation of public policies, development of strategic planning in companies, and decisions in the corporate world. In general, forecasting can be understood as a process of idealizing a more probable result, assuming a set of assumptions inherent in several aspects such as technology, environment, pricing, marketing, and production.

Several approaches have been proposed in the literature to improve the predictive capacity of regression models, among which we can highlight time series models. One of these approaches is the regression ensemble [4], [5]. The principle comes from machine learning: it rests on the idea of divide-and-conquer and seeks to overcome the limitations of a machine learning model that operates in isolation [6], [7]. Ensembles consist of multiple base models (base learners, similar or not) developed to solve the same problem, each of which learns characteristics from the data and makes a prediction accordingly. The final prediction is then a combination of the individual predictions. The aim is a robust system that incorporates the predictions of all base learners. Among the basic ways to build ensembles for regression problems are combining predictions by simple average and by weighted average. In general, this approach achieves better results than single prediction models, since it has a better capacity to generalize and adapt to different scenarios [8]. The purpose of regression ensembles is to generate a set of models that improves prediction accuracy when the target variables are numerical [9]. This methodology has been used extensively over the years and has been successful in applications such as energy [10], climate [8], market forecasting [11], finance [12], [13], and chemistry [14].
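As an illustration of combining by simple and weighted average, the following minimal sketch merges the predictions of several base learners for a single time step. The function name `combine_by_average`, the example predictions and the weights are hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def combine_by_average(predictions: Sequence[float],
                       weights: Optional[Sequence[float]] = None) -> float:
    """Combine base-learner predictions for one time step.

    With no weights, return the simple average; otherwise return the
    weighted average, normalised by the sum of the weights.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    if weights is None:
        return sum(predictions) / len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("weights and predictions must have equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, predictions)) / total_weight


# Hypothetical one-month-ahead price forecasts (R$ per 60 kg bag)
# from three base learners; the values are illustrative only.
base_predictions = [70.0, 72.0, 74.0]

simple_forecast = combine_by_average(base_predictions)
weighted_forecast = combine_by_average(base_predictions, weights=[0.5, 0.3, 0.2])
```

In practice the weights would usually be chosen from validation performance, for example giving more weight to base learners with lower error; the weights above are arbitrary.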

From this perspective, this paper aims to explore the predictive capacity of regression ensembles by comparing ensembles among themselves, as well as with approaches that consider a single model (reference models), to forecast agricultural commodity prices one month ahead. This forecast horizon was chosen because short-term forecasting allows small, medium and large producers to develop strategic planning for a short period of time and meet their immediate needs. This paper conducts two case studies in which historical data of monthly prices (R$) paid to producers in the state of Parana, Brazil, are considered for 60 kg bags of soybean (case study 1) and wheat (case study 2). The specific objectives are: (i) identify, from related works, the main variables that act as price drivers in the soybean and wheat cases, and extract their importance for each case study; (ii) present a theoretical review of the regression ensembles based on bagging (random forests, RF), boosting (gradient boosting machine, GBM, and extreme gradient boosting machine, XGB) and stacking (STACK); and (iii) verify whether the ensemble approaches outperform single models (K-Nearest Neighbor, KNN; Multilayer Perceptron Neural Network, MLP; and Support Vector Regression with linear kernel, SVR), as well as which structure is better in the adopted cases.

The main contribution of this paper to the soft computing and machine learning literature lies in the application, evaluation and addition of discussions not previously developed regarding the behavior of GBM, RF, STACK and XGB to forecast (in the short-term) agricultural commodities prices, with the use of the price drivers identified in Section 3.2. While the papers identified in Section 2 focus on the study of price volatility by hybrid models, as well as combining decomposition, optimization and machine learning models, this paper proposes the investigation of the performance of bagging, boosting and stacking ensemble approaches in the presence of different features (price drivers) regarding short-term forecasting for the soybean and wheat commodities.

Since few time series models using regression ensembles for agribusiness forecasting have been introduced in the literature, this work is expected to fill a gap in an area of the economy that is essential to the composition of economic indicators. By illustrating the superior performance of ensemble models over reference approaches for forecasting agribusiness time series, this study is also expected to stimulate the interest of financial market researchers in this approach, since it performs well and enables managers to minimize their losses and leverage their earnings.

The remainder of this paper is structured as follows: Section 2 presents a set of related works on the subject of regression ensembles for commodities forecasting. Section 3 describes the data sets used in this paper, as well as the predictors employed in the modeling process. Section 4 introduces the theoretical aspects necessary for the development of this paper. Section 5 presents the adopted methodology. Section 6 details the results and discussions. Finally, Section 7 concludes the paper with general considerations and directions for future research.

Section snippets

Related works

Commodities comprise an important chain of products that directly influence the global economy. These products are grouped into the following categories: energy, metal, livestock, and agriculture [15]. In this section, some recent works on the prediction of prices, volatility, and future returns for these sectors using regression ensembles are presented, and the main gaps are pointed out. Table 1 summarizes some related works with regard to objective (ensemble adopted),

Case studies

The objective of this section is to present the case studies adopted in this paper. The first subsection presents the cases, and the second subsection presents the predictive variables adopted in the modeling process as well as some considerations.

Theoretical aspects

Decision-making is an essential step in different situations, especially in data modeling. However, since the results obtained will be the basis for decision-making, they should be as accurate as possible. In this aspect, the description and implementation of techniques that allow forecasting with greater accuracy should be the object of study in many situations, in order to make decision-making more concise and assertive. Therefore, in the following Subsections we describe the main theoretical

Methodology for agribusiness time series modeling

This section describes the main steps adopted in the data analysis. First, data preprocessing by the Box–Cox transformation is presented. Second, an overview of the hyperparameters is given; the control hyperparameters for the adopted models are listed in Table 6 in Section 6. Third, the time series cross-validation process is described. Fourth, an overview of STACK modeling is shown, as well as the performance measures.
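The first and third steps can be sketched as follows. This is a minimal illustration only: the Box–Cox formula is the standard one, but the value of `lam`, the expanding-window (rolling-origin) scheme, the function names and the example prices are assumptions, since the paper's actual settings are not given in this excerpt.

```python
import math
from typing import List, Sequence, Tuple


def box_cox(values: Sequence[float], lam: float) -> List[float]:
    """Standard Box-Cox transform; requires strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def inverse_box_cox(values: Sequence[float], lam: float) -> List[float]:
    """Map transformed values back to the original scale."""
    if lam == 0:
        return [math.exp(v) for v in values]
    return [(lam * v + 1.0) ** (1.0 / lam) for v in values]


def rolling_origin_splits(n_obs: int, initial_train: int,
                          horizon: int = 1) -> List[Tuple[List[int], List[int]]]:
    """Expanding-window splits: each training set ends right before its
    test window, so no future observation leaks into training."""
    splits = []
    end = initial_train
    while end + horizon <= n_obs:
        train_idx = list(range(end))
        test_idx = list(range(end, end + horizon))
        splits.append((train_idx, test_idx))
        end += 1  # move the forecast origin forward by one observation
    return splits


# Hypothetical monthly prices (R$ per 60 kg bag); illustrative only.
prices = [60.0, 62.5, 61.0, 65.0, 66.5, 70.0]
lam = 0.5  # assumed value; in practice it is estimated from the data

transformed = box_cox(prices, lam)
restored = inverse_box_cox(transformed, lam)
splits = rolling_origin_splits(len(prices), initial_train=3, horizon=1)
```

Each split trains only on observations that precede the test window, which is the property that distinguishes time series cross-validation from random k-fold splitting.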

Results and discussions

In this section, the exploratory analysis of the time series used in each case study is presented (Section 6.1), followed by the results of performance measures (Section 6.2), statistical tests for the test set errors, and residual analysis. Each table and figure is followed by a discussion. Tables A.1 and A.2, which contain the performance measures of the 56 generated models, are presented in the Appendix.
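For reference, the performance measures named in the abstract (APE, MAPE, MAE, MSE and RMSE) can be computed as in the sketch below, which uses their standard definitions. The function name `forecast_errors` and the example values are hypothetical and do not come from the paper's results.

```python
import math
from typing import Dict, List, Sequence, Union


def forecast_errors(actual: Sequence[float],
                    predicted: Sequence[float]) -> Dict[str, Union[float, List[float]]]:
    """Compute standard point-forecast error measures.

    APE is returned per observation (in percent); the other measures
    are aggregated over all observations.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("APE is undefined when an actual value is zero")
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ape = [abs(e) / abs(a) * 100.0 for e, a in zip(errors, actual)]
    mse = sum(e * e for e in errors) / n
    return {
        "APE": ape,
        "MAPE": sum(ape) / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
    }


# Illustrative values only, not results from the paper.
measures = forecast_errors(actual=[100.0, 200.0], predicted=[110.0, 190.0])
```

The per-observation APE values are the quantities that paired tests such as the Wilcoxon signed rank test would compare between two models.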

Conclusions and future research

This paper aimed to compare the predictive performance of the GBM, XGB, RF and STACK regression ensembles, as well as the MLP, SVR and KNN models in two case studies related to agribusiness, namely: case 1, the soybean price and case 2, the wheat price paid to the producer from the state of Parana, for short-term forecasting. Explanatory variables were chosen based on studies found in the literature. Different models were combined, in the stacking approach, to form the level-0 of the

Declaration of Competing Interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.asoc.2019.105837.

Acknowledgments

The authors would like to thank National Council of Scientific and Technologic Development of Brazil — CNPq (Grants number: 303908/2015-7-PQ, 204893/2017-8 and 404659/2016-0-Univ) for its financial support of this work. Furthermore, the authors wish to thank the Editor and anonymous referees for their constructive comments and recommendations, which have significantly improved the presentation of this paper.

References (98)

Ren Y. et al.
Ensemble methods for wind and solar power forecasting—A state-of-the-art review
Renew. Sustain. Energy Rev.
(2015)
Soares E. et al.
Ensemble of evolving data clouds and fuzzy models for weather time series prediction
Appl. Soft Comput.
(2018)
Torres-Barrán A. et al.
Regression tree ensembles for wind energy and solar radiation prediction
Neurocomputing
(2019)
Zhang X.-d. et al.
Stock trend prediction based on a new status box method and AdaBoost probabilistic support vector machine
Appl. Soft Comput.
(2016)
Krauss C. et al.
Deep neural networks, gradient-boosted trees, random forests: statistical arbitrage on the S&P 500
European J. Oper. Res.
(2017)
Weng B. et al.
Macroeconomic indicators alone can predict the monthly closing price of major U.S. indices: Insights from artificial intelligence, time-series analysis and hybrid models
Appl. Soft Comput.
(2018)
Peimankar A. et al.
Multi-objective ensemble forecasting with an application to power transformers
Appl. Soft Comput.
(2018)
He K. et al.
Ensemble forecasting of value at risk via multi resolution analysis based methodology in metals markets
Expert Syst. Appl.
(2012)
Pierdzioch C. et al.
A boosting approach to forecasting the volatility of gold-price fluctuations under flexible loss
Resour. Policy
(2016)
Yu L. et al.
A novel decomposition ensemble model with extended extreme learning machine for crude oil price forecasting
Eng. Appl. Artif. Intell.
(2016)
Pedro H.T. et al.
Assessment of machine learning techniques for deterministic and probabilistic intra-hour solar forecasts
Renew. Energy
(2018)
Thompson W. et al.
Automatic responses of crop stocks and policies buffer climate change effects on crop markets and price volatility
Ecol. Econom.
(2018)
Cepea
Centro de Estudos Avançados em Economia Aplicada. PIB do Agronegócio Brasileiro
(2018)

