Elsevier

Applied Soft Computing

Volume 86, January 2020, 105837

Ensemble approach based on bagging, boosting and stacking for short-term prediction in agribusiness time series

https://doi.org/10.1016/j.asoc.2019.105837

Highlights

  • Ensembles and single models are compared for short-term forecasting in agribusiness.

  • Soybean and wheat commodities are adopted as case studies.

  • Boosting approaches showed lower prediction errors.

  • XGB or STACK and RF models are adopted in soybean and wheat cases, respectively.

  • Ensemble performance is better than SVR, KNN and MLP performance.

Abstract

The investigation of the accuracy of methods employed to forecast agricultural commodities prices is an important area of study. In this context, the development of effective models is necessary. Regression ensembles can be used for this purpose. An ensemble is a set of combined models which act together to forecast a response variable with lower error. Given this, the general contribution of this work is to explore the predictive capability of regression ensembles by comparing ensembles among themselves, as well as with approaches that consider a single model (reference models), to forecast agribusiness prices one month ahead. To this end, monthly time series of the price paid to producers in the state of Parana, Brazil for a 60 kg bag of soybean (case study 1) and wheat (case study 2) are used. The ensembles adopted are bagging (random forests, RF), boosting (gradient boosting machine, GBM, and extreme gradient boosting machine, XGB), and stacking (STACK). The support vector machine for regression (SVR), multilayer perceptron neural network (MLP) and K-nearest neighbors (KNN) are adopted as reference models. Performance measures such as mean absolute percentage error (MAPE), root mean squared error (RMSE), mean absolute error (MAE), and mean squared error (MSE) are used for model comparison. Friedman and Wilcoxon signed rank tests are applied to evaluate the models’ absolute percentage errors (APE). On the test set, MAPE lower than 1% is observed for the best ensemble approaches. The XGB/STACK (Least Absolute Shrinkage and Selection Operator-KNN-XGB-SVR) and RF models showed better performance for short-term forecasting in case studies 1 and 2, respectively. The APE of XGB/STACK and RF is statistically smaller than that of the reference models. In addition, approaches based on boosting are consistent, providing good results in both case studies. Overall, the models rank by performance as follows: XGB, GBM, RF, STACK, MLP, SVR and KNN. It can be concluded that the ensemble approach presents statistically significant gains, reducing prediction errors for the price series studied. The use of ensembles is recommended to forecast agricultural commodities prices one month ahead, since they yield more accurate models and reduce decision-making risk.

Introduction

The operations inherent to agribusiness constitute an important network of activities that strongly influence the local, regional, and national economy. In times of crisis, agribusiness has generally been one of the areas contributing to the growth of economic indicators such as the Gross Domestic Product (GDP). In the current Brazilian economic scenario, the GDP volume of agribusiness increased by 1.9%, facilitating national GDP growth and inflation control [1]. At the state level, according to recent reports from the Parana Institute for Economic and Social Development [2], the state of Parana ended 2017 with a GDP of R$ (BRL) 415.8 billion, equivalent to 6.35% of the Brazilian economy. Regarding production, according to the Ministry of Agriculture, Livestock and Food Supply (MAPA) [3], the area planted with grains is expected to increase by 14.9% in the next ten years. Agribusiness is therefore important for regional development, and it also contributes to family income. Given this, it is important to know future price quotations, since they may affect the economic planning of small, medium, and large producers.

In this context, forecasting time series of agricultural commodity prices plays an important role in the economic scenario. The precision of the forecasts is as important as the expected results, and it is directly linked to the accuracy of the adopted model. Forecasting matters across all areas of knowledge, as forecasts can be used as the basis for revision/implementation of public policies, development of strategic planning in companies, and decisions in the corporate world. In general, forecasting can be understood as the process of estimating the most probable result under a set of assumptions about aspects such as technology, environment, pricing, marketing, and production.

To improve the predictive capacity of regression models, including time series models, several approaches have been proposed in the literature. One of these is the regression ensemble [4], [5]. This machine learning principle follows the idea of divide-and-conquer and seeks to overcome the limitations of a machine learning model that operates in isolation [6], [7]. Ensembles consist of multiple base models (base learners, similar or not) developed to solve the same problem, each of which learns characteristics from the data and makes a prediction accordingly. The final prediction is then a combination of the individual predictions. The aim is to obtain a robust system that incorporates the predictions of all base learners. Among the basic ways to build ensembles for regression problems are combination by simple average and by weighted average. In general, this approach achieves better results than single prediction models, since it generalizes better and adapts to different scenarios [8]. The purpose of regression ensembles is to generate a set of models that improves prediction accuracy for numerical target variables [9]. This methodology has been used extensively over the years and has been successful in different applications such as energy [10], climate [8], market forecasting [11], finance [12], [13], and chemistry [14].
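The simple and weighted averaging of base-learner predictions described above can be illustrated with a minimal sketch. The function name `combine_forecasts` and the choice of weights are hypothetical and serve only as an illustration; they are not taken from the paper.

```python
from typing import List, Optional, Sequence


def combine_forecasts(
    forecasts: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Combine base-learner forecasts by simple or weighted averaging.

    forecasts[m][t] is the forecast of base learner m for time step t.
    With weights=None every learner gets the same weight (simple average).
    """
    n_models = len(forecasts)
    if n_models == 0:
        raise ValueError("at least one base learner is required")
    if weights is None:
        weights = [1.0] * n_models
    if len(weights) != n_models:
        raise ValueError("one weight per base learner is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    horizon = len(forecasts[0])
    if any(len(f) != horizon for f in forecasts):
        raise ValueError("all base learners must forecast the same steps")
    # Normalizing by the weight total makes the result a convex combination
    # whenever all weights are non-negative.
    return [
        sum(w * f[t] for w, f in zip(weights, forecasts)) / total
        for t in range(horizon)
    ]
```

For instance, `combine_forecasts([[70.0, 72.0], [74.0, 76.0]])` averages two learners equally, while passing `weights=[3.0, 1.0]` favors the first learner. In practice the weights would typically be derived from validation errors; how to choose them is left open here.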

From this perspective, this paper aims to explore the predictive capacity of regression ensembles by comparing ensembles among themselves, as well as with approaches that consider a single model (reference models), to forecast agricultural commodities prices one month ahead. The main motivation for choosing this forecast horizon is that short-term forecasting allows small, medium and large producers to develop strategic planning over a short period to meet their immediate needs. This paper conducts two case studies that consider historical monthly prices (R$) paid to producers in the state of Parana, Brazil, for 60 kg bags of soybean (case study 1) and wheat (case study 2). The specific objectives are as follows: (i) identify, from related works, the main variables that act as price drivers in the soybean and wheat cases, and extract their importance for each case study; (ii) present a theoretical review of the regression ensembles based on bagging (random forests, RF), boosting (gradient boosting machine, GBM, and extreme gradient boosting machine, XGB) and stacking (STACK); and (iii) verify whether the ensemble approaches outperform single models (K-Nearest Neighbor, KNN; Multilayer Perceptron Neural Network, MLP; and Support Vector Regression with linear kernel, SVR), as well as which structure is better in the adopted cases.

The main contribution of this paper to the soft computing and machine learning literature lies in the application, evaluation and addition of discussions not previously developed regarding the behavior of GBM, RF, STACK and XGB to forecast (in the short-term) agricultural commodities prices, with the use of the price drivers identified in Section 3.2. While the papers identified in Section 2 focus on the study of price volatility by hybrid models, as well as combining decomposition, optimization and machine learning models, this paper proposes the investigation of the performance of bagging, boosting and stacking ensemble approaches in the presence of different features (price drivers) regarding short-term forecasting for the soybean and wheat commodities.

Since few time series models using regression ensembles for agribusiness forecasting have been introduced in the literature, this work is expected to fill a gap in an area of the economy essential to the composition of economic indicators. By illustrating the superior performance of the ensemble models vis-à-vis reference approaches for time series forecasting in agribusiness, this study is expected to stimulate the interest of financial market researchers in this approach, since it performs well and enables managers to minimize their losses and leverage their earnings.

The remainder of this paper is structured as follows: Section 2 presents a set of related works on the subject of regression ensembles for commodities forecasting. Section 3 describes the data sets used in this paper, as well as the predictors employed in the modeling process. Section 4 introduces the theoretical aspects necessary for the development of this paper. Section 5 presents the adopted methodology. Section 6 details the results and discussions. Finally, Section 7 concludes the paper with general considerations and directions for future research.

Related works

Commodities comprise an important chain of products that directly influence the global economy. These products are grouped into the following categories: energy, metal, livestock, and agriculture [15]. In this section, some recent works on the prediction of prices, volatility, and future returns for the aforementioned sectors using regression ensembles are presented, and the main gaps are pointed out. Table 1 summarizes some related works with regard to objective (ensemble adopted),

Case studies

The objective of this section is to present the case studies adopted in this paper. The first subsection presents the cases, and the second subsection presents the predictive variables adopted in the modeling process as well as some considerations.

Theoretical aspects

Decision-making is an essential step in different situations, especially in data modeling. However, since the results obtained will be the basis for decision-making, they should be as accurate as possible. In this aspect, the description and implementation of techniques that allow forecasting with greater accuracy should be the object of study in many situations, in order to make decision-making more concise and assertive. Therefore, in the following Subsections we describe the main theoretical

Methodology for agribusiness time series modeling

To clarify the modeling steps, this section describes the main steps adopted in the data analysis. First, data preprocessing by the Box–Cox transformation is presented. Second, an overview of the hyperparameters is given, and the control hyperparameters for the adopted models are described in Table 6 in Section 6. Third, the time series cross-validation process is described. Fourth, an overview of STACK modeling is shown, as well as the performance measures.
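Two of the steps listed above, the Box–Cox transformation and time series cross-validation, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the transformation parameter `lam` is supplied by the caller rather than estimated, and the expanding-window (rolling-origin) scheme is one common form of time series cross-validation that may differ from the scheme actually used in the paper.

```python
import math
from typing import Iterator, List, Sequence, Tuple


def box_cox(y: Sequence[float], lam: float) -> List[float]:
    """Box-Cox transform: (y**lam - 1) / lam, or log(y) when lam == 0."""
    if any(v <= 0 for v in y):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]


def inv_box_cox(z: Sequence[float], lam: float) -> List[float]:
    """Inverse transform, mapping forecasts back to the original price scale."""
    if lam == 0:
        return [math.exp(v) for v in z]
    return [(lam * v + 1.0) ** (1.0 / lam) for v in z]


def expanding_window_splits(
    n_obs: int, initial_train: int, horizon: int = 1
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs that never train on future data.

    The training window starts with `initial_train` observations and grows
    by one observation per split; the test window is the next `horizon`
    observations (horizon=1 corresponds to one-step-ahead forecasting).
    """
    if initial_train < 1 or horizon < 1:
        raise ValueError("initial_train and horizon must be at least 1")
    for end in range(initial_train, n_obs - horizon + 1):
        yield list(range(end)), list(range(end, end + horizon))
```

With monthly data and `horizon=1`, each split trains on all months up to a given origin and tests on the following month, which matches the one-month-ahead setting of the case studies. Forecasts produced on the transformed scale would be passed through `inv_box_cox` before computing errors in the original units.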

Results and discussions

In this section, the exploratory analysis of the time series used in each case study is presented (Section 6.1), followed by the results of performance measures (Section 6.2), statistical tests for the test set errors, and residual analysis. Each Table and Figure is followed by a discussion. Table A.1 and Table A.2, which contain the performance measures of the 56 generated models, are presented in the Appendix.
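The performance measures and the rank-based tests applied to the test set errors can be sketched as below. This is an illustrative implementation with hypothetical function names, not the authors' code. It computes only the test statistics (the Friedman chi-square without tie correction, and the Wilcoxon signed-rank sums); p-values would normally come from a statistics library such as SciPy (`scipy.stats.friedmanchisquare`, `scipy.stats.wilcoxon`).

```python
import math
from typing import List, Sequence, Tuple


def ape(actual: Sequence[float], predicted: Sequence[float]) -> List[float]:
    """Absolute percentage error per observation, in percent."""
    return [abs((a - p) / a) * 100.0 for a, p in zip(actual, predicted)]


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    errors = ape(actual, predicted)
    return sum(errors) / len(errors)


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    return math.sqrt(mse(actual, predicted))


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values ascending (1 = smallest); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(errors_by_model: Sequence[Sequence[float]]) -> float:
    """Friedman chi-square statistic, without tie correction.

    errors_by_model[m][t] is the APE of model m on test observation t.
    Models are ranked within each observation and rank sums are compared.
    """
    k = len(errors_by_model)
    n = len(errors_by_model[0])
    rank_sums = [0.0] * k
    for t in range(n):
        row = [errors_by_model[m][t] for m in range(k)]
        for m, r in enumerate(average_ranks(row)):
            rank_sums[m] += r
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


def wilcoxon_rank_sums(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (W_plus, W_minus) for paired samples; zero differences are dropped."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus
```

A typical workflow would compute `ape` for every model on the same test observations, apply `friedman_statistic` to check whether any model differs, and then compare pairs of models with `wilcoxon_rank_sums`, where a small rank sum on one side suggests that one model's errors are systematically smaller.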

Conclusions and future research

This paper aimed to compare the predictive performance of the GBM, XGB, RF and STACK regression ensembles, as well as the MLP, SVR and KNN models in two case studies related to agribusiness, namely: case 1, the soybean price and case 2, the wheat price paid to the producer from the state of Parana, for short-term forecasting. Explanatory variables were chosen based on studies found in the literature. Different models were combined, in the stacking approach, to form the level-0 of the

Declaration of Competing Interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.asoc.2019.105837.

Acknowledgments

The authors would like to thank the National Council of Scientific and Technologic Development of Brazil (CNPq; grant numbers 303908/2015-7-PQ, 204893/2017-8 and 404659/2016-0-Univ) for its financial support of this work. Furthermore, the authors wish to thank the Editor and anonymous referees for their constructive comments and recommendations, which have significantly improved the presentation of this paper.

References (98)

  • Yu L. et al. LSSVR ensemble learning with uncertain parameters for crude oil price forecasting. Appl. Soft Comput. (2017)
  • Zhao Y. et al. A deep learning ensemble approach for crude oil price forecasting. Energy Econ. (2017)
  • Yang K. et al. Realized volatility forecast of agricultural futures using the HAR models with bagging and combination approaches. Int. Rev. Econ. Finance (2017)
  • Tang L. et al. A non-iterative decomposition-ensemble learning paradigm using RVFL network for crude oil price forecasting. Appl. Soft Comput. (2018)
  • Bonato M. et al. Gold futures returns and realized moments: A forecasting experiment using a quantile-boosting approach. Resour. Policy (2018)
  • Ding Y. A novel decompose-ensemble methodology with AIC-ANN approach for crude oil forecasting. Energy (2018)
  • Fernandez-Perez A. et al. Contemporaneous interactions among fuel, biofuel and agricultural commodities. Energy Econ. (2016)
  • Paris A. On the link between oil and agricultural commodity prices: Do biofuels matter? Int. Econ. (2018)
  • Bodart V. et al. Real exchanges rates, commodity prices and structural factors in developing countries. J. Int. Money Finance (2015)
  • Erdal H. et al. Bagging ensemble models for bank profitability: An emprical research on Turkish development and investment banks. Appl. Soft Comput. (2016)
  • Hamze-Ziabari S. et al. Improving the prediction of ground motion parameters based on an efficient bagging ensemble model of M5 and CART algorithms. Appl. Soft Comput. (2018)
  • Thakur M. et al. A hybrid financial trading support system using multi-category classifiers and random forest. Appl. Soft Comput. (2018)
  • Assouline D. et al. Large-scale rooftop solar photovoltaic technical potential estimation using random forests. Appl. Energy (2018)
  • He H. et al. A novel ensemble method for credit scoring: Adaption of different imbalance ratios. Expert Syst. Appl. (2018)
  • Persson C. et al. Multi-site solar power forecasting using gradient boosted regression trees. Sol. Energy (2017)
  • Touzani S. et al. Gradient boosting machine for modeling the energy consumption of commercial buildings. Energy Build. (2018)
  • Ding C. et al. Applying gradient boosting decision trees to examine non-linear effects of the built environment on driving distance in Oslo. Transp. Res. A (2018)
  • Ma X. et al. Study on a prediction of P2P network loan default based on the machine learning LightGBM and XGboost algorithms according to different high dimensional data cleaning. Electron. Commer. Res. Appl. (2018)
  • Wolpert D.H. Stacked generalization. Neural Netw. (1992)
  • Shamaei E. et al. Suspended sediment concentration estimation by stacking the genetic programming and neuro-fuzzy predictions. Appl. Soft Comput. (2016)
  • Serbes G. et al. An emboli detection system based on dual tree complex wavelet transform and ensemble learning. Appl. Soft Comput. (2015)
  • Petropoulos A. et al. A stacked generalization system for automated forex portfolio trading. Expert Syst. Appl. (2017)
  • Pernía-Espinoza A. et al. Stacking ensemble with parsimonious base models to improve generalization capability in the characterization of steel bolted components. Appl. Soft Comput. (2018)
  • Anifowose F. et al. Improving the prediction of petroleum reservoir characterization with a stacked generalization ensemble model of support vector machines. Appl. Soft Comput. (2015)
  • Qureshi A.S. et al. Wind power prediction using deep neural network based meta regression and transfer learning. Appl. Soft Comput. (2017)
  • Mabu S. et al. Ensemble learning of rule-based evolutionary algorithm using multi-layer perceptron for supporting decisions in stock trading problems. Appl. Soft Comput. (2015)
  • Messikh N. et al. The use of a multilayer perceptron (MLP) for modelling the phenol removal by emulsion liquid membrane. J. Environ. Chem. Eng. (2017)
  • Chen R. et al. Forecasting holiday daily tourist flow based on seasonal support vector regression with adaptive genetic algorithm. Appl. Soft Comput. (2015)
  • Wang J. et al. Improved v-support vector regression model based on variable selection and brain storm optimization for stock price forecasting. Appl. Soft Comput. (2016)
  • Shine P. et al. Machine-learning algorithms for predicting on-farm direct water and electricity consumption on pasture based dairy farms. Comput. Electron. Agric. (2018)
  • Weng B. et al. Predicting short-term stock prices using ensemble methods and online data sources. Expert Syst. Appl. (2018)
  • Bergmeir C. et al. A note on the validity of cross-validation for evaluating autoregressive time series prediction. Comput. Statist. Data Anal. (2018)
  • Flores B.E. The utilization of the wilcoxon test to compare forecasting methods: A note. Int. J. Forecast. (1989)
  • Fan J. et al. Comparison of support Vector machine and extreme gradient boosting for predicting daily global solar radiation using temperature and precipitation in humid subtropical climates: A case study in China. Energy Convers. Manage. (2018)
  • Zhang Y. et al. A gradient boosting method to improve travel time prediction. Transp. Res. C (2015)
  • Khanal S. et al. Integration of high resolution remotely sensed data and machine learning techniques for spatial prediction of soil properties and corn yield. Comput. Electron. Agric. (2018)
  • Pedro H.T. et al. Assessment of machine learning techniques for deterministic and probabilistic intra-hour solar forecasts. Renew. Energy (2018)
  • Thompson W. et al. Automatic responses of crop stocks and policies buffer climate change effects on crop markets and price volatility. Ecol. Econom. (2018)
  • Cepea. Centro de Estudos Avançados em Economia Aplicada. PIB do Agronegócio Brasileiro (2018)