Abstract

In the classical multivariate prediction model, most research studies focused on the selection of relevant behaviour factors and the stability of historical data for improving the predicting accuracy of the main behaviour factor, and the historical data of the main behaviour factor have never been considered as one relevant behaviour factor, which in fact can be the first key impact factor; besides, the historical data can directly predict the main behaviour in the time series forecasting model, such as the ARIMA model. In this paper, one modified MLR model combined with time series forecasting theory is presented and applied in grain consumption forecasting. In the proposed model, to improve the current grain consumption forecasting, how to select impact factors is also discussed by combining the grey relational degree and Pearson correlation coefficient with given weights, and the optimal preprocessing parameter by the moving average filtering is computed for eliminating the abnormal points and stabilizing the data. Finally, the selected main impact factors are inputted to the proposed modified MLR model for forecasting grain consumption. Simulation results have shown that the five-year mean absolute percentage error of ration and feed grain is 2.34% and 3.27%, respectively, and the prediction accuracy has improved up to 2 times compared with the BP model and LSTM model. Moreover, the robustness of the model is verified by prediction analysis at different time intervals of historical data.

1. Introduction

With the change of social, economic, and environmental factors, the grain consumption structure appears to have new features. There have been various prediction methods by research methods, focuses, and perspectives, which can be divided into qualitative analysis and quantitative analysis. Qualitative prediction is the analysis of the specialists based on the characteristics of historical data and intuitive materials, and the corresponding results rely on the experience and analysis ability of researchers. The quantitative prediction method is to build one mathematical model to predict the changing laws and the development trend in future, which has obtained wider acceptance and application. In general, the quantitative analysis methods include the time series prediction model, econometric equation (single equation and simultaneous equation), regression analysis prediction model, etc [1]. Cheng et al. considered the actual level of China’s current grain consumption and the future trend of China’s population development and predicted that the total grain consumption demand in China will not exceed 640 million tons in 2030 [2]. Gao used the time series method to calculate per capita grain consumption by selecting a national sample of residents and predicted total food consumption. He said that in 2020, China’s total grain consumption will reach to 595 million tons [3]. Chen et al. used the single equation econometric model to complete the prediction of per capita rural consumption and population changes in China; here the two variables were multiplied to obtain the total grain consumption, which will reach 76 million tons in 2020 and 65 million tons in 2030 [4]. Liao et al. proposed the modified CAPSIM-PODIUM model to forecast and analyze the grain demand of the whole country and nine major watershed slices, including rice, wheat, and corn; the results have shown that China’s total grain demand will reach 507 million tons in 2020 [5]. Pan et al. first established a regression analysis; here only two factors of food price and per capita disposable income are regarded as significant impacts on food consumption. Thus, the direct and indirect grain consumption models of rural residents per capita were obtained by the linear fitting model, and the predicted value of total grain consumption was obtained by combining the total population. The results have shown that direct and indirect grain consumption in China’s rural areas will reach 96.8658 million tons and 87.3173 million tons by 2020 and 58.1396 tons and 91.1647 million tons by 2030 [6]. From the existing research, the impact factors and historical data features are more important for the grain consumption prediction model. However, so far, it is not enough for the more comprehensive analysis. In this paper, one new combined model is proposed, including the selection of impact factors and the consideration of optimal preprocessing.

The main contributions of this paper are listed as follows:(1)To smooth original data and optimize the selection of impact factors, firstly, the abnormal points in the original data are removed by moving average filtering with the optimal window width, which makes the data trend to be smoothed and stabilized, and the relational degree of grain consumption and its impact factors are calculated by the proposed combined method, and thus the key impact factors are selected.(2)To minimize the prediction error of grain consumption, we combine the advantages of two models, and one new modified MLR model combined with time series forecasting theory based on data barycenter is proposed.

2. Research of Consumption Forecasting Method

China’s grain consumption can be divided into two categories: food consumption and nonfood consumption. Food consumption includes ration and feed grain consumption, while nonfood consumption includes industrial grain, seed grain, and loss grain. As shown in Figure 1, food consumption is the most important mode of grain consumption in China, so the research focus of this paper is the changing trend of ration and feed grain.

2.1. Moving Average Filtering with Optimal Window Span

Moving average filtering has been applied widely in numerical analysis, also known as linear filtering, which belongs to one-dimensional filtering. The algorithm is realized by recursion, that is, replacing the original value of the corresponding position with the average value of several neighborhood points, and moving average value is regarded as the trend representative value of the moving average term in this period. In this paper, all the historical data will be preprocessed by average moving filter for eliminating abnormal point and random fluctuations. Generally, when the filtering window width N is a positive odd number, the general formula of moving average filtering is defined aswhere is the input sequence, is the output sequence, and N is the window width.

Under different window widths, the smoothness and stabilization of the filtering original data will have a large difference. Here, coefficient of variation is adopted to evaluate the discreteness of observed values, which can be expressed by the ratio of standard deviation to average value of each observed value, and the symbol is CV [7]:where is the standard deviation, , and is the average value, . The smaller the coefficient of variation is, the better stability the group of data has. In fact, for obtaining higher prediction accuracy, the largest window width with the smallest coefficient of variation may not bring expected results; therefore, the optimal filtering window width will be chosen by the coefficient of variation and the prediction error.

2.2. Analysis of Impact Factors of Grain Consumption
2.2.1. Grey Relational Degree

The method of grey correlation analysis, also called “grey relational degree,” is used widely to calculate the correlation between variables [8], which can analyze the degree of correlation between two systems or two factors within the system under the condition of the change of time and speed and judge whether two data series are highly related by comparing their geometric similarity. In this paper, assuming the grain consumption data are a reference sequence and their impact factors are comparative sequence, the grey relational degree between the reference sequence and comparative sequence is calculated as follows [9]:(1)The reference sequence is ; comparative sequence is(2)Normalization is adopted to carry on a dimensionless preprocessing, and the normalized reference and comparative series are expressed as and , respectively.(3)Calculating the absolute difference sequence between reference sequence and comparative sequence :(4)Searching the maximum absolute difference sequence and minimum absolute difference sequence : the i-th grey relational coefficient is defined aswhere is the resolution coefficient, which is generally between 0.5 and 1. The smaller the parameter p is, the stronger the distinguishing ability for the difference among the relational coefficients is, usually taking 0.5.(5)Averaging the relational coefficients: the relational degree can be obtained as

2.2.2. Pearson Correlation Coefficient

Pearson correlation coefficient is a classical statistic proposed by Karl Pearson. In statistics, it is widely used to measure the degree of correlation between two variables, which can reflect the linear correlation between two variables. It is defined as the quotient of covariance and standard deviation between two variables [10]. Assuming two series and , the Pearson correlation coefficient between them iswhere r is between −1 and 1. The larger the absolute value of the correlation coefficient is, the higher the correlation between X and Y is, and the positive correlation coefficient means positive correlation, and the negative correlation coefficient represents negative correlation.

2.2.3. Combined Model of Selecting Main Impact Factors

Grain consumption includes ration consumption, feed-grain consumption, and industrial/seed/loss grain consumption. The changing social, economic, and environmental factors will produce a different effect on each component of the grain consumption; however, some uncorrelated impact factors and those with lower correlation will reduce the prediction accuracy and increase the complexity. Therefore, it is necessary to select main impact factors for the following prediction of the grain consumption.

From Sections 2.2.1 and 2.2.2, grey correlation analysis can embody the geometric similarity between any two factors, and Pearson correlation coefficient can reflect the linear correlation between two variables. How to combine the above two correlation degrees and select the real key impact factors will be one of innovations in this paper. In the existing references, only one correlation analysis usually is used to rank the correlation degree between one main behavioral factor and the reference series. Though for the same impact factors, there can produce the different ranking order of correlation degree by the above two correlation analysis methods [11]. One new combined model of selecting main impact factors is proposed in Figure 2. Here, grain consumption is the main behavioral factor, and its more possible impact factors are the reference series. From Figure 2, the correlation degree between them will firstly be computed separately by grey correlation analysis and Pearson correlation coefficient, and then one weighted operator will be adopted to obtain the final relational degree. Thus, the key impact factors can be chosen by the ranking order and one threshold of the final relational degree. Here, the threshold can be determined by the feedback of the following prediction error.

2.2.4. Prediction Model of Impact Factors

The ARIMA model is one time series prediction method proposed by Box and Jenkins in the early 1970s, which is also called the Box–Jenkins model [12, 13]. In the ARIMA (p, d, q) model, p is called autoregressive order and q is called moving average order. For obtaining better prediction, the original data should be stationary. The unit root test is often used to check the stationarity of time series. If the time series is nonstationary, it can be transformed into stationary time series by several times of difference [14].

Suppose that is the predictive value in t year and are the original values of various impact factors in past p years and set ; here is a single integer sequence with d order and is the stationary series; the general form of the ARMA model can be written as

Suppose L is the lag operator, thenand equation (9) can be rewritten aswhere

The d order difference transformation of ARMA(p, q) as equation (10) is called ARIMA(p, d, q):where is a white noise process with its mean value being 0 and variance being .

2.3. MLR Prediction Model Based on Data Barycenter

In the classical multivariate linear regression (MLR) model, the least-square method is usually used to compute the prediction parameters. Compared with the least-square method, the barycenter operators [15] can provide higher stable parameters for a multivariate regression model [16].

Assuming the dataset is , its first-order barycenter can be written as

If recorded , called a first-order barycenter operator. Similarly, its second-order barycenter can be written aswhere , called , a second-order barycenter operator.

And so forth, the k-order barycenter operator iswhere .

In the classical multivariate linear regression model [17, 18], the prediction value can be expressed ashere, are the prediction parameters; is the independent variables, and the number are of them, sets of observations of variables are expressed as.

In this paper, the k-order barycenter operators are adopted to compute the parameters of multivariate prediction model. The order of k-order barycenter operator is equal to the number of the prediction model parameters. The new proposed model based on data barycenter is called DB-based MLR.

In the DB-based MLR, equation (17) can be transformed as

Let k = 1, 2, …, p in equation (18); we can get one linear equation set. The corresponding parameters can be solved by Cramer's rule:where

Replacing the j-th column of D with (that is, the left term in equation (18)), we can get the coefficients , . Furthermore, taking equation (19) into equation (17), the optimal predictive model parameters can be obtained by several iterations.

2.4. Time Series Prediction Model

The time series prediction model is one of the quantitative prediction methods, which arranges the historical data of the predicted object into a time series in time order, analyzes its changing trend with time, and establishes a mathematical model to extrapolate the future value. [19]. The general structure of time series prediction model is expressed as follows:where are the prediction parameters and are the historical values of grain consumption. By the modeling idea and parameter calculating algorithms, the time series models can be divided into the moving average (MA) model, trend extrapolation model, exponential regression (ES) model, autoregressive (AR) model, ARMA model, ARIMA model, etc [2022]. It is same for the above time series to predict the future value by fitting its historical data, only using the different computing methods of model parameters.

2.5. Modified MLR Model Combined with Time Series Forecasting Theory

Summarizing the analysis of the time series model and classical MLR model, we can find that the two models cannot make full use of the historical data and the impact factors at the same time. In this paper, the historical data of prediction variable will be considered as one key impact factor, which will be imputed to the MLR model with the other chosen impact factors. The proposed modified MLR model can be expressed aswhere are the historical data of the prediction value ; they are the “internal factors” of the MLR model; besides, are the chosen key impact factors, which are “external factors.” Therefore, the modified MLR model can take advantages from adding the internal factors. In this model, the model parameters and can be computed at one time; therefore, the proposed model is different with the simple combining the two models, respectively. The prediction performance will be discussed in the following sections.

2.6. The Evaluation Index of Prediction Error

In this paper, the absolute percentage error (APE), the mean absolute percentage error (MAPE), and the Theil inequality coefficient of prediction residuals are used as the criteria for the model prediction. MAPE is the average ratio of all absolute percentage errors (APEs) to real values, and it is a percentage. In general, absolute error can avoid the problem of errors cancelling out each other; therefore, MAPE can accurately reflect the actual prediction error and can better reflect the reliability of prediction model. Besides, Theil inequality coefficient is a correlation index to measure the prediction accuracy. The closer the Theil inequality coefficient is to 0, the smaller the difference between the predicted value and the real value will be, which indicates the better fitting degree of the prediction model. The mean absolute percentage error and the Theil inequality coefficient are defined as

3. Simulation Analysis

3.1. Moving Average Filtering with Optimal Window Width

In experiment, the data are from the website of the National Statistical Office and “China Statistical Yearbook 2018” [23]. All impact factors and different kinds of grain consumptions will be preprocessed by the moving average filtering under different window widths, as shown in Figures 3 and 4. Here, urban ration (1981∼2017) is one example of the original data in the experiment.

From Figures 3 and 4, for different window width N, the larger the window width is, the smaller the coefficient of variation is and the smoother the filtered sequence is. Considering the following prediction accuracy, for given training data, the optimal window width can be determined by two indexes: coefficient of variation test and prediction error.

3.2. Selection of Impact Factors

As shown in Sections 2.2.1 and 2.2.2, the relational degree and order between each kind of grain consumption and the corresponding several impact factors are analyzed. The selection of key impact factors is shown as Table 1. Here, the grain consumption includes urban ration, rural ration, urban feed grain, and rural feed grain. The impact factors include population, urbanization level, Engel’s coefficient, per capita income, price index of agricultural products, etc.

3.3. Analysis of Model Prediction

In the modified prediction model proposed above, by considering the complexity of the algorithm and the comparison of experimental results, the final model determined by the project is as follows:where model parameters , and can be calculated by the data barycenter method.

The original data include grain consumption and several impact factors. In simulation, the first 33 years (1981–2013) of grain consumption data are trained in the model, and then the next five years (2014–2017) of grain consumption data are predicted. Defining the year of 1981 as t = 1 and 2013 as t = 33, the urban ration consumption, urban feed-grain consumption, rural ration consumption, and rural feed-grain consumption are, respectively, expressed as , and ; the five factors are, respectively, defined as , and . The forecasting process is shown as follows:(1)When t = 1∼37, the relational degree between and , and is calculated; the two key impact factors will be selected by the proposed combined method, expressed as and .(2)When t = 1∼33, , , , and are preprocessed by average moving filtering and are expressed as , , , and , respectively, which are used as training data into . The model parameters , and will calculated by data barycenter, and then the prediction model can be obtained.(3) and are predicted by the ARIMA model.(4)Assuming t = t + 1 and , the urban ration consumption next year is obtained by the proposed joint model.(5)Repeating (3)∼(4), the prediction of the urban ration consumption can be completed.

Similar to the forecasting process of rural ration consumption, the urban feed-grain consumption , rural ration consumption , and rural feed-grain consumption can be obtained, respectively. For obtaining the total grain consumption, we can sum the above prediction results.

3.3.1. Fitting Results

By the proposed method as equation (24), for every kind of grain consumption, two key impact factors and the historical grain consumption are chosen as the input of the time series-MLR joint model. However, for the classical MLR model, the input is only the two chosen key impact factors. The fitting results of the two models are shown in Figure 5(a) for feed grain prediction and Figure 5(b) for ration grain prediction.

From Table 2, it can be seen that the fitting goodness of ration and feed grain all can be improved. The fitting results of the proposed model can provide better prediction reliability.

3.3.2. Analysis of Prediction Results

For the experiments of predicting the future trend of grain consumption, we choose the filter window width equal to 3, 5, or 7 and establish the prediction model. Different filtering window widths will affect the model prediction performance; therefore, if we feed the prediction error back to adjust the filter parameters, the optimal filtering window width can be chosen. Here, the chosen optimal window width is 5 and 3, respectively, for ration consumption prediction and feed grain consumption prediction. In this paper, the MLR model and modified MLR model are established based on these optimal preprocessing methods.

In addition, in our simulation, a classical time series ARIMA model is adopted to predict the ration and feed grain consumption in China. Besides, deep learning prediction is also considered for comparison, which is the emerging research hotspot in recent years [2430]. Five kinds of prediction models are considered for comparing APE, MAPE, and Theil’s U performance.(1)The classical time series ARIMA model:where is the actual value of the past p years; the order of p and q can be realized by model recognition and and can be calculated by the least-square method.(2)The MLR model:The two impact factors with the greatest relational degree as the key impact factors in the modeling were input into the MLR prediction model:where are key impact factors; model parameters can be calculated by the data barycenter method.(3)The modified MLR model:Considering the complexity of the algorithm and the prediction performance, the proposed MLR model iswhere model parameters , and can be calculated by the data barycenter method; is the future prediction value; is the historical data of the prediction variable; and are the chosen optimal impact factors.(4)The BP model:Backpropagation (BP) is one of the artificial neural networks (ANNs) applied widely, which can train the model parameters by reversing the error propagation algorithm [2428]. The prediction frame of the BP model is shown in Figure 6.There are three layers to predict the future value from the three inputted history data; here the hidden layer is used to analyze the characteristics of input data. In Figure 6, is the inputted history data of grain consumption, is the future prediction value, and and are the weighting values of the trained model.(5)The LSTM model:LSTM is one special recurrent neural network (RNN), which substitutes the original neural units of hidden layers for memory units [29, 30]. Each memory unit has an input door, output door, forgotten door, and memory cells, as shown in Figure 7.

Here is the input history data of grain consumption, is the output of different memory unit, is the prediction value of each layer, and so on; we can get the final future value .

The comparison of the prediction results of these five models is shown in Tables 3 and 4.

From Tables 3 and 4, for the ARIMA model, the closer the forecast year is, the higher the prediction accuracy of the model is. For the MLR model, the further away the predicted year is, the higher the prediction accuracy of the model is. This is because the time series prediction model is more suitable for short-term prediction and the multiple regression model is suitable for long-term prediction. However, it can be found that the prediction accuracy of the proposed modified MLR model is higher than that of the single time series model and the classical MLR model. It is mainly because the combined model can not only track the change of impact factors but also grasp the internal change trend of grain consumption, thereby leading to a small error range. And the proposed model is superior to the BP model and LSTM model for the two indexes of MAPE and Theil’s U.

Meanwhile, in order to verify the stability of the modified MLR model, two experiments are carried out: (1) the training time interval is 1981–2014, and the predicting time interval is 2015–2018; (2) the training time interval is 1981–2013, and the predicting time interval is 2014–2017. The prediction performance of ration consumption and feed grain consumption is shown in Table 5.

From Table 5, it can be seen that the prediction error limit is kept within 2% for ration and feed grain consumption under different time intervals, which indicates that the proposed model can realize stable prediction with satisfying accuracy for the medium-short-term prediction.

4. Conclusions

In this paper, one new modified MLR model combined with time series forecasting theory is proposed. In the proposed model, the historical data of prediction variable will be considered as one key impact factor, which will be imputed to the MLR model with the other chosen impact factors. Thus, it can make full use of the advantages of the time series model and the MLR model, and the prediction performance has been improved strikingly, compared with the MLR model, ARIMA model, and even the BP model and LSTM model. Besides, there are other innovative works; the original data are preprocessed by moving average filtering under optimal window, and the data barycenter method is also adapted to compute the model parameters of MLR for improving the robustness. The proposed prediction model can be applied widely in short-medium-term prediction of grain prediction or the other fields.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was financially supported by the National Natural Science Foundation of China (61871176 and 61741107), Key Projects of Henan Science and Technology Department (202102110265), Applied Research Plan of Key Scientific Research Projects in Henan Colleges and Universities (19A510011), and Scientific Research Foundation Natural Science Project in Henan University of Technology (2018RCJH18).