Abstract

Short-term load forecasting (STLF) is an essential and challenging task for power- or energy-providing companies. Recent research has demonstrated that a framework called “decomposition and ensemble” is very powerful for energy forecasting. To improve the effectiveness of STLF, this paper follows this framework and proposes a novel approach integrating the improved complete ensemble empirical mode decomposition with adaptive noise (ICEEMDAN), grey wolf optimization (GWO), and multiple kernel extreme learning machine (MKELM), namely, ICEEMDAN-GWO-MKELM, for STLF. The proposed ICEEMDAN-GWO-MKELM consists of three stages. First, the complex raw load data are decomposed into a number of relatively simple components by ICEEMDAN. Second, MKELM is used to forecast each decomposed component individually. Specifically, we use GWO to optimize both the weight and the parameters of every single kernel in the extreme learning machine to improve the forecasting ability. Finally, the results of all the components are aggregated as the final forecasting result. Extensive experiments reveal that the ICEEMDAN-GWO-MKELM can outperform several state-of-the-art forecasting approaches in terms of the evaluation criteria, showing that the ICEEMDAN-GWO-MKELM is very effective for STLF.

1. Introduction

Accurate load forecasting plays a significant role for the participants in the electricity industry because it supports safe, reliable, and automatic management of a smart grid. According to the length of the period involved, load forecasting tasks can be divided into three groups: long-term forecasting, mid-term forecasting, and short-term forecasting. Among them, short-term load forecasting (STLF) has become the most active research topic in load forecasting because it can not only increase scheduling efficiency but also reduce operating costs [1, 2].

In the last decades, time series-based models, such as random walk, moving average (MA), autoregressive integrated MA (ARIMA), ARIMA with explanatory variables (ARIMAX), and generalized autoregressive conditional heteroskedasticity (GARCH), have been widely used in load forecasting [3, 4]. Lee and Ko embedded a lifting scheme into ARIMA models to improve the forecasting ability, and the simulation results verified the effectiveness for STLF [3]. Cui and Peng introduced temperature into ARIMA to propose an improved ARIMAX to deal with mutating data structures [4]. However, because these time series-based models are usually built on the assumption that the load data are linear and stationary, which is not always met in practical load data, the forecasting accuracy is limited. In fact, recent research has demonstrated that load data are usually nonlinear and nonstationary. Therefore, it is necessary to use other models instead of time series-based models to improve load forecasting accuracy. Artificial intelligence (AI) models are thought to have the capacity to capture the intrinsic features of complicated signals. Therefore, they have become more and more popular in energy forecasting. Typical AI models include support-vector regression (SVR) and its extension least-squares SVR (LSSVR) [5, 6], artificial neural networks (ANN) [7–9], extreme learning machine (ELM) [10], sparse Bayesian learning (SBL) [11, 12], deep learning [13] (stacked denoising autoencoders (SDAs) [14], deep belief networks (DBN) [15], convolutional neural networks (CNN) [16], and long short-term memory (LSTM) [17]), and nature-inspired optimization algorithms [18, 19]. For example, Chen et al. put forward a new SVR model that used the temperature before demand response as an additional input variable for STLF [6]. Kulkarni et al. proposed a spiking neural network to forecast short-term load with consideration of weather variables [8]. Yoem and Kwak used an ELM with knowledge representation for STLF, and the experimental results indicated good performance of the approach [10]. Han et al. presented a time-dependency CNN and a cycle-based LSTM for STLF by mapping the load data as pixels and rearranging them into a 2-D image, and the extensive experiments demonstrated that the proposed models were superior to the compared models in terms of computational complexity [13]. Some other research aims to forecast load data accurately using hybrid models [20, 21].

As far as energy time series forecasting is concerned, recent studies have shown that a framework called “decomposition and ensemble” is able to improve the forecasting performance significantly. The main idea of this framework is to decompose the raw energy data series into several simpler components, then handle each component individually, and finally integrate the results from all components as the final forecasting result. This framework is a typical form of the strategy of “divide and conquer” that is widely used in energy price forecasting [22–24], wind speed forecasting [25, 26], load forecasting [27, 28], biosignal processing [29, 30], fault diagnosis [31], image processing [32–35], and so on. There are many types of decomposition methods that can be applied to decomposing energy time series. Among them, the members of the family of empirical mode decomposition (EMD), i.e., ensemble EMD (EEMD), complete EEMD (CEEMD), CEEMD with adaptive noise (CEEMDAN), and the improved CEEMDAN (ICEEMDAN), have become very popular in signal decomposition [36–39]. With them, the complicated raw energy time series can be decomposed into several components, some of high frequency and some of low frequency, making it easier to forecast each component individually than to forecast the raw energy time series directly. Existing research has revealed that ICEEMDAN is better than the other decomposition methods in the EMD family [39]. Theoretically, any regression method in AI can be applied to forecasting each component. ELM has shown its strength in both classification and regression tasks in recent years [40]. In particular, ELM is able to achieve satisfactory results in time series forecasting [41–43]. When the kernel trick is introduced into ELM, it has the ability to further improve the performance of STLF [44, 45]. However, the existing kernel ELM usually uses one kernel or a simple combination of several kernels to construct the kernel matrix as the input of ELM, and the parameters in the single kernel or the weights of the several kernels are optimized by algorithms or specified by users. The number and types of single kernels in such an ELM may limit forecasting effectiveness. An ideal way is to construct the kernel matrix using multiple kernels built from different types of base kernels, with both the parameters and the weights optimized adaptively. Some nature-inspired algorithms, such as ant colony optimization [46, 47], particle swarm optimization (PSO) [48, 49], and grey wolf optimization (GWO) [50, 51], have the potential to optimize the parameters and the weights of the involved kernels simultaneously. Recent research has demonstrated that GWO outperforms some other nature-inspired algorithms in numerical optimization [52, 53].

Inspired by the power of ICEEMDAN in signal decomposition, GWO in numerical optimization, and multiple kernel ELM (MKELM) in regression or prediction, this paper proposes a novel approach that integrates ICEEMDAN, GWO, and MKELM, so-called ICEEMDAN-GWO-MKELM, for STLF. Specifically, the proposed ICEEMDAN-GWO-MKELM is composed of three stages. First, the raw load series is decomposed into a number of components, some showing high-frequency characteristics while others showing low-frequency ones. Second, an independent MKELM model is built on each component. To improve the representation ability of the kernel matrix, GWO is applied to optimizing both the parameters and the weight of each base kernel simultaneously. Finally, the predicted results of all the individual components are simply summed as the final forecasting result.

The novelty of this paper lies in the following two aspects: (1) it is the first time that the combination of ICEEMDAN, GWO, and ELM is used for energy forecasting, especially for STLF. (2) A novel multiple kernel learning (MKL) framework optimizing both the parameters and the weight of every single kernel with GWO is proposed. We conduct extensive experiments to compare the proposed ICEEMDAN-GWO-MKELM with many state-of-the-art approaches, and the results demonstrate that the ICEEMDAN-GWO-MKELM is able to outperform the compared approaches in most cases.

The remaining part of this paper is structured as follows. Section 2 briefly describes ICEEMDAN, GWO, ELM, and MKL. The proposed ICEEMDAN-GWO-MKELM is methodologically formulated in detail in Section 3. To evaluate the proposed method, we conduct extensive experiments, and the results are analyzed and discussed in Section 4. Finally, the paper is concluded in Section 5.

2. Methods

2.1. The Improved Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (ICEEMDAN)

The ICEEMDAN is a signal decomposition algorithm proposed by Colominas et al. based on CEEMDAN in 2014 [39]. It is a new member of the family of EMD.

The EMD is a signal decomposition technique that can decompose a data series into a set of components with different frequencies, including intrinsic mode functions (IMFs) and a residue [36]. An IMF must satisfy the following two conditions: (1) over the whole data series, the number of extrema (maxima and minima) and the number of zero crossings must be equal or differ at most by one; (2) at any point, the mean of the upper and lower envelopes must be zero.

The EMD algorithm is described as follows:
Step 1: connect the local maxima and the local minima of the raw data by two cubic splines to form the upper envelope and the lower envelope, respectively.
Step 2: average the upper and lower envelopes, and subtract this mean from the original signal to obtain a new sequence.
Step 3: examine whether the new sequence from Step 2 meets the two conditions listed above; if not, take the new sequence as the raw data and go to Step 1; otherwise, take the new sequence as one IMF and subtract it from the original sequence to obtain a residue (the local mean).
Step 4: examine whether the residue from Step 3 is a monotonic function or its amplitude is below a predetermined threshold; if not, take the residue as the raw data and go to Step 1; otherwise, end the decomposition and output all the IMFs and the final residue.
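
To make the sifting procedure above concrete, the following is a minimal Python sketch of EMD (not the implementation used in this paper); the envelope construction via scipy cubic splines and the fixed number of sifting passes are simplifying assumptions.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema


def sift_once(x):
    """One sifting pass: subtract the mean of the upper and lower cubic-spline envelopes."""
    t = np.arange(len(x))
    maxima = argrelextrema(x, np.greater)[0]
    minima = argrelextrema(x, np.less)[0]
    if len(maxima) < 3 or len(minima) < 3:      # too few extrema to build envelopes
        return None
    upper = CubicSpline(maxima, x[maxima])(t)   # upper envelope through the local maxima
    lower = CubicSpline(minima, x[minima])(t)   # lower envelope through the local minima
    return x - (upper + lower) / 2.0


def emd(x, max_imfs=10, sift_iters=10):
    """Simplified EMD: returns a list of IMFs followed by the final residue."""
    x = np.asarray(x, dtype=float)
    imfs, residue = [], x.copy()
    for _ in range(max_imfs):
        h = residue.copy()
        for _ in range(sift_iters):             # fixed number of sifting passes (simplified stopping rule)
            h_new = sift_once(h)
            if h_new is None:
                break
            h = h_new
        if np.allclose(h, residue):             # nothing could be sifted: residue is (nearly) monotonic
            break
        imfs.append(h)
        residue = residue - h
        if len(argrelextrema(residue, np.greater)[0]) < 3:   # residue has too few extrema: stop
            break
    return imfs + [residue]


# Toy usage: a two-tone signal plus a trend decomposes into oscillatory IMFs and a residue
t = np.linspace(0, 1, 1024)
signal = np.sin(2 * np.pi * 30 * t) + 0.5 * np.sin(2 * np.pi * 5 * t) + t
components = emd(signal)
print(len(components), np.max(np.abs(np.sum(components, axis=0) - signal)))  # exact reconstruction
```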

However, EMD is prone to mode mixing during the decomposition process [37]. The main manifestation of this problem is that a single IMF contains signals of widely different scales. The main reason is as follows: when an abnormal event occurs, the extreme points in the signal become unevenly distributed, which distorts the shape of the envelopes. The resulting IMF thus contains both the intrinsic mode of the original signal and the intrinsic mode of the adjacent time scale introduced by the anomalous event.

In order to overcome the shortcomings of EMD, Wu and Huang put forward the ensemble EMD (EEMD) method, which is an integrated, straightforward, and adaptive decomposition method for nonlinear and nonstationary time series [37]. Like EMD, EEMD decomposes the original data into a number of IMFs and one residue, and each decomposed component has the same length as the original data. In EEMD, a number of different realizations of Gaussian white noise are first added to the original signal:

x^{(i)} = x + w^{(i)}, \quad i = 1, 2, \ldots, N,  (1)

where x denotes the original series, w^{(i)} is the i-th Gaussian white noise to be added, and x^{(i)} is the corresponding noise-added signal.

Then, each noise-added signal x^{(i)} is decomposed by EMD to obtain its IMFs and residue. At last, the IMFs and residues from the different noise-added signals are averaged correspondingly to obtain the final decomposition results:

\overline{\mathrm{IMF}}_k = \frac{1}{N} \sum_{i=1}^{N} \mathrm{IMF}_k^{(i)}, \quad k = 1, 2, \ldots, K,  (2)

where \overline{\mathrm{IMF}}_k is the k-th final decomposition result, \mathrm{IMF}_k^{(i)} is the k-th decomposition result from x^{(i)}, K is the final number of IMFs after averaging, which is determined by the length M of the original time series, and N is the number of realizations.

The added noise changes the distribution of extreme points in the original signal, so that the processed signals are distributed continuously over different time scales, which reduces the probability of mode mixing.
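
In code, the EEMD averaging described above can be sketched as follows; it assumes an emd(x) routine that returns a list of components (for example, the sketch in the EMD description above), and the noise strength and realization number are only illustrative defaults.

```python
import numpy as np


def eemd(x, emd, noise_strength=0.2, realizations=100, seed=0):
    """Ensemble EMD sketch: average the components obtained from noise-added copies of x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    runs = []
    for _ in range(realizations):
        noisy = x + noise_strength * np.std(x) * rng.standard_normal(len(x))  # x^(i) = x + w^(i)
        runs.append(emd(noisy))
    n_comp = min(len(r) for r in runs)            # different runs may yield different numbers of IMFs
    return [np.mean([r[k] for r in runs], axis=0) for k in range(n_comp)]     # averaging as in equation (2)
```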

EEMD can reduce the mode-mixing problem to a certain extent. However, because of the added white noise, the error cannot be completely eliminated after a finite number of averaging operations, which affects the accuracy of the reconstructed sequence and of the prediction. Although the error can be reduced by increasing the number of ensemble realizations, the amount of computation and the computation time then increase greatly. To address this, Torres et al. designed CEEMDAN [38].

In CEEMDAN, Gaussian white noise is added at each decomposition stage to obtain one IMF and the corresponding residue. Let E_k(\cdot) denote the operator that produces the k-th IMF by EMD. The CEEMDAN can be briefly described as follows: (1) like the EEMD, the CEEMDAN decomposes the noise-added copies of the original data and averages them to get the first IMF \widetilde{\mathrm{IMF}}_1 and the first residue r_1 = x - \widetilde{\mathrm{IMF}}_1; (2) the subsequent residues and IMFs are obtained by

r_k = r_{k-1} - \widetilde{\mathrm{IMF}}_k,  (3)

\widetilde{\mathrm{IMF}}_{k+1} = \frac{1}{N} \sum_{i=1}^{N} E_1\left(r_k + \varepsilon_k E_k\left(w^{(i)}\right)\right),  (4)

where w^{(i)} is the i-th Gaussian white noise to be added, E_k(w^{(i)}) is the k-th IMF of w^{(i)} obtained by EMD, and \varepsilon_k is the signal-to-noise ratio between the added noise and the original signal.

The algorithm terminates when the residue satisfies the iterative termination condition, and finally CEEMDAN obtains several IMFs and computes one residue as follows:

R = x - \sum_{k=1}^{K} \widetilde{\mathrm{IMF}}_k,  (5)

where R is the residue obtained from CEEMDAN, x is the original signal, and K is the number of IMFs.

However, CEEMDAN still suffers from two main problems: (1) residual noise contained in the modes and (2) spurious modes. The improved CEEMDAN (ICEEMDAN) was proposed to solve these problems [39]. Let M(\cdot) be the operator that generates the local mean of a signal and E_k(\cdot) be the operator that produces the k-th IMF by EMD. The specific decomposition process is as follows:
Step 1: add noise to the original signal x to construct x^{(i)} = x + \beta_0 E_1(w^{(i)}), i = 1, 2, \ldots, N, where w^{(i)} is the i-th white noise to be added, \beta_0 is the signal-to-noise ratio, and N is the number of added noise realizations.
Step 2: calculate the local means of the x^{(i)} by EMD and average them to get the first residue r_1 = \langle M(x^{(i)}) \rangle; then, obtain the first IMF \widetilde{\mathrm{IMF}}_1 = x - r_1.
Step 3: recursively compute the k-th residue r_k = \langle M(r_{k-1} + \beta_{k-1} E_k(w^{(i)})) \rangle and the k-th IMF \widetilde{\mathrm{IMF}}_k = r_{k-1} - r_k for k = 2, 3, \ldots, where \langle \cdot \rangle denotes averaging over the N realizations.
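
A rough Python sketch of these three steps is given below; it assumes an emd(x) routine returning IMFs plus a residue, realizes the local-mean operator M(·) as a signal minus its first EMD mode, and scales the noise amplitude by the standard deviation of the current residue, a common but simplified choice that may differ from the original ICEEMDAN implementation.

```python
import numpy as np


def iceemdan(x, emd, eps0=0.2, realizations=100, max_imfs=10, seed=0):
    """ICEEMDAN sketch: IMF_k = r_(k-1) - <M(r_(k-1) + beta_(k-1) * E_k(w_i))>."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)

    def local_mean(s):                        # M(s): the signal minus its first EMD mode
        modes = emd(s)
        return s - modes[0] if len(modes) > 1 else s

    # Decompose each white-noise realization once so that E_k(w_i) can be indexed later.
    noise_modes = [emd(rng.standard_normal(len(x))) for _ in range(realizations)]

    imfs, residue = [], x.copy()
    for k in range(max_imfs):
        beta = eps0 * np.std(residue)         # noise amplitude for this stage (assumed scaling)
        means = []
        for modes in noise_modes:
            ek = modes[k] if k < len(modes) else np.zeros(len(x))
            means.append(local_mean(residue + beta * ek))
        new_residue = np.mean(means, axis=0)  # r_k: averaged local mean
        imfs.append(residue - new_residue)    # IMF_k = r_(k-1) - r_k
        residue = new_residue
        if np.std(residue) < 1e-10 * np.std(x):   # crude stopping rule for this sketch
            break
    return imfs + [residue]
```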

The results show that the residual noise problem in the IMFs is greatly reduced, and the averaging problem caused by the different numbers of IMFs generated by EEMD for different realizations is also solved.

2.2. Grey Wolf Optimization (GWO)

Grey wolf optimization (GWO) is a swarm intelligence optimization algorithm proposed by Mirjalili et al. in 2014 [50]. It is derived from a simulation of the predation behavior of a grey wolf pack and achieves optimization through the pack’s tracking, encircling, pursuing, and attacking of prey. The advantages of the GWO algorithm include its simple principle, few parameters to adjust, easy implementation, and strong global search ability.

The wolves are divided into four levels: α represents the dominant wolf in the pack and is at the first level; β represents the subordinate wolf at the second level, which helps α make decisions; δ represents the wolves following the instructions of α and β; and ω represents the wolves at the lowest level. In GWO, the pursuit is led by α, β, and δ, and ω follows the first three to track and encircle the prey and finally completes the predation task. Suppose the number of grey wolves is N and the dimension of the search space is d; then the position of the i-th grey wolf can be expressed as X_i = (x_{i1}, x_{i2}, \ldots, x_{id}). According to the fitness function of a specific optimization problem, the optimal individual is recorded as α, the individuals ranked second and third are recorded as β and δ, and the remaining individuals are recorded as ω. The position of the prey corresponds to the global optimal solution of the optimization problem. Three definitions used in GWO are given as follows.

Definition 1. Distance between a grey wolf and the prey:

D = \left| C \cdot X_p(t) - X(t) \right|,  (6)

where X_p(t) indicates the position of the prey at the t-th generation, X(t) indicates the position of a grey wolf individual at the t-th generation, and C is a swing factor determined by

C = 2 r_1,  (7)

where r_1 is a random vector whose components lie between 0 and 1.

Definition 2. Encircling prey.
In nature, grey wolves always hunt prey by encircling it. The mathematical model is given as follows:

X(t + 1) = X_p(t) - A \cdot D,  (8)

where A is the convergence factor determined by

A = 2 a \cdot r_2 - a,  (9)

where r_2 is a random vector whose components lie between 0 and 1 and a decreases linearly from 2 to 0 as the iteration number increases.

Definition 3. Hunting and capturing prey.
An issue with Definition 2 is that the position of the prey (the optimal solution) in practical optimization problems is unknown. So, in order to simulate the hunting behavior, three types of wolves, namely, α, β, and δ, are defined based on their distance to the prey; they are assumed to have the clearest understanding of the position of the prey: the closer the distance, the better a wolf knows the prey’s position. Their positions are therefore used to estimate the prey’s location and lead the remaining ω wolves to update their own positions. The mathematical expression of hunting is as follows:

D_\alpha = \left| C_1 \cdot X_\alpha - X \right|, \quad D_\beta = \left| C_2 \cdot X_\beta - X \right|, \quad D_\delta = \left| C_3 \cdot X_\delta - X \right|,  (10)–(12)

X_1 = X_\alpha - A_1 \cdot D_\alpha, \quad X_2 = X_\beta - A_2 \cdot D_\beta, \quad X_3 = X_\delta - A_3 \cdot D_\delta,  (13)–(15)

X(t + 1) = \frac{X_1 + X_2 + X_3}{3}.  (16)

During the hunt, the distances between each individual in the pack and α, β, and δ are first calculated by equations (10)–(15), and the direction in which the individual moves toward the prey is then determined by equation (16).
Finally, wolves (searching agents) finish hunting when they capture the prey and the algorithm terminates.
The main idea of GWO can be described on the basis of the above definitions: randomly generate a population of grey wolves in the problem space; evaluate each individual wolf by the fitness function (its distance to the prey according to Definition 1) and nominate the α, β, and δ wolves; then update the location of each wolf according to Definition 3; repeat the evaluation and position updates until the wolves capture the prey [50].
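
The search loop described above can be sketched as follows; the fitness function (a simple sphere function here), bounds, and population size are illustrative, and the per-wolf update follows equations (7), (9), and (10)–(16).

```python
import numpy as np


def gwo(fitness, lb, ub, n_wolves=30, n_iters=100, seed=0):
    """Minimize `fitness` over the box [lb, ub] with grey wolf optimization."""
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, dtype=float), np.asarray(ub, dtype=float)
    dim = len(lb)
    X = lb + rng.random((n_wolves, dim)) * (ub - lb)      # random initial pack
    scores = np.array([fitness(x) for x in X])

    for t in range(n_iters):
        order = np.argsort(scores)
        alpha, beta, delta = X[order[:3]]                  # the three best wolves lead the hunt
        a = 2.0 - 2.0 * t / n_iters                        # a decreases linearly from 2 to 0
        for i in range(n_wolves):
            new_pos = np.zeros(dim)
            for leader in (alpha, beta, delta):
                r1, r2 = rng.random(dim), rng.random(dim)
                A = 2.0 * a * r1 - a                       # equation (9)
                C = 2.0 * r2                               # equation (7)
                D = np.abs(C * leader - X[i])              # equations (10)-(12)
                new_pos += (leader - A * D) / 3.0          # equations (13)-(16)
            X[i] = np.clip(new_pos, lb, ub)
            scores[i] = fitness(X[i])

    best = int(np.argmin(scores))
    return X[best], scores[best]


# Toy usage: minimize the 5-dimensional sphere function
best_x, best_f = gwo(lambda v: float(np.sum(v ** 2)), lb=[-5] * 5, ub=[5] * 5)
print(best_f)
```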

2.3. Extreme Learning Machine (ELM)

The ELM is a single-hidden-layer feedforward neural network (SLFN) [54]. It randomly initializes the input weights and the hidden biases, and only the number of hidden-layer nodes needs to be specified by the user. ELM obtains the unique optimal output weights in a single analytic step and thus trains very fast. ELM has been shown to perform well in both regression and classification problems.

Given a training dataset of size N, \{(x_j, t_j)\}_{j=1}^{N} with x_j \in \mathbb{R}^n and t_j \in \mathbb{R}^m, an ELM regression model with L hidden-layer nodes and activation function G can be expressed as

f_L(x_j) = \sum_{i=1}^{L} \beta_i G(w_i \cdot x_j + b_i) = o_j, \quad j = 1, 2, \ldots, N,  (17)

where w_i is the vector of weights between the input nodes and the i-th hidden-layer node, \beta_i is the vector of weights between the i-th hidden-layer node and the output nodes, b_i is the bias of the i-th hidden-layer node, w_i \cdot x_j is the inner product of w_i and x_j, L is the number of hidden-layer nodes, and o_j is the output when the input is x_j. When the model fits the N samples exactly, i.e.,

\sum_{j=1}^{N} \left\| o_j - t_j \right\| = 0,  (18)

we can get

\sum_{i=1}^{L} \beta_i G(w_i \cdot x_j + b_i) = t_j, \quad j = 1, 2, \ldots, N.  (19)

Equation (19) can be written compactly as

H \beta = T,  (20)

where H is the hidden-layer output matrix with entries H_{ji} = G(w_i \cdot x_j + b_i), \beta = [\beta_1, \ldots, \beta_L]^{T}, and T = [t_1, \ldots, t_N]^{T}. The output weight \beta can be obtained by solving the following linear system:

\left\| H \hat{\beta} - T \right\| = \min_{\beta} \left\| H \beta - T \right\|,  (21)

and the solution is

\hat{\beta} = H^{\dagger} T,  (22)

where H^{\dagger} is the Moore–Penrose pseudoinverse of the hidden-layer output matrix H.

It is proved that equation (22) is the unique minimum norm least-squares solution of equation (21) [54], i.e.,

\left\| \hat{\beta} \right\| = \left\| H^{\dagger} T \right\| \le \left\| \beta \right\| \quad \text{for any } \beta \text{ satisfying } \left\| H \beta - T \right\| = \min_{z} \left\| H z - T \right\|.  (23)

When the number of hidden-layer nodes equals the number of samples, the network can approximate the samples exactly, but in practice, the number of hidden nodes is usually far less than the number of training samples, so the data samples may suffer from multicollinearity. When computing the Moore–Penrose pseudoinverse H^{\dagger}, the existence of multicollinearity may make the involved matrix (nearly) singular. Moreover, each time an ELM is trained, the obtained matrix H is different because the input weights and biases are random, so the output weight \beta of the hidden layer is also inconsistent across runs. These reasons make the output of ELM prone to random fluctuations, and the model’s stability and generalization ability are therefore not ideal.
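
The following is a minimal ELM regressor reflecting equations (17)–(22): random input weights and biases, a sigmoid hidden layer, and output weights obtained from the Moore–Penrose pseudoinverse; the layer size and the toy data are illustrative.

```python
import numpy as np


class ELMRegressor:
    """Single-hidden-layer ELM: random (w_i, b_i), output weights beta = H^+ T (equation (22))."""

    def __init__(self, n_hidden=20, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))       # sigmoid activation G

    def fit(self, X, T):
        X, T = np.asarray(X, dtype=float), np.asarray(T, dtype=float)
        self.W = self.rng.uniform(-1, 1, (X.shape[1], self.n_hidden))  # random input weights w_i
        self.b = self.rng.uniform(-1, 1, self.n_hidden)                # random biases b_i
        H = self._hidden(X)                                            # hidden-layer output matrix H
        self.beta = np.linalg.pinv(H) @ T                              # beta = H^+ T
        return self

    def predict(self, X):
        return self._hidden(np.asarray(X, dtype=float)) @ self.beta


# Toy usage: one-step-ahead prediction of a noisy sine wave from 4 lagged values
rng = np.random.default_rng(1)
y = np.sin(0.1 * np.arange(400)) + 0.05 * rng.standard_normal(400)
lag = 4
X = np.column_stack([y[i:len(y) - lag + i] for i in range(lag)])
target = y[lag:]
model = ELMRegressor(n_hidden=20).fit(X[:300], target[:300])
print(np.sqrt(np.mean((model.predict(X[300:]) - target[300:]) ** 2)))  # test RMSE
```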

2.4. Multiple Kernel Learning (MKL)

To further enhance the generalization ability and stability of ELM, Huang et al. introduced the kernel function into ELM by comparing the principles of ELM and support-vector machine (SVM) and proposed kernel extreme learning machine (KELM) [40].

The objective of ELM is to minimize not only the sum of the training errors but also the norm of the output weights, which can be presented as

\min_{\beta, \xi} \ \frac{1}{2} \left\| \beta \right\|^2 + \frac{C}{2} \sum_{j=1}^{N} \left\| \xi_j \right\|^2, \quad \text{s.t. } h(x_j) \beta = t_j^{T} - \xi_j^{T}, \ j = 1, 2, \ldots, N,  (24)

where \xi_j is the error vector of the m output nodes when the input is x_j, C is the penalty coefficient used to weigh the ratio between structural risk and empirical risk, \beta is the output weight matrix between the hidden layer and the output layer, t_j is the true value with respect to x_j, N is the number of samples, and h(x_j) is the output (row) vector of the hidden layer for input x_j.

According to the KKT theorem, equation (24) is equivalent to the following dual optimization problem:

L = \frac{1}{2} \left\| \beta \right\|^2 + \frac{C}{2} \sum_{j=1}^{N} \left\| \xi_j \right\|^2 - \sum_{j=1}^{N} \alpha_j \left( h(x_j) \beta - t_j^{T} + \xi_j^{T} \right),  (25)

where \alpha_j is the Lagrange multiplier, and the KKT conditions of equation (25) are

\frac{\partial L}{\partial \beta} = 0 \Rightarrow \beta = H^{T} \alpha,  (26)

\frac{\partial L}{\partial \xi_j} = 0 \Rightarrow \alpha_j = C \xi_j, \quad j = 1, 2, \ldots, N,  (27)

\frac{\partial L}{\partial \alpha_j} = 0 \Rightarrow h(x_j) \beta - t_j^{T} + \xi_j^{T} = 0, \quad j = 1, 2, \ldots, N.  (28)

Equation (29) can be derived by substituting equations (26) and (27) into equation (28):

\alpha = \left( \frac{I}{C} + H H^{T} \right)^{-1} T,  (29)

where I is an identity matrix, H is generated by mapping the input samples through the hidden-layer feature mapping h(\cdot), T = [t_1, \ldots, t_N]^{T}, and \alpha = [\alpha_1, \ldots, \alpha_N]^{T}.

From equations (26) and (29), we can get

\beta = H^{T} \left( \frac{I}{C} + H H^{T} \right)^{-1} T.  (30)

The output function is

f(x) = h(x) \beta = h(x) H^{T} \left( \frac{I}{C} + H H^{T} \right)^{-1} T.  (31)

We can define the kernel matrix as

\Omega_{\mathrm{ELM}} = H H^{T}, \quad \Omega_{\mathrm{ELM}}(i, j) = h(x_i) \cdot h(x_j) = K(x_i, x_j),  (32)

where h(\cdot) is the hidden-layer output function and K is a kernel function, whose forms include the RBF kernel, the linear kernel, and the polynomial kernel.

The kernel matrix \Omega_{\mathrm{ELM}} replaces the random matrix H H^{T} in equation (31) and uses the kernel function to map all inputs from the n-dimensional space into a high-dimensional space. Once the kernel parameters are set, the values of the kernel matrix are fixed.

Now, the output function can be rewritten as [40]

f(x) = \left[ K(x, x_1), \ldots, K(x, x_N) \right] \left( \frac{I}{C} + \Omega_{\mathrm{ELM}} \right)^{-1} T.  (33)
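
A compact sketch of training and prediction with a single RBF kernel, following the kernel matrix in equation (32) and the output function in equation (33); the parameter values are illustrative.

```python
import numpy as np


def rbf_kernel(A, B, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all pairs of rows of A and B."""
    d2 = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))


def kelm_fit(X_train, T_train, C=1.0, sigma=1.0):
    """Solve (I/C + Omega)^-1 T once; the result is all that prediction needs."""
    omega = rbf_kernel(X_train, X_train, sigma)                      # kernel matrix Omega_ELM
    return np.linalg.solve(np.eye(len(X_train)) / C + omega, T_train)


def kelm_predict(X_new, X_train, coef, sigma=1.0):
    """f(x) = [K(x, x_1), ..., K(x, x_N)] (I/C + Omega)^-1 T, as in equation (33)."""
    return rbf_kernel(X_new, X_train, sigma) @ coef


# Toy usage on a 1-D regression problem
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
coef = kelm_fit(X[:150], y[:150], C=10.0, sigma=0.5)
pred = kelm_predict(X[150:], X[:150], coef, sigma=0.5)
print(np.sqrt(np.mean((pred - y[150:]) ** 2)))   # test RMSE
```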

In the kernel-based ELM algorithm, any function that satisfies Mercer’s condition can be used as the kernel function, such as the radial basis function (RBF) kernel and the polynomial kernel. Each kernel function usually has its own suitable application fields, and a single kernel function often cannot maximize the representation ability. Therefore, this paper presents a new kernel ELM framework, called multiple kernel ELM (MKELM), that combines different types of kernels. MKELM replaces the single kernel function in KELM with a weighted combination of different kernel functions. The popular single kernels used in machine learning are as follows.

RBF kernel function:

K(x, y) = \exp\left( -\frac{\left\| x - y \right\|^2}{2 \sigma^2} \right)

Polynomial kernel function:

K(x, y) = \left( x \cdot y + c \right)^{d}

Linear kernel function:

K(x, y) = x \cdot y

Wave kernel function:

So, the combined kernel function can be formulated as

K_{\mathrm{comb}}(x, y) = \sum_{i=1}^{n_1} \mu_i K_{\mathrm{RBF}}^{(i)}(x, y) + \sum_{j=1}^{n_2} \nu_j K_{\mathrm{poly}}^{(j)}(x, y) + \sum_{p=1}^{n_3} \lambda_p K_{\mathrm{lin}}^{(p)}(x, y) + \sum_{q=1}^{n_4} \eta_q K_{\mathrm{wave}}^{(q)}(x, y),

where n_1, n_2, n_3, and n_4 represent the numbers of the four types of kernel functions in order, \mu_i represents the weight of the corresponding single RBF kernel, and the same holds for \nu_j, \lambda_p, and \eta_q. The weight coefficients satisfy \sum_i \mu_i + \sum_j \nu_j + \sum_p \lambda_p + \sum_q \eta_q = 1. There are 1, 2, 0, and 3 parameters to be optimized for a single RBF kernel, a single polynomial kernel, a single linear kernel, and a single wave kernel, respectively. So, in the combined kernel, the total number of values to be optimized is the sum of the parameters of the single kernels and the weights of all the kernels, which is (n_1 + 2 n_2 + 3 n_4) + (n_1 + n_2 + n_3 + n_4).
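
As an illustration, the combined kernel matrix can be assembled as a weighted sum of base kernel matrices as sketched below; the packing of the parameter/weight vector theta and the omission of the wave kernel are assumptions made for this sketch, not the exact layout used in the paper.

```python
import numpy as np


def rbf(A, B, sigma):
    d2 = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))


def poly(A, B, c, d):
    return (A @ B.T + c) ** d


def linear(A, B):
    return A @ B.T


def combined_kernel(A, B, theta, n_rbf=2, n_poly=1, n_lin=1):
    """Weighted sum of base kernels. theta packs, in order: one sigma per RBF kernel,
    (c, d) per polynomial kernel, and finally one weight per kernel (normalized to sum to 1)."""
    theta = np.asarray(theta, dtype=float)
    n_kernels = n_rbf + n_poly + n_lin
    sigmas = theta[:n_rbf]
    poly_params = theta[n_rbf:n_rbf + 2 * n_poly]
    weights = theta[-n_kernels:]
    weights = weights / weights.sum()                 # enforce that the weights sum to 1
    K, idx = np.zeros((len(A), len(B))), 0
    for s in sigmas:
        K += weights[idx] * rbf(A, B, s)
        idx += 1
    for j in range(n_poly):
        K += weights[idx] * poly(A, B, poly_params[2 * j], poly_params[2 * j + 1])
        idx += 1
    for _ in range(n_lin):
        K += weights[idx] * linear(A, B)
        idx += 1
    return K


# Toy usage: 2 RBF kernels + 1 polynomial + 1 linear kernel -> theta has 2 + 2 + 4 = 8 entries
A = np.random.default_rng(0).standard_normal((5, 3))
theta = [0.5, 2.0, 1.0, 2.0, 0.4, 0.3, 0.2, 0.1]
print(combined_kernel(A, A, theta).shape)
```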

3. The Proposed ICEEMDAN-GWO-MKELM Approach

3.1. GWO-Based Multiple Kernel Extreme Learning Machine

This paper uses GWO to optimize the weights and parameters for MKELM, namely, GWO-MKELM, which can be described as follows:
Step 1: define the fitness function according to the root mean square error (RMSE):

\mathrm{fitness}(\theta) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}(x_i; \theta) \right)^2},

where N denotes the number of training samples, y_i denotes the actual value of input x_i, \theta is the vector of the parameters and weights in the proposed model that need to be optimized, and \hat{y}(x_i; \theta) is the prediction of the model for x_i with \theta.
Step 2: set the parameters for running GWO, including the maximum number of iterations, the population size, and the upper and lower boundaries for the parameters of the four different types of kernel functions and the regularization parameter. Randomly initialize the position of each wolf between the upper and lower boundaries, and set the fitness values of the α, β, and δ wolves to infinity. Set the iteration counter t = 1. Initialize a, A, and C.
Step 3: for each wolf, if its fitness is less than that of the α wolf, replace the α wolf with it; if its fitness is between those of the α and β wolves, replace the β wolf with it; if its fitness is between those of the β and δ wolves, replace the δ wolf with it.
Step 4: update a, A, and C by equations (7) and (9). Update all the search agents by equation (16). Set t = t + 1.
Step 5: judge whether t is larger than the maximum number of iterations; if not, go to Step 3; if so, stop the iteration and output the position of the α wolf as the optimized parameters and weights for MKELM.
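
In code, the fitness in Step 1 can be realized roughly as below, where a candidate vector theta is decoded into the regularization coefficient and the combined-kernel parameters/weights, a kernel ELM is solved, and the training RMSE is returned; the packing of theta and the helper functions (the combined_kernel and gwo routines sketched earlier) are illustrative assumptions.

```python
import numpy as np


def make_fitness(X_train, y_train, kernel_fn):
    """Return a fitness(theta) callable for GWO: theta -> training RMSE of a kernel ELM.
    Assumed layout: theta[0] is the regularization coefficient C, theta[1:] is the kernel part."""
    def fitness(theta):
        C, kernel_theta = theta[0], theta[1:]
        omega = kernel_fn(X_train, X_train, kernel_theta)                  # combined kernel matrix
        coef = np.linalg.solve(np.eye(len(X_train)) / C + omega, y_train)  # (I/C + Omega)^-1 T
        pred = omega @ coef                                                # KELM outputs on the training set
        return float(np.sqrt(np.mean((pred - y_train) ** 2)))             # RMSE fitness
    return fitness


# Usage sketch with the helpers from earlier sections (bounds are problem-specific):
# fitness = make_fitness(X_train, y_train, combined_kernel)
# best_theta, best_rmse = gwo(fitness, lb=lower_bounds, ub=upper_bounds, n_wolves=30, n_iters=100)
```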

3.2. The Proposed ICEEMDAN-GWO-MKELM Approach

Because the electricity load time series is hard to predict owing to its nonlinear and nonstationary features, this paper adopts a forecasting framework called “decomposition and ensemble.” The framework decomposes the original time series into several relatively simple components, predicts these components separately, and aggregates the results as the final predicted value. It is worth noting that the components obtained by the decomposition effectively preserve the characteristics of the original time series at different levels and can be processed directly by relatively simple methods, so it is feasible to forecast the load from the decomposed components.

The proposed ICEEMDAN-GWO-MKELM consists of three stages, namely, decomposition, individual forecasting, and ensemble, following the very popular framework of “decomposition and ensemble.” The details of the approach are illustrated in Figure 1.

In the first stage, i.e., decomposition, the scheme uses ICEEMDAN to decompose the raw load data into a set of components (K IMFs and one residue). The decomposed components show simpler characteristics than the raw data, making it easier to forecast the fluctuation of the components than to forecast the raw data directly. After that, an individual MKELM model is built on each component. To improve the forecasting accuracy, GWO is applied to optimizing both the parameters and the weights adaptively. In the last stage, the forecasting results of all the components are simply summed as the final result.

From the figure, it can also be seen that the proposed ICEEMDAN-GWO-MKELM is a typical strategy of “divide and conquer.” That is to say, the tough and challenging task of STLF is now divided into some subtasks of forecasting the decomposed components individually. Because the decomposed components show relatively simple features compared with the raw load data, the forecasting accuracy with such components might be significantly improved.
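
Putting the three stages together, the overall flow can be sketched as follows, assuming the decomposition and forecasting helpers sketched in Section 2; the component forecaster and split sizes in the usage comment are placeholders.

```python
import numpy as np


def decompose_forecast_ensemble(series, lag, n_train, decompose, make_forecaster):
    """"Decomposition and ensemble": decompose the series, forecast each component with its own
    model, and sum the component forecasts. `decompose(series)` returns a list of components;
    `make_forecaster(X, t)` returns a predict(X) callable trained on (X, t)."""
    total_pred = 0.0
    for comp in decompose(series):
        X = np.column_stack([comp[i:len(comp) - lag + i] for i in range(lag)])  # lagged inputs
        t = comp[lag:]
        predict = make_forecaster(X[:n_train], t[:n_train])   # one independent model per component
        total_pred = total_pred + predict(X[n_train:])        # aggregate by simple summation
    return total_pred


# Usage sketch with the helpers from earlier sections (names are placeholders):
# forecast = decompose_forecast_ensemble(load, lag=48, n_train=1190,
#                                        decompose=lambda s: iceemdan(s, emd),
#                                        make_forecaster=lambda X, t: ELMRegressor().fit(X, t).predict)
```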

4. Experimental Results

4.1. Data Description

The data of electricity load demand can be accessed from the Australian Energy Market Operator (AEMO) [55]. Specifically, we use the half-hour demand data of 2014 from New South Wales (NSW), Tasmania (TAS), Queensland (QLD), Victoria (VIC), and South Australia (SA) to test and verify the proposed method. For each region, to reflect the difference between seasons, four months were chosen, including three 31-day months (January, July, and October) and one 30-day month (April). The statistics of the used datasets are shown in Table 1. We use the first 80% of the observations as training data and the rest as testing data for each month. The size of a dataset from a 31-day month is 1488, so the numbers of training and testing samples are 1190 and 298, respectively. Likewise, the size for a 30-day month is 1440, so the numbers of the two parts are 1152 and 288.

We use the previous 48 data points to predict the next one, so the horizon is 1 and the lag is 48, which can be formulated as

\hat{x}_{t+h} = f\left( x_t, x_{t-1}, \ldots, x_{t-l+1} \right),

where \hat{x}_{t+h} is the prediction of horizon h at time t, x_t is the actual data at time t, and l is the lag.
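
For instance, the lag-48, horizon-1 samples can be built with a simple sliding window:

```python
import numpy as np


def make_lagged_samples(series, lag=48, horizon=1):
    """Each input row holds the `lag` most recent values; the target is `horizon` steps ahead."""
    series = np.asarray(series, dtype=float)
    n_pairs = len(series) - lag - horizon + 1
    X = np.column_stack([series[i:i + n_pairs] for i in range(lag)])
    y = series[lag + horizon - 1:]
    return X, y


# A 31-day month of half-hour data has 1488 points -> 1440 (input, target) pairs for lag 48, horizon 1
X, y = make_lagged_samples(np.arange(1488.0))
print(X.shape, y.shape)   # (1440, 48) (1440,)
```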

4.2. Evaluation Criteria

We use several widely used indices to evaluate the proposed approach: the mean absolute error (MAE), the mean absolute percent error (MAPE), and the root mean squared error (RMSE), as defined in equations (41)–(43), respectively:

\mathrm{MAE} = \frac{1}{N} \sum_{t=1}^{N} \left| y_t - \hat{y}_t \right|,  (41)

\mathrm{MAPE} = \frac{1}{N} \sum_{t=1}^{N} \left| \frac{y_t - \hat{y}_t}{y_t} \right|,  (42)

\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} \left( y_t - \hat{y}_t \right)^2},  (43)

where N is the number of testing data and y_t and \hat{y}_t are the true value and the predicted value at time t. Obviously, the lower the three criteria, the better the prediction method.
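
These three criteria are straightforward to compute; note that MAPE is reported here as a fraction rather than a percentage, consistent with the values quoted later (e.g., 0.0037).

```python
import numpy as np


def mae(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))                    # equation (41)


def mape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true))         # equation (42), as a fraction


def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))            # equation (43)


print(mae([100, 200], [110, 190]), mape([100, 200], [110, 190]), rmse([100, 200], [110, 190]))
```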

Meanwhile, the Nemenyi test is also conducted to show whether the proposed method is significantly superior to the other methods [56]. The basic idea of the Nemenyi test is to determine whether the algorithms perform similarly according to their average performance on different datasets. First, the algorithms are ranked from best to worst on each dataset according to the evaluation criterion, so that the ordinal value of an algorithm is its position on that dataset, with 1 being the best. If several algorithms share the same position, they receive the same ordinal value, computed as the average of the tied positions, i.e., \frac{1}{n} \sum_{i=0}^{n-1} (k + i) = k + \frac{n-1}{2}, where k is the position and n is the number of algorithms in the same position. Then, the ordinal values of each algorithm are averaged over all datasets to get its average ordinal value. Under a given level of significance, the critical difference of the average ordinal value difference can be calculated by

CD_{\alpha} = q_{\alpha} \sqrt{\frac{k(k+1)}{6N}},

where \alpha is the level of significance, q_{\alpha} is the corresponding critical value, k is the number of algorithms, and N is the number of datasets.

If the average ordinal value difference between two algorithms exceeds CD_{\alpha}, then the assumption that the two algorithms have the same performance is rejected at the significance level \alpha. Here, we take a significance level of 0.05 and divide all prediction methods into two parts: one without decomposition and the other with decomposition. For each part, the Nemenyi test is conducted with RMSE, MAPE, and MAE as the criteria for measuring performance, so a total of 6 tests were performed.
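
The critical difference itself is a one-line computation; the q value used below is the commonly tabulated critical value for a significance level of 0.05 and k = 7 compared algorithms and should be replaced (from a Nemenyi table) if a different k or level is used.

```python
import math


def nemenyi_cd(k, N, q_alpha):
    """Critical difference CD = q_alpha * sqrt(k (k + 1) / (6 N))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * N))


# 7 forecasting models compared over 20 datasets (5 regions x 4 months);
# 2.949 is the commonly tabulated q value for alpha = 0.05 and k = 7 (check a table before reuse).
print(round(nemenyi_cd(k=7, N=20, q_alpha=2.949), 3))
```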

4.3. Experimental Settings

First of all, some state-of-the-art prediction models will be introduced as comparative approaches to verify the validity of the proposed method ICEEMDAN-GWO-MKELM. According to whether the original data are decomposed and how they are decomposed, benchmark methods can be divided into three parts. The first part is methods without decomposition: SVR, ANN, random forest (RF), deep belief network (DBN) [15], ELM, kernel extreme learning machine with single RBF kernel (KELM), and GWO-MKELM. The second part is methods with EMD decomposition: EMD-SVR, EMD-ANN, EMD-RF, EMD-DBN, EMD-ELM, EMD-KELM, and EMD-GWO-MKELM. The third part is methods with ICEEMDAN decomposition: ICEEMDAN-SVR, ICEEMDAN-ANN, ICEEMDAN-RF, ICEEMDAN-DBN, ICEEMDAN-ELM, ICEEMDAN-KELM, and ICEEMDAN-GWO-MKELM. Comparison between each part can reflect the impact of decomposition on the prediction effect of the model, and the comparison within each group can reflect the impact of different prediction methods on the effectiveness.

The parameters for the proposed approach and the compared methods are as follows. For the ANN with back propagation, we set the number of neurons in the hidden layer to 10. We set 100 as the number of trees in RF. For DBN, the batch size is 6 and the epoch number is 100. For ELM, the number of hidden neurons is 20 and we use the sigmoid function as the activation function. For KELM and MKELM, the regularization coefficient is 1. For the ICEEMDAN, the standard deviation of the white noise and the number of ensembles are 0.2 and 100, respectively. The decomposition results of the electricity load demand of NSW in January 2014 by ICEEMDAN are shown in Figure 2. Also, for GWO, the parameters are shown in Table 2.

We apply min-max normalization to preprocess the data before they are fed into the models, as formulated by the following equation:

x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},

where x' is the normalized data, x is the original data, and x_{\min} and x_{\max} are the corresponding minimum and maximum of each data dimension. With this normalization, every data point is mapped into a value in the range of [0, 1].
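
The normalization can be applied per data dimension as follows (the inverse mapping is kept so that forecasts can be rescaled back to load values):

```python
import numpy as np


def min_max_normalize(x):
    """Map each column of x linearly into [0, 1]; also return (min, max) for rescaling back."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    return (x - x_min) / (x_max - x_min), (x_min, x_max)


scaled, (lo, hi) = min_max_normalize(np.array([[8000.0], [9500.0], [11000.0]]))
print(scaled.ravel(), scaled.ravel() * (hi - lo) + lo)   # [0.  0.5 1. ] and the original values
```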

We conduct all the experiments by MATLAB R2016b on 64-bit Windows 10, and the main hardware includes a 3.6 GHz CPU as well as 32 GB RAM.

4.4. Results and Analysis

To validate the proposed ICEEMDAN-GWO-MKELM, we conduct two groups of experiments: single models, which mean performing STLF with the raw load data, and ensemble models, which mean performing STLF with the decomposed components.

4.4.1. Single Models

The results of MAPE, RMSE, and MAE by single models on the different datasets are listed in Table 3, where we use bold and italics to mark the best and the worst results, respectively.

From Table 3, we can find that ELM and KELM obtain the worst results in 43 and 17 out of 60 cases, respectively. It reveals that the kernel trick can improve the forecasting results for ELM to some extent. We can also see that none of the other models obtain the worst results, indicating that the nonkernel ELM and the ELM with a single RBF kernel (KELM) underperform the other models. This may be because both the forecasting ability of the plain ELM and the representation ability of a single RBF kernel are still limited for STLF. However, when we use the multiple kernel ELM optimized with GWO (GWO-MKELM) for STLF, it achieves the best results in 29 out of 60 cases, which indicates that GWO-MKELM outperforms the other single models significantly. The possible reason is that GWO-MKELM can optimize the weights and parameters of the multiple kernels for ELM to improve the representation ability of the kernels. SVR and DBN achieve the best results 22 and 10 times, ranked second and third, respectively. ANN and RF obtain similar results in most cases.

We further use the Nemenyi test with RMSE, MAPE, and MAE to compare the models, as shown in Figures 3–5, respectively. Regarding Figure 3, the proposed GWO-MKELM is ranked first, with a value of 2, followed by DBN and SVR with 2.2 and 2.4, respectively. The ELM is ranked last, with a value of 6.6. Therefore, the results show that GWO-MKELM is the best single model in terms of the Nemenyi test of RMSE. When we look at the Nemenyi tests of MAPE and MAE, we can see that both tests are almost the same. Specifically, SVR is ranked first with a value of 2, followed by GWO-MKELM and DBN with values of 2.1 and 2.15 in both tests, respectively. Once again, the ELM is ranked last in both tests.

All the experimental results with single models show that none of the single models can always outperform others, although GWO-MKELM achieves the best results the most times. Therefore, to improve the accuracy of STLF, a possible way is to adopt the “decomposition and ensemble” framework.

4.4.2. Ensemble Models

To demonstrate the performance of ICEEMDAN, we utilize EMD as a compared decomposition method. We still apply the seven forecasting approaches used in the single models to conduct individual forecasting for both decomposition methods, so there are a total of 14 ensemble forecasting models. The results of the ensemble models are listed in Table 4.

From the table, it can be seen that the ensemble models with EMD and ICEEMDAN obtain the worst results, i.e., 51 and 9 out of 60 times, respectively, while all the best results of the 60 cases are achieved by the models associated with ICEEMDAN. Among the ensemble models with EMD, EMD-ELM and EMD-KELM obtain the worst results 23 and 28 times, respectively. The possible reason is that nonkernel ELM and ELM with a single RBF kernel lack good forecasting ability, so it is necessary to use multiple kernels to improve such ability. If we further investigate each individual forecasting method, we can find that the results with ICEEMDAN usually outperform those with EMD, confirming that the components decomposed by ICEEMDAN are more powerful than those by EMD for STLF.

The best results of all cases are associated with ICEEMDAN, among which GWO-MKELM achieves the best results in 57 out of 60 cases while the remaining three best results are achieved by SVR. Regarding the individual forecasting models, DBN and RF obtain neither the best nor the worst results and their performance is at an intermediate level. ANN obtains the worst results 3 times. Note that although KELM with a single RBF cannot improve the forecasting ability of ELM, the proposed GWO-MKELM can significantly outperform both ELM and KELM, showing that the proposed multiple kernel learning framework and the GWO-MKELM are very effective for STLF.

When we look at the best result of each evaluation criterion, we can find that the proposed ICEEMDAN-GWO-MKELM is obviously advantageous over the compared methods in most cases. Taking the NSW data in January as an example, the ICEEMDAN-GWO-MKELM achieves the best MAPE of 0.0037, which is far below the second best result of 0.0076 achieved by the EMD-SVR and the ICEEMDAN-SVR. Likewise, the best MAE by the ICEEMDAN-GWO-MKELM is 30.2777, which is less than half of the second best MAE by the ICEEMDAN-SVR. The MAE results of some other methods are an order of magnitude larger than that of the ICEEMDAN-GWO-MKELM. The results on RMSE show a similar trend.

Again, we use the Nemenyi test to compare all the involved ensemble models. The results regarding RMSE, MAPE, and MAE can be found in Figures 6–8, respectively. In all the figures, the proposed ICEEMDAN-GWO-MKELM is ranked first with a fixed value of 1.05, followed by ICEEMDAN-SVR. In addition, EMD-KELM is still ranked last in all cases, showing that it has the worst ability for STLF when compared with the other ensemble models.

From the above analysis, we can see that (1) ensemble models outperform single models for STLF; (2) regarding the decomposition in ensemble models, ICEEMDAN is superior to EMD; (3) GWO is very useful for optimizing weights and parameters in MKELM simultaneously; and (4) the proposed ICEEMDAN-GWO-MKELM is very effective for STLF.

4.5. Discussion

In this subsection, we will study the influence of parameter settings of ICEEMDAN, lag order, GWO, and KELM. We will also discuss the running time of the forecasting models associated with ICEEMDAN. All the discussion is with the NSW dataset in JAN.

4.5.1. The Influence of Parameters for ICEEMDAN

The realization number and the noise strength are two key parameters for the ICEEMDAN. Here, we study their influence on STLF. First, we investigate the influence of the realization number while fixing the other parameters as set in Section 4.3. The results are shown in Figure 9.

From this figure, it can be seen that when the realization number is less than 75, all the evaluation criteria are poor, while when it is equal to or greater than 75, all the results are rather stable. In particular, when it equals 1000, all the evaluation criteria obtain the best values. Therefore, to balance the computation cost and effectiveness, a value equal to or greater than 75 is reasonable for the realization number of the ICEEMDAN.

For the white noise strength, when the value is less than 0.09, the results are rather bad. However, when the value is equal to or greater than 0.09, the results show obvious stability, as shown in Figure 10. More specifically, the proposed approach achieves the best results when the white noise strength equals 0.2.

4.5.2. The Influence of the Lag Order

The lag order determines the length of the input data. A large lag order requires more computational resources, while a small one may result in poor forecasting results. We test a range of lag orders, and the results are shown in Figure 11.

We can see that when the lag orders are equal to or less than 48, the results are very close, while after that, the results become worse and worse with the increase of the lag order. Because the experiments use half-hour load data and it just has 48 data points for one day, 48 is a reasonable lag order to balance the computation resource and forecasting results.

4.5.3. The Influence of GWO-MKELM’s Parameter Settings

In the proposed approach, the combined kernel is built from four types of kernels, among which the number of RBF kernels is usually the largest. Here, we explore the influence of different numbers of RBF kernels, as shown in Figure 12. It can be seen that the proposed approach achieves similar results when the number of RBF kernels is less than 60, and after that, the results become worse with the increase in the number of RBF kernels. An ideal value for this parameter is therefore below 60.

The iteration number and search agent number (population size) are two key parameters for GWO, whose impacts are shown in Figures 13 and 14, respectively.

From Figure 13, we can see that when the number of iterations varies from 5 to 15, the experimental results become better and better. When the iteration number is equal to 20, the results become slightly worse, and then the results remain stable as the iteration number increases further. Likewise, the number of search agents has a relatively stable impact on the forecasting results. The optimal number of search agents is 50, and for other numbers of search agents, the forecasting results are very close, as shown in Figure 14.

4.5.4. Running Time

We report the running time of the methods associated with ICEEMDAN in Table 5. It can be seen that ELM and DBN take the least time for training and testing, respectively. The GWO-MKELM takes the most time to train the forecasting model and test the samples because it has to perform matrix operations many times when evaluating each individual in the wolf pack. Fortunately, the training process can be conducted offline, and it needs to be performed only once. With the optimal weights and parameters optimized by GWO, the running time for testing the 298 samples is 0.5156 seconds. In other words, it takes about 0.0017 seconds to test one sample, which is acceptable in real-world applications.

5. Conclusions

STLF plays an important role in smart grids. Because of the nonstationarity and nonlinearity of the raw load data, STLF is a challenging task. To address this issue, we propose a novel approach for STLF based on the well-known “decomposition and ensemble” framework. We first present a novel multiple kernel learning approach that uses GWO to optimize the weights and the parameters of every single kernel in the combined kernel for ELM simultaneously. After that, a new approach integrating ICEEMDAN, GWO, and multiple kernel ELM, namely, ICEEMDAN-GWO-MKELM, is proposed for STLF. We also compare the proposed approach with some state-of-the-art forecasting methods on publicly accessible datasets containing half-hour load data of 4 months from 5 regions in Australia. From the experimental results, we can draw the following conclusions:
(1) Ensemble models significantly outperform the corresponding single models because the decomposed components can better express the intrinsic nature of the load data than the raw signal.
(2) ICEEMDAN is superior to EMD for the decomposition of the load data.
(3) The GWO-MKELM outperforms the ELM with a single kernel (KELM) as well as the nonkernel ELM for load forecasting.
(4) The proposed ICEEMDAN-GWO-MKELM approach is advantageous over the state-of-the-art competitive methods in terms of the evaluation criteria, showing that the ICEEMDAN-GWO-MKELM is very promising for STLF.

One limitation of the proposed ICEEMDAN-GWO-MKELM is that it has to evaluate each wolf and many kernel matrices must be computed. Hence, more time is needed during the training phase. Fortunately, the training process can be performed offline. Therefore, once the forecasting model is built, the running time for testing is acceptable.

In the future, we will apply the ICEEMDAN-GWO-MKELM to forecasting other energy time series, such as electricity prices, natural gas prices, and wind speed.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Ministry of Education of Humanities and Social Science Project (Grant no. 19YJAZH047), the Fundamental Research Funds for the Central Universities (Grant no. JBK2003001), and the Scientific Research Fund of Sichuan Provincial Education Department (Grant no. 17ZB0433).

Supplementary Materials

All the experimental data used in this manuscript are stored in an EXCEL file and are included in the “Optional Supplementary Materials.” All the data were accessed and downloaded via the Australian Energy Market Operator 2014 (“https://www.aemo.com.au/Electricity/National-Electricity-Market-NEM/Data-dashboard#aggregated-data”) on March 14, 2019. The EXCEL file contains 21 sheets. Please refer to README sheet for detailed information. Each of the other sheets contains half-hour load data of one region in one month. For example, the sheet “NSW-Jan” contains half-hour load data of New South Wales in January 2014.