1 Introduction

An efficient, robust and smart power grid is a key driving force for the development of the energy sector in the twenty-first century [1]. The smart grid plays a vital role in improving efficiency and reducing carbon emissions. In recent years, it has become an emerging trend because of its flexible and reliable energy distribution, enabled by duplex communication between the supplier control hub and the smart meters at the consumer end [2,3,4]. A smart meter monitors power consumption through two-way communication, so consumers can access detailed information about their usage and the quality of service. These data inform customers about their overall power consumption, how they consume power, and ways to reduce it [5,6,7,8,9]. Similarly, power companies use the finer resolution of the data to supply loads during peak time. Hence, the ultimate aim of the smart meter is not only to increase the effectiveness of power management but also to integrate new power generation techniques at the distribution level. Moreover, it reduces excessive power consumption and alerts people about it [10]. Nowadays, advanced smart meters monitor voltage disturbances, harmonics and power factor, which assists power companies in better understanding power quality (PQ) [11].

The quality of power and its assurance have become a prime concern for the utility sector in the recent past. Equipment at the consumer end is highly sensitive to numerous PQ problems, which also have a negative impact on the power supply system [12, 13]. Multifarious power system disturbances such as notches, transients, momentary interruptions, voltage swell and sag, under-voltage, over-voltage and harmonics are caused by poor quality of the power supply [14, 15]. Since the majority of industrial electronic instruments are highly sensitive to PQ issues, a cost-effective PQ monitoring system is essential to keep them away from malfunction and unnecessary expense [16]. At the same time, both the utility and supply sides need to know the root of a disturbance before taking suitable mitigating action toward an energy-efficient system with good electric PQ. Continuous in-situ observation of PQ enables consumers to inform the distribution companies about such issues. Therefore, real-time on-line quality monitoring and assurance is the only way to support corrective actions such as reduction of consumption, power factor improvement and load demand balancing.

In this paper, we propose a combination of the discrete wavelet transform (DWT) and two machine learning-based algorithms to extract features from the power signal and detect PQ issues. Continuous data from a voltage transformer are given as input to wavelet transform (WT) filters, and the output feature set is fed to a model trained by a one-class support vector machine (OCSVM). Any disturbance is detected and tagged as an abnormality. When a disturbance is detected, a multi-class support vector machine (SVM) then analyzes the corrupted data to determine the disturbance type.

In the literature, various solutions for this problem are presented, mainly composed of two steps: \(\textcircled {1}\) extracting useful information from the waveform as a feature set; \(\textcircled {2}\) training a model on the provided feature sets and labeled data for supervised pattern recognition. However, the diversity of driver events and components of the distribution network, along with the scarcity of abnormal data of all kinds during the training period, hinder the effective application of the proposed methods. Moreover, PQ detection at the distribution level, such as the meter level, requires a lightweight but robust method due to the limited computational resources of smart meters. Following the aforementioned data-driven pipeline, the majority of research papers define and simulate the possible disturbance patterns and then train their models on the simulated data. While these methods can detect abnormalities of the defined patterns, they fail to capture unknown types of abnormalities. In this paper, focusing on real power data, we redress this shortcoming by applying a semi-supervised technique.

Our proposed method applies a cascaded two-level classification algorithm to a simulated data set of power voltage after carefully preprocessing the data. The contributions of our study are:

  1. 1)

    To develop a real-time PQ monitoring system at the smart meter level.

  2. 2)

    To detect and classify any type of disturbance, even novel forms.

  3. 3)

    To provide a lightweight but robust method for PQ assessment.

The rest of this paper is organized as follows: In Section 2, the literature review and the preliminaries of our applied techniques are described. The system model is presented in Section 3. The simulation results are illustrated in Section 4. Finally, a brief conclusion is included in Section 5.

2 Literature review

In the literature, several research directions can be found for PQ disturbance detection. One direction explores efficient, accurate and high-speed techniques for feature extraction from signals using methods such as the Fourier transform (FT), S-transform (ST), WT, root mean square (RMS), fast Fourier transform (FFT), and fast dyadic Fourier transform (FDFT) [17,18,19,20]. Another direction investigates the optimal subsets of all the features extracted from the signal [21, 22]. PQ disturbance detection and classification is performed using diverse artificial intelligence (AI) and machine learning (ML) techniques such as fuzzy logic [23], neural networks [24], SVM [25, 26], decision trees [27, 28], expert systems [29,30,31], and hidden Markov models [32]. Most researchers considered not only classification and anomaly detection performance but also computational efficiency and processing speed [27].

The FT, one of the earliest techniques for signal analysis, can detect the existence of a specific frequency in power waveforms. However, it is incapable of recognizing the time-evolving behavior of non-stationary signals [33]. The short-time Fourier transform (STFT) was proposed afterwards to solve this problem. Although the STFT, or windowed FT, improves on the FT, it suffers from a fixed window width: a fixed window size cannot analyze the low-frequency and high-frequency content of transient signals at the same time, which is important for disturbance detection [34]. Later, the WT gained popularity among researchers because of its capability to analyze the non-stationary roots of various power system disturbances [35, 36].

In [37], the authors applied the WT as a first step to remove noise from the signal. Following this, parameters such as peak value and period were calculated by FFT, which helps to identify the disturbance types. Gaouda et al. exploited the WT combined with the K-nearest neighbor (KNN) algorithm for signal feature extraction and classification [38]. Although this method performed well at low noise levels, its accuracy degraded as the noise level increased (i.e. above 0.5%). In [39], the authors proposed a less computationally expensive algorithm which extracted signal features by two relatively simple and quick methods, the discrete Fourier transform (DFT) and RMS. Based on a limited set of features and a rule-based decision tree, they detected and identified nine categories of signal disturbances in real time.

One of the earliest approaches to PQ disturbance detection conducts a point-to-point comparison of the signal over a cycle. Despite its simplicity, it performs poorly when the disturbance pattern is repetitive [40]. Ghosh and Lubkeman pioneered the use of artificial neural networks (ANN) for automatic waveform classification [41], applying two different variations of neural networks to the unprocessed signal. Later, in 2001, a rule-based system was introduced in [42] for disturbance classification. Although this method is simple to implement, its model cannot be generalized easily, and as the number of disturbance types and patterns grows, the increasing number of “If” and “Else” rules hinders the system’s efficiency and capacity.

In [26], the authors utilized an SVM with a radial basis function (RBF) kernel to detect disturbance patterns in three-phase simulated signals. The results show promising performance with acceptable accuracy. In [25], Axelberg et al. applied the SVM model to classify voltage disturbances on real and synthetic data. Features extracted from the data, such as minimum RMS voltage, harmonic components, symmetric components, RMS voltage at selected time instants, total harmonic distortion, and the duration of the disturbance, composed an informative sample space [25, 43]. In [44], weighted SVM (WSVM), FT and WT were combined to classify five disturbance categories. As both the selected features and the tuning parameters influence classification performance, Moravej et al. incorporated a two-stage feature selection combining SVM and digital signal processing (DSP) techniques [21]. Before applying the SVM to the feature set, they performed mutual information feature selection (MIFS) and correlation feature selection (CFS) to select prominent features and eliminate redundancy. Eristi et al., in a similar approach, added feature extraction and selection methods along with the WT, aiming to reduce the feature space in order to reach higher accuracy and lower resource consumption [45]. However, all these approaches target mostly the transmission and generation levels of the grid, where large computational capacity and resources are available.

In order to analyze PQ at the distribution level, utility companies look for techniques that work at the smart meter level. But meters have limited memory and processing capacity, so the signal processing must be computationally lightweight and fast. In this regard, very few works can be found in the literature. Borges et al. proposed a technique that is executable inside the smart meter [27]. They extracted some features using the FFT, which is computationally cheap and routinely integrated in hardware, and captured the remaining features in the time domain. To detect the disturbance type, they exploited decision trees and ANNs and reported precision rates higher than 90%. The entire process was designed to be embedded in a smart meter.

Reviewing the above studies, we propose a new SVM-based method for PQ detection in smart meters. Being computationally lightweight and guaranteed to reach a global optimum, our technique is well suited to smart meters [46]. Our method not only detects disturbances but also classifies them with higher precision than other methods such as Isolation Forest. Besides that, it provides real-time PQ monitoring at the meter level and can detect any kind of new disturbance.

2.1 PQ issues

In the power system, even a short period of disturbance can lead to a huge amount of power loss. Hence, PQ monitoring has become one of the major services of power companies. One way of guaranteeing PQ is precise monitoring of the waveform, capturing any form of distortion and rooting out its cause.

Some common PQ interferences and their effects at distribution sites are described here to emphasize the importance of accurate online disruption detection.

  1. 1)

    Unbalanced waveform: frequency is kept almost steady in large interconnected distribution networks, and changes are a rare incident. However, frequency deviation is frequent in smaller networks, especially those supplied by on-site generators. The main reason is the lower inertia constant caused by the reduced number of generators connected to the power system. Frequency deviation may damage electrical equipment and has its worst impact on the speed of motor-driven clocks.

  2. 2)

    Transients: transients are unexpected deviations of voltage or current from their rated values. Their duration is very short, typically from 200 \(\upmu \hbox {s}\) to 1 s. Lightning strikes, electrostatic discharge (ESD), poor grounding, load switching, and faulty wiring are the main causes. Transients can delete or alter computer data and cause calculation errors that are hard to identify. In severe cases, they can damage electronic instruments and hamper power system operations.

  3. 3)

    Voltage sag: a voltage sag is a short-duration drop in RMS voltage; a drop lasting longer than two minutes is defined as an undervoltage. Common causes of undervoltages and voltage sags are faults (short circuits) on the power system, weather factors, motor starting, and the switching-in of customer loads and large loads. Sags can shut down computers and other sensitive equipment within a moment, and undervoltage conditions can impair certain types of electrical instruments.

  4. 4)

    Voltage swell: a voltage swell is a momentary rise in voltage magnitude; a rise lasting longer than two minutes is defined as an overvoltage. Overvoltages and voltage swells are typically generated by power line switching and large load variations. If the voltage rise is too high, it may destroy electrical instruments and shut down power systems. The consumer’s voltage-regulating devices cannot act quickly enough to protect against all swells or sags.

  5. 5)

    Interruption waveform: an interruption occurs in the power system when the voltage magnitude drops to zero. Interruptions are categorized as long-term, temporary or momentary. Momentary interruptions happen when the utility supply is interrupted and automatically restored within a short duration (less than 2 s).

Researchers have considered diverse types of disturbances and they trained their models based on their predefined types. Uyar et al. [47], Koleva [48] and Kostadinov [49] defined 6 classes of disturbances: sag, swell, outage, harmonic, swell with harmonic and sag with harmonic. Sahani [50] composed 9 classes, including momentary interruption, sag, swell, harmonics, flicker, notch, spike, transient, and sag with harmonics. Khokhar [51] presented 6 more categories in addition to [52] which are swell with harmonics, interruption and harmonics, impulsive transient, flicker with harmonics, flicker with swell, and flicker with sag.

Given that various types of disturbances can occur together and create new disturbance categories, defining a closed set of disturbances has the obvious drawback of missing undefined and unknown combinations of anomalies [52]. This is one of the limitations of many proposed automatic disturbance detection and classification techniques. To address this issue, and considering that unknown disturbances can occur together and result in new forms of disturbance with varying degrees of noise, our model (OCSVM) is trained exclusively on the normal dataset. Training a semi-supervised model on the abundant set of normal data gives it the ability to detect abnormalities with almost 93% accuracy. After the OCSVM detects a disturbance, a complementary multi-class classification model captures the correct disturbance label. This method will at least detect a disturbance even if it fails to capture the appropriate type, which is highly advantageous in uncertain, real environments where the types of abnormality cannot be accurately predicted before the system is deployed, or where not enough samples of the different abnormalities are available for training.

2.2 WT

The WT has emerged as a useful tool for representing a signal in the time-frequency domain. It is superior to the FT when the frequency content of a signal is non-stationary, as it provides both the time and frequency information required for extracting transient information from non-stationary signals. The WT has multiple implementations, such as the DWT, the continuous wavelet transform (CWT) and the wavelet packet transform (WPT). Among these, the DWT and WPT have been applied to real-world problems.

In our study, we apply the DWT, one of the most practical types of WT. The DWT decomposes a time series signal S(t) into detail coefficients and approximation coefficients [53]: the low-pass and high-pass filters yield the approximation and detail coefficients, respectively. The output of the low-pass filter is further decomposed into level-2 approximation and detail coefficients, as shown in Fig. 1.

Fig. 1
figure 1

Two-level filter analysis in DWT
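The filter-bank recursion of Fig. 1 can be sketched in a few lines. The sketch below is a minimal pure-Python illustration using the Haar wavelet (the shortest orthonormal filter pair) rather than the Daubechies filters used later in the paper; it only shows how each level's approximation is fed back into the next level.

```python
# Two-level DWT filter bank (cf. Fig. 1), illustrated with the Haar
# wavelet. Each level halves the signal: the low-pass branch gives the
# approximation, the high-pass branch the detail coefficients.

def haar_dwt_step(signal):
    """One DWT level: pairwise sums (low-pass) and differences (high-pass)."""
    s = 2 ** -0.5  # orthonormal Haar scaling factor
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    return approx, detail

def dwt(signal, levels):
    """Decompose into [cA_levels, cD_levels, ..., cD_1], re-filtering the
    approximation at every level as in Fig. 1."""
    coeffs = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_dwt_step(approx)
        coeffs.insert(0, detail)
    coeffs.insert(0, approx)
    return coeffs

cA2, cD2, cD1 = dwt([4, 6, 10, 12, 8, 6, 5, 5], levels=2)
```

The level-2 approximation `cA2` is a coarse, smoothed version of the input, while `cD1` and `cD2` hold the high-frequency content where transient disturbances show up.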

2.3 Signal features extraction and selection

Feature extraction and selection are two important steps in signal classification, since an effective feature set heavily impacts classifier performance. Feature selection returns a subset of the original features, while feature extraction produces new features from the signal’s native attributes.

Among a signal’s many native features, two types, irrelevant and redundant, can be removed without information loss [54]. Irrelevant and redundant features are separate ideas: a pertinent feature can become redundant in the presence of another feature with which it is strongly correlated [55]. Although wavelet and multi-resolution analysis (MRA) of the signal extract significant information, applying a classifier to the resulting large feature set is inefficient and sometimes misleading. An optimized set of distinct features must be extracted and selected so as to reduce the dimension of the feature vector and maximize classification performance.

Many features, such as entropy (Ent), energy (E), mean value (M), standard deviation (Sd), and RMS, have been widely used in the literature as the most informative and discriminating features. After carefully evaluating these studies [14, 52, 56,57,58], we adopt the feature set provided by the probabilistic neural network based artificial bee colony (PNN-ABC) optimal feature selection algorithm of Khokhar et al. [51]. The favored features are [E (d1), Kurtosis (KT)(d2), RMS (d3), Skewness (SK)(d4), SK (d5), E (d6), RMS (d7), Ent (d8), KT (d8)], extracted with the Daubechies mother wavelet at level 4 and at different levels of decomposition. These features are calculated for all 3 phases of the voltage signal; the three per-phase attribute sets are concatenated linearly to compose a row vector corresponding to one 3-phase voltage signal.
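The per-band statistics named above can be computed directly from the coefficients of one decomposition band. The sketch below is illustrative only; in particular, the entropy definition varies across papers, and Shannon entropy over the normalized squared coefficients is assumed here.

```python
import math

def band_features(coeffs):
    """Statistical features of one wavelet band, in the spirit of the
    PNN-ABC set: energy, kurtosis, RMS, skewness and entropy."""
    n = len(coeffs)
    mean = sum(coeffs) / n
    var = sum((c - mean) ** 2 for c in coeffs) / n
    sd = math.sqrt(var)
    energy = sum(c * c for c in coeffs)
    rms = math.sqrt(energy / n)
    skew = sum((c - mean) ** 3 for c in coeffs) / (n * sd ** 3) if sd else 0.0
    kurt = sum((c - mean) ** 4 for c in coeffs) / (n * var ** 2) if var else 0.0
    # Shannon entropy over normalized squared coefficients (one common
    # convention; the paper does not fix the exact definition)
    p = [c * c / energy for c in coeffs if c != 0]
    entropy = -sum(pi * math.log(pi) for pi in p)
    return {"E": energy, "KT": kurt, "RMS": rms, "SK": skew, "Ent": entropy}

feats = band_features([1.0, -1.0, 2.0, -2.0])
```

Calling `band_features` on each selected band (d1 through d8) and concatenating the chosen entries per phase yields one row vector per 3-phase signal.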

Let us consider a signal S(t) fed into DWT for disturbance detection and classification. To detect disturbance events in the signal, recognizing factor (\(R_f\)) [6] can be calculated as:

$$\begin{aligned} R_f=\sqrt{\frac{\sum \limits _{y=1}^M 2^y \sum \limits _{x=1}^{M_{d_y}}d_y^2(x)}{2^M \sum \limits _{x=1}^{M_{c_M}} c_M^2(x)}} \end{aligned}$$
(1)

where M is the highest decomposition level; \(M_{d_y}\) and \(d_y\) are the number of detail coefficients and the detail coefficients at level y, respectively; \(M_{c_M}\) and \(c_M\) are the number of approximation coefficients and the approximation coefficients at level M, respectively.

If \(R_f >1\%\), classification proceeds by calculating the wavelet coefficients. For \(R_f <1\%\), further calculation is halted, which prevents unnecessary computation.
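The recognizing factor and its 1% gate can be transcribed directly from (1), assuming the detail and approximation coefficients are available as plain lists (a hypothetical representation for illustration):

```python
import math

def recognizing_factor(details, approx_m):
    """R_f from (1): `details` is [d_1, ..., d_M] (one list of detail
    coefficients per level), `approx_m` the level-M approximation."""
    num = sum(2 ** y * sum(d * d for d in d_y)
              for y, d_y in enumerate(details, start=1))
    den = 2 ** len(details) * sum(c * c for c in approx_m)
    return math.sqrt(num / den)

def needs_classification(details, approx_m, threshold=0.01):
    """Run the full wavelet-coefficient classification only when R_f > 1%,
    skipping unnecessary computation otherwise."""
    return recognizing_factor(details, approx_m) > threshold
```

For a clean sinusoid the detail energy, and hence \(R_f\), stays near zero, so the expensive classification stage is bypassed.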

The signal S(t) is resolved into M parts using the DWT as follows:

$$\begin{aligned} S(t)=S_{c_1}(t) + S_{d_1}(t)+S_{d_2}(t)+ \cdots +S_{d_M}(t) \end{aligned}$$
(2)

where \(S_{c_1}(t), S_{d_1}(t), S_{d_2}(t), \ldots , S_{d_M}(t)\) are the decomposed components of S(t). We can further express S(t) as:

$$\begin{aligned} S(t)=\sum _{i}^{} c_1 (i) \phi (t-i) + \sum _i \sum _{j=1}^{M} d_j(i) 2^{j/2} \varPsi (2^j t-i) \end{aligned}$$
(3)

where j is the level of resolution; \(d_j\) is the detail coefficient at level j; \(\phi ,\varPsi \in \mathbf{R }\). According to MRA, a set of nested subspaces \(V_j\) and \(W_j\) is calculated as below:

$$\begin{aligned} \left\{ \begin{array}{lll} V_M \supset V_{M-1} \supset \cdots \supset V_j \supset \cdots \supset V_2 \supset V_1 \\ V_{j+1}=V_j \oplus W_j \\ V_j \cap W_j= \varnothing \\ \end{array} \right. \end{aligned}$$
(4)

where \(\oplus\) denotes the direct sum of two subspaces.

The input signal S(t) is resolved into components corresponding to the subspaces \(V_1\) and \(W_j\), respectively:

$$\begin{aligned} S_{c_1}(t)= & {} \sum _i c_1(i) \phi (t-i) \end{aligned}$$
(5)
$$\begin{aligned} S_{d_j}(t)= & {} \sum _i d_j(i) 2^{j/2} \varPsi (2^j t-i) \end{aligned}$$
(6)

The average absolute value of the detail coefficients of the signal S(t) at decomposition level j \((|{\bar{S}}_{d_j}|)\) is:

$$\begin{aligned} |{\bar{S}}_{d_j}|= \frac{1}{2^j M_{d_j}} \sum _i | d_j(i)| \end{aligned}$$
(7)

where \(M_{d_j}\) is the number of detail coefficients at level j.

Accordingly, the average values of the input signal at the individual decomposition levels, calculated from the detail and approximation coefficients, are:

$$\begin{aligned} {\bar{{\varvec{{S}}}}}(t)=[ |{\bar{S}}_{c_1}(t)|,|{\bar{S}}_{d_1}(t)|,\, \ldots \, ,|{\bar{S}}_{d_M}(t)| ] \end{aligned}$$
(8)

The standard deviation (SD) of the detail coefficients’ absolute values at level j \((\sigma _{S_{d_j}}(t))\) is:

$$\begin{aligned} \sigma _{S_{d_j}(t)}=\sqrt{\frac{1}{2^j M_{d_j}}\sum _i \big (|d_j(i)|- |{\bar{S}}_{d_j}|\big )^2} \end{aligned}$$
(9)

The SDs of the detail coefficients’ absolute values at the individual decomposition levels are:

$$\begin{aligned} {\varvec{\sigma }}_{S_{d_j}(t)}=[\sigma _{S_{d_1}(t)},\, \sigma _{S_{d_2}(t)},\, \ldots \, ,\sigma _{S_{d_M}(t)}] \end{aligned}$$
(10)

After sampling at a rate of 20 kHz, the feature vector (wavelet network (WN) input) [6] is:

$$\begin{aligned} {{\varvec{x}}}_1= \frac{\varvec{\sigma }_{S_{d_{1,2,3}}(t)}}{|{\bar{S}}_{d_{1,2,3}}(t)|} \end{aligned}$$
(11)

where \({\varvec{x}}_1\) is the ratio of the standard deviations of the detail coefficients of the input signal S(t) at decomposition levels 1, 2 and 3 \(({\varvec \sigma} _{S_{d_{1,2,3}}(t)})\) to the average absolute values of the detail coefficients at the same levels (\(|{\bar{S}}_{d_{1,2,3}}(t)|\)). This captures the variation of the detail coefficients at levels 1, 2 and 3 around their average values without regard to the magnitude of those coefficients. Moreover, this normalization of the detail data both reduces the dimension of the WN input and preserves the significant characteristics of the input signals. In the same way, vectors \({\varvec{x}}_2\), \({\varvec{x}}_3\) and \({\varvec{x}}_4\) are defined as:

$$\begin{aligned} {\varvec{x}}_2= & {} \frac{\varvec{\sigma }_{S_{d_{4,5,6}}(t)}}{|{\bar{S}}_{d_{4,5,6}}(t)|} \end{aligned}$$
(12)
$$\begin{aligned} {\varvec{x}}_3= & {} \frac{\varvec{\sigma }_{S_{d_{7,8}}(t)}}{|{\bar{S}}_{d_{7,8}}(t)|} \end{aligned}$$
(13)
$$\begin{aligned} {\varvec{x}}_4= & {} \frac{\varvec{\sigma }_{S_{d_{9,10,11,12}}(t)}}{|{\bar{S}}_{d_{9,10,11,12}}(t)|} \end{aligned}$$
(14)

The \({\varvec{x}}_5\), \({\varvec{x}}_6\), \({\varvec{x}}_7\) and \({\varvec{x}}_8\) inputs are calculated as follows:

$$\begin{aligned} \left\{ \begin{array}{lr} {\varvec{x}}_5=\max |S_{d_2}|\\ {\varvec{x}}_6=\max |S_{d_5}|\\ {\varvec{x}}_7=\max |S_{d_8}|\\ {\varvec{x}}_8=\max |S_{d_{12}}|\\ \end{array} \right. \end{aligned}$$
(15)

All the vector inputs (\({\varvec{x}}_1\) to \({\varvec{x}}_8\)) are extracted from the distorted waveforms, and together these 8 elements constitute the feature vector fed to the WN. \({\varvec{x}}_1\) is larger for transient disturbances than for other PQ disturbances. High values of \({\varvec{x}}_4\) and \({\varvec{x}}_2\) indicate voltage flicker and harmonic distortion, respectively. During voltage swell and sag, \({\varvec{x}}_3\) and \({\varvec{x}}_5\) are higher than the other elements. \({\varvec{x}}_5\) and \({\varvec{x}}_7\) indicate voltage interruption and notching, respectively, and the WN can detect a DC offset by evaluating \({\varvec{x}}_8\). Therefore, the different PQ-related features can be extracted by calculating \({\varvec{x}}_1\) to \({\varvec{x}}_8\).
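The construction of \({\varvec{x}}_1\) to \({\varvec{x}}_8\) can be sketched by transcribing (7), (9) and (11)-(15) literally. The code below assumes the 12 per-level detail-coefficient lists are already available; the level groupings for \({\varvec{x}}_1\) to \({\varvec{x}}_4\) and the max-levels 2, 5, 8 and 12 for \({\varvec{x}}_5\) to \({\varvec{x}}_8\) are taken from the equations above.

```python
import math

def level_stats(detail, j):
    """Mean absolute value (7) and standard deviation (9) of the detail
    coefficients at level j, including the 1/(2^j M_dj) weighting."""
    w = 1.0 / (2 ** j * len(detail))
    mean_abs = w * sum(abs(d) for d in detail)
    sd = math.sqrt(w * sum((abs(d) - mean_abs) ** 2 for d in detail))
    return mean_abs, sd

def wn_inputs(details):
    """x1..x8 from (11)-(15); `details` is [d_1, ..., d_12].
    Assumes every level has non-zero mean absolute value."""
    stats = [level_stats(d, j) for j, d in enumerate(details, start=1)]
    groups = [(0, 3), (3, 6), (6, 8), (8, 12)]  # levels 1-3, 4-6, 7-8, 9-12
    x = [[sd / mean for mean, sd in stats[a:b]] for a, b in groups]
    x += [max(abs(v) for v in details[j - 1]) for j in (2, 5, 8, 12)]
    return x  # [x1, x2, x3, x4, x5, x6, x7, x8]

x = wn_inputs([[1.0, -1.0]] * 12)
```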

2.4 OCSVM in a nutshell

SVM is a machine learning algorithm based on modern statistical learning theory [59,60,61,62]. It separates two classes by constructing a hypersurface in the input space, after the input is mapped to a higher-dimensional feature space by a non-linear mapping. In this section, we explain the multi-class SVM followed by the OCSVM.

Let us consider a data space \(\varvec{\varPsi }={({\varvec{x}}_i,{y}_i)}\), \(i=\{1,2,\ldots ,n\}\), where \({\varvec{x}}_i\in \mathbf{R }^{n}\) is the input data and \({y}_i\in \{-1,+1\}\) is the corresponding output pattern indicating class membership. For simplicity, we denote the input and projected data by \({\varvec{x}}\) and \({\varvec{y}}\). The SVM first projects the input vector \({\varvec{x}}\) to a higher dimensional space \({\mathcal {H}}\) by a non-linear operator \({\varvec{\varPhi }}(\cdot ):\mathbf {R} ^n\rightarrow {\mathcal {H}}\) in which the projected data are linearly separable.

The non-linear SVM classifier is expressed as (16), where \({\varvec{w}}\) is the hyperplane direction and b is the offset scalar:

$$\begin{aligned} \varOmega ({\varvec{x}})={{\varvec{w}}^{\text {T}}} \varvec{\varPhi }({\varvec{x}}) +b \qquad {\varvec{w}}\in {\mathcal {H}},\; b\in \mathbf{R } \end{aligned}$$
(16)

which is linear with respect to the projected data \(\varvec{\varPhi }({\varvec{x}})\) and non-linear with respect to the original data \({\varvec{x}}\).

The SVM maximizes the margin of the hyperplane. Slack variables (\(\xi _i\)) permit some data to lie within the margin (a soft margin) in order to protect the SVM from overfitting to noisy data. The objective function, which includes the minimization of \(||{\varvec{w}} ||\), can be written as:

$$\begin{aligned} \min _{{\varvec{w}},b,\xi _i} \bigg (\frac{||{\varvec{w}} ||^2}{2} + C \sum _{i=1}^n \xi _i\bigg ) \end{aligned}$$
(17)

subject to:

$$\begin{aligned} \left\{ \begin{array}{ll} \displaystyle y_i({\varvec{w}}^{\text {T}}\varvec{\varPhi }({\varvec{x}}_i) +b)\ge 1- \xi _i \\ \displaystyle \xi _i\ge 0 \\ \end{array} \right. \end{aligned}$$
(18)

where \(C>0\) is the regularization parameter that regulates the trade-off between enlarging the margin and the number of training points lying within that margin (thus reducing the training errors); \(\xi _i\, (i=1,2,\ldots ,n)\) is the slack variable; n is the number of input data.

Minimizing the objective function (17) with the Lagrange multiplier technique, the necessary condition for \({\varvec{w}}\) is:

$$\begin{aligned} {\varvec{w}}=\sum _{i=1}^{n} \gamma _i y_i \varvec{\varPhi }({\varvec{x}}_i) \end{aligned}$$
(19)

where \(\gamma _i>0\) is the Lagrange multiplier corresponding to the constraints in (18). The \(\gamma _i\) can be obtained from the dual of (17), written as:

$$\begin{aligned} \max W(\varvec{\gamma }) = \sum _{i=1}^{n} \gamma _i -\frac{1}{2} \sum _{i=1}^{n}\sum _{j=1}^{n} \gamma _i \gamma _j y_i y_j k({\varvec{x}}_i,{\varvec{x}}_j) \end{aligned}$$
(20)

subject to:

$$\begin{aligned} \left\{ \begin{array}{lr} \displaystyle 0\le \gamma _i\le C \\ \displaystyle \sum _{i=1}^{n} \gamma _i y_i=0 \\ \end{array} \right. \end{aligned}$$
(21)

Here \(k({\varvec{x}},{\varvec{y}})=\varvec{\varPhi }({\varvec{x}})^{\text {T}} \varvec{\varPhi }({\varvec{y}})\) is known as the kernel function; it determines the mapping of the input vectors to the high-dimensional feature space.

The Gaussian RBF kernel for the multi-class SVM is:

$$\begin{aligned} k({\varvec{x}},{\varvec{y}})=\exp \left(-\frac{{||{\varvec{x}}-{\varvec{y}} ||}^2 }{2\sigma ^2}\right) \end{aligned}$$
(22)

where \(\sigma \in {\mathbf{R }}\) is the width of RBF function.
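The kernel trick means \(\varvec{\varPhi }\) never has to be computed explicitly; only kernel values are needed. A minimal sketch of the Gaussian RBF kernel in the conventional \(2\sigma ^2\)-width form:

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    Equals Phi(x)^T Phi(y) for an implicit infinite-dimensional Phi."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

Note that \(k({\varvec{x}},{\varvec{x}})=1\) and the value decays toward 0 as the points move apart; a larger width \(\sigma\) gives a smoother, wider-reaching decision boundary.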

The OCSVM, a variation of the SVM, detects abnormal data within a single class [61, 62]. It maps the input vectors to the feature space according to the kernel function and separates them from the origin with maximum margin. It penalizes outliers by employing slack variables \(\xi\) in the objective function and carefully controls the trade-off between empirical risk and the regularization penalty.

The quadratic programming minimization function is:

$$\begin{aligned} \min _{{\varvec{w}},\xi _i,\rho } \bigg (\frac{1}{2} {||{\varvec{w}}||}^{2} + \frac{1}{vn} \sum _{i=1}^{n} \xi _i-\rho \bigg ) \end{aligned}$$
(23)

subject to:

$$\begin{aligned} \left\{ \begin{array}{ll} \displaystyle {\varvec{w}}\cdot \varvec{\varPhi } ({\varvec{x}}_i) \ge \rho -\xi _i\\ \displaystyle {{\xi _{i}} \ge 0} \\ \end{array} \right. \end{aligned}$$
(24)

where \(v\in (0,1]\) is a constant fixed a priori; \(\rho\) is the offset whose solved value indicates whether a given point falls within the considered high-density region.

Then the resultant decision function \(f_{w,\rho } ({\varvec{x}})\) takes the form:

$$\begin{aligned} f_{w,\rho } ({\varvec{x}})=\text {sgn}(({\varvec{w}}^{\varvec{*}})^{\text {T}}\varvec{\varPhi }({\varvec{x}})-\rho ^{*}) \end{aligned}$$
(25)

where \(\varvec{w^{*}}\) and \(\varvec{\rho ^{*}}\) are the values of \({\varvec{w}}\) and \(\varvec{\rho }\) obtained by solving (23).

In the OCSVM, v characterizes the solution instead of C (which controls smoothness) in that:

  1. 1)

    It determines an upper bound on the fraction of outliers.

  2. 2)

    It determines a lower bound on the fraction of training instances used as support vectors.

Due to the significance of v, the OCSVM is also termed v-SVM.
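These two bounds can be checked empirically. The sketch below assumes scikit-learn is available (its `OneClassSVM` implements the \(\nu\)-parameterized formulation above) and uses synthetic Gaussian feature vectors rather than the paper's wavelet features; \(\nu =0.1\) is chosen only to make the bounds visible on a small sample.

```python
# Empirical check of the nu property with scikit-learn's OneClassSVM
# (assumes scikit-learn is installed): nu upper-bounds the fraction of
# training points treated as outliers and lower-bounds the fraction of
# support vectors.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# 300 synthetic "normal" 8-dimensional feature vectors (illustrative
# stand-ins for the concatenated per-phase wavelet features)
normal = rng.normal(loc=0.0, scale=1.0, size=(300, 8))

model = OneClassSVM(kernel="rbf", nu=0.1, gamma=0.1).fit(normal)
outlier_frac = float(np.mean(model.predict(normal) == -1))  # predict: +1 normal, -1 outlier
sv_frac = len(model.support_) / len(normal)
```

After fitting, `outlier_frac` stays at or below \(\nu\) while `sv_frac` stays at or above it, which is exactly the pair of bounds stated above.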

3 System model

Considering that abnormality patterns in controlled experiments differ from reality and that real abnormal patterns are not easily accessible for training, a functional model should be trained on the available data with as few assumptions as possible about the types of abnormalities. As the model must run in a smart meter in real time, it should take the least possible computation time to learn the patterns of normal and abnormal signals from a small set of samples.

To create this model, signal samples are simulated with the Simulink toolbox in MATLAB using different circuits [63] that mimic real conditions in the distribution network. The sampling time for all samples is set to \(5\,\upmu \hbox {s}\), and 500 normal and 500 abnormal waveform samples of 0.2 s length, the latter spread equally over five categories, are generated. However, the entire sample set is not used for training and testing in each experiment. The input voltage and current to the circuits vary randomly within 0.05 above or below the defined standard in all circuits, resembling real power variation in distribution circuits. We keep a standard setting for all parameters in all circuits to avoid the effects of confounding variables. The DWT is exploited to extract the most informative features from each signal. Having a 3-phase signal, we process each phase separately, extracting the following features: RMS, Ent, E, average value (l), KT, standard deviation (\(\delta _r\)), SK, range (RG), detail (D) and approximation (A) coefficients. All of these are measured using MATLAB's built-in wavelet decomposition function. The process flow is shown in Fig. 2.

Fig. 2
figure 2

Process flow of disturbance detection by SVM

The processed features are fed into the OCSVM for disturbance detection. If a disturbance appears, the data are further processed by the multi-class SVM to obtain the detailed disturbance classification. The process is described in Fig. 3.

Fig. 3 Disturbance detection and classification

4 Simulation result

Having the same set of features for each signal phase, the features of all three phases are concatenated to compose a sample set of feature vectors. From this sample space, the OCSVM is trained exclusively on 300 normal samples; another 200 samples are kept for the testing phase. To define the hyperparameters \(\nu\) and \(\gamma\), a grid search is performed and the results are shown in Fig. 4. The figure represents the confusion matrix for different parameter settings, giving the numbers of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) for different values of \(\nu\) and \(\gamma\). Since we aim to detect all abnormalities at the cost of some false alarms, we pick \(\nu = 0.01\) and \(\gamma = 0.1\), with the highest number of TN (199) and TP (188). This means that among the 200 abnormal and 200 normal test samples, the OCSVM with a (Gaussian) radial basis function kernel detects 188 normal and 199 abnormal samples correctly, with 13 misclassifications in total. In this case, the average accuracy becomes \((188/200 + 199/200)/2 \times 100\% \approx 97\%\).
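Such a grid search can be sketched as follows, again in Python/scikit-learn on synthetic stand-in features (the grid values, the data, and the convention of treating "normal" as the positive class are assumptions chosen to mirror the counts above, not the authors' exact setup):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X_train = rng.normal(0.0, 0.3, size=(300, 10))   # 300 normal training samples
X_test_n = rng.normal(0.0, 0.3, size=(200, 10))  # 200 normal test samples
X_test_a = rng.normal(1.5, 0.5, size=(200, 10))  # 200 abnormal test samples

results = {}
for nu in (0.001, 0.01, 0.1):
    for gamma in (0.01, 0.1, 1.0):
        m = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_train)
        tp = int(np.sum(m.predict(X_test_n) == 1))    # normal kept as normal
        tn = int(np.sum(m.predict(X_test_a) == -1))   # abnormal rejected
        # (TP, FP, TN, FN) with "normal" as the positive class
        results[(nu, gamma)] = (tp, 200 - tn, tn, 200 - tp)

# Pick the setting maximising correct decisions, as in the paper's selection.
best = max(results, key=lambda k: results[k][0] + results[k][2])
```

Note that \(\nu\) upper-bounds the fraction of training samples treated as outliers, while \(\gamma\) controls how tightly the RBF boundary wraps the normal data.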

Principal component analysis (PCA) is applied to the dataset to reduce its dimensionality and to provide results representable in two-dimensional (2D) space. The OCSVM boundaries with the RBF kernel and two different \(\nu\) and \(\gamma\) settings on a sample set of data are illustrated in Fig. 5. The training samples are the green spots in the center of the figure, surrounded by decision boundaries. The blue clouds show the vectors nearest to the learned boundary; the darker the shade, the closer the region is to the boundary, indicating higher risk. The pink area surrounded by the red boundary is the safe area, and all samples that fall in this area are flagged as normal. For \(\nu = 0.001\) and \(\gamma = 1\), the accuracy is 80% (160/200). On the other hand, an accuracy as high as 93% (187/200) is found for \(\nu = 0.01\) and \(\gamma = 0.1\). Since representing all samples in 2D space reduces the interpretability of the figure, a subset of the sample data is selected and shown here to clarify the effect of the hyperparameters on the decision boundaries and classification performance.
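The 2D projection step itself is a standard PCA transform; a minimal Python/scikit-learn sketch on stand-in 10-dimensional feature vectors (the data and dimensions are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Stand-in feature vectors: 300 "normal" and 100 "abnormal" 10-D samples.
X = np.vstack([rng.normal(0.0, 0.3, size=(300, 10)),
               rng.normal(1.5, 0.5, size=(100, 10))])

pca = PCA(n_components=2).fit(X)
X2 = pca.transform(X)                    # 2-D coordinates for plotting
explained = pca.explained_variance_ratio_.sum()
```

The explained-variance ratio indicates how faithfully the 2D plot represents the original feature space; when normal and abnormal clusters are well separated, the first component tends to capture most of that separation.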

Fig. 4 Grid search result for the hyperparameters

The receiver operating characteristic (ROC) curve, which is used to evaluate classifier output quality, is shown in Fig. 6. The TP and FP rates are represented on the y-axis and x-axis, respectively. The red dashed line indicates the random baseline: without any binary classification algorithm, randomly labelling the samples tags them correctly with 50% probability, giving an area under the curve of 0.5. The larger the area under the curve, the more accurate the classifier. The blue line shows the diagnostic ability of the proposed classifier as the discrimination threshold is varied. The top left corner of the graph is the ideal point, with an FP rate of zero and a TP rate of one.
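An ROC curve for a one-class SVM is obtained by sweeping a threshold over its continuous decision scores; a hedged Python/scikit-learn sketch on the same kind of synthetic stand-in data (labels, seeds and cluster positions are assumptions):

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(3)
X_train = rng.normal(0.0, 0.3, size=(300, 10))           # normal only
X_test = np.vstack([rng.normal(0.0, 0.3, size=(200, 10)),
                    rng.normal(1.5, 0.5, size=(200, 10))])
y_test = np.r_[np.ones(200), np.zeros(200)]              # 1 = normal (positive)

m = OneClassSVM(kernel="rbf", nu=0.01, gamma=0.1).fit(X_train)
scores = m.decision_function(X_test)    # higher score = more "normal"
fpr, tpr, thresholds = roc_curve(y_test, scores)
roc_auc = auc(fpr, tpr)                 # 0.5 = chance level, 1.0 = ideal
```

Varying the threshold over `scores` traces the blue curve; the fixed \(\pm 1\) output of `predict` corresponds to just one point on it.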

Fig. 5 Effects of hyperparameter adjustment on the decision boundary and algorithm performance

Fig. 6 ROC of the OCSVM

Following the process schematic in Fig. 2, once an abnormality is detected, its type is determined in the next step by a multi-class SVM classification algorithm. This two-step approach increases the robustness of the model, especially in the detection phase when an unknown disturbance appears.

To the best of our knowledge, most research in this field has simulated its own dataset, with arbitrary assumptions about the size of the sample data, the complexity of the disturbance patterns, and the simulation parameter settings. Since accuracy and F-measure can vary dramatically with the complexity of the underlying input dataset, and since neither earlier research datasets nor any other public dataset in this scope is accessible [64], there is no actual baseline against which to compare the current results. Furthermore, because this kind of dataset is unbalanced (abnormal samples are not as abundant as normal samples), classification accuracy alone is not informative enough. We use precision, recall, the confusion matrix and the F-measure to better understand how the method performs [65]. Precision is the ratio of the number of TPs to the number of TPs plus FPs, and recall is the ratio of the number of TPs to the number of TPs plus FNs.
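These definitions reduce to a few lines of arithmetic. The sketch below (pure Python; the helper name is hypothetical) computes them from confusion-matrix counts, using the paper's convention of normal as the positive class, so TP = 188, FP = 1 (one missed abnormal) and FN = 12 (twelve false alarms):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts corresponding to the OCSVM result reported above.
p, r, f = precision_recall_f1(tp=188, fp=1, fn=12)
```

On unbalanced data these three numbers expose failure modes (missed detections vs. false alarms) that a single accuracy figure averages away.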

After anomaly detection by the OCSVM, the type of the anomaly should be determined by a multi-class classification algorithm. Since abnormal samples are scarce at training time, to achieve a realistic result multiple algorithms are trained on a relatively small training dataset in which the number of abnormal samples is at most 10 per class. The accuracy of the algorithms on an unbalanced testing dataset is shown in Table 1. Note that the F-measure (F1-score or F-score) is a measure of an algorithm's accuracy, defined as the weighted harmonic mean of the precision and recall of the test. The F-measure can be calculated in multiple ways; one of them is the F1-macro, as shown in Table 1. The outcomes demonstrate the superiority of the multi-class SVM and random forest algorithms for this multi-class classification task.
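The F1-macro variant mentioned above is simply the unweighted mean of per-class F1 scores, which treats rare disturbance classes on an equal footing with common ones. A self-contained sketch of the definition (pure Python; the function name is hypothetical):

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    classes = sorted(set(y_true))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)
```

Because each class contributes equally to the average, a classifier that ignores a rare disturbance class is penalised heavily, which is exactly the behaviour wanted for unbalanced test sets.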

To assess the effect of the training size on accuracy and F-measure, the same experiment is repeated on a larger dataset composed of 20 abnormal samples in each class. The results shown in Table 2 confirm the direct relation between training size and classification accuracy. This comparison demonstrates that different algorithms and techniques cannot be compared unless they are applied to the same training and testing sets.

In Table 2, the algorithm onevsall_svm (one-vs-rest) is a classification strategy in which a single classifier is trained per class, with the samples of that class as positives and all other samples as negatives. Repeating the one-vs-rest strategy on multi-class data discriminates the data into more than two classes. This technique reduces the multi-class classification problem to multiple binary classification problems.
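The strategy can be made explicit with one binary SVM per class; a hedged Python/scikit-learn sketch on three synthetic disturbance classes (the data, class layout and helper names are assumptions for illustration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
# Three synthetic disturbance classes, centred at 0, 1 and 2 per feature.
X = np.vstack([rng.normal(c, 0.2, size=(30, 4)) for c in (0.0, 1.0, 2.0)])
y = np.repeat([0, 1, 2], 30)

# One binary SVM per class: that class is labelled 1, all others 0.
binary_svms = {c: SVC(kernel="rbf", gamma="scale").fit(X, (y == c).astype(int))
               for c in (0, 1, 2)}

def predict_ovr(sample):
    """Pick the class whose binary SVM is most confident."""
    s = sample.reshape(1, -1)
    return max(binary_svms, key=lambda c: binary_svms[c].decision_function(s)[0])
```

Ties between the binary classifiers are resolved by comparing their continuous decision scores rather than their hard \(\{0,1\}\) outputs, which is what makes the combined prediction well defined.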

Table 1 Classification result with a small training set
Table 2 Classification result with a large training set

5 Conclusion

PQ reporting by consumer meters is an important addition to the smart grid from the utility company's perspective. Accordingly, in this study we propose machine-learning-based disturbance detection. To separate regular data from abnormal data, we propose a one-class version of the SVM; to categorize the disturbances, we propose a multi-class SVM. The OCSVM detects disturbances with \(93\%\) accuracy, while the multi-class SVM classifies the detected disturbances with an accuracy as high as \(90\%\), depending on the training dataset. The outcome of the SVM is reported to the utility company's back office, which helps to gain insight into PQ issues at the lowest distribution level and to maintain good PQ.