1 Introduction

A telecommunication operator needs to understand the factors that directly influence revenue in order to stem any revenue loss. Factors for revenue loss can be either internal (e.g., network quality issues) or external (e.g., competitive offers). Traditional machine learning approaches are considered black boxes. Modelling the problem as a causal network [5] lets us determine the various cause-effect relationships amongst the variables, quantify how they impact revenue, and helps the domain expert interpret (and validate) the learnt model. Most existing causal network learning algorithms require stationarity of the time series, but in the telecommunication domain cause-effect relationships change with time, and the learning algorithm should be able to capture this non-stationarity; without it, the causal network becomes stale and revenue forecasts lose accuracy. Moreover, the telecommunication subscriber database is huge and subscribers' call data records arrive as continuous streams, so it is not possible to aggregate all the available data and learn in one go. Hence the learning algorithm needs to be incremental.

If we try to learn a single global model mapping subscriber usage to revenue, the mapping is non-linear. The data, however, contains groups (clusters) of subscribers, and within a group the model mapping recharge pattern and service utilization to revenue is in fact linear. Hence we propose to use the work of Ickstadt et al. [1], which shows how to learn a non-parametric Bayesian network with infinite Gaussian mixture models (a Bayesian network is a causal network if every edge represents a cause-effect relationship). It is well established that a mixture of multivariate Gaussians can approximate any density on \(\mathbb{R}^d\), provided the number of components can grow arbitrarily large.
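As an illustration of this clustering step, the minimal sketch below uses scikit-learn's truncated variational Dirichlet-process Gaussian mixture as a stand-in for the infinite Gaussian mixture model of [1]; the data, truncation level and hyper-parameters are placeholders, not the paper's setup.

```python
# Sketch: cluster subscriber records with a (truncated) Dirichlet-process
# Gaussian mixture, a practical surrogate for an infinite Gaussian mixture.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))  # placeholder subscriber records (D=4 features)

# n_components only caps the number of active clusters; the Dirichlet-process
# prior shrinks the weights of unused components towards zero.
dpgmm = BayesianGaussianMixture(
    n_components=30,
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full",
    random_state=0,
).fit(X)

cluster_of = dpgmm.predict(X)  # hard cluster assignment per record
print("active clusters:", np.unique(cluster_of).size)
```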

Fig. 1. Pictorial representation of the non-stationary temporal causal model

To overcome the key challenges of non-stationarity and large data volume, we model the problem as a non-stationary Temporal Causal Network (nsTCN, Fig. 1) and propose an incremental learning algorithm for it. Learning an nsTCN involves learning the infinite Gaussian mixture model followed by learning a transition causal network per cluster. We learn the temporal network under the first order Markovian assumption (FOMA), so each data record used for learning contains variables from two successive time instances. A transition causal network under FOMA captures the cause-effect relationships amongst the variables of the current time instance along with the transition causal relationships between variables of the previous and current time instances. We maintain transition probabilities at the cluster level to capture across-cluster transitions. We also associate hidden parameters with each cluster, which lets us model the hidden external (confounding) factors influencing the cause-effect relationships within the cluster. A sketch of the FOMA record construction is given below.
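The following sketch builds FOMA training records by concatenating a subscriber's features at month \(t-1\) with those at month \(t\); the column names (subscriber_id, month) are our own assumptions for illustration. With 16 features per time instance this yields the 32-column records used in Sect. 4.

```python
# Sketch: pair each subscriber's consecutive months into one FOMA record.
import pandas as pd

def make_transition_records(df: pd.DataFrame, features: list[str]) -> pd.DataFrame:
    """df has one row per (subscriber_id, month); returns rows carrying both
    <feature>_prev (month t-1) and <feature>_curr (month t) columns."""
    df = df.sort_values(["subscriber_id", "month"])
    prev = df.groupby("subscriber_id")[features].shift(1)  # previous month, per subscriber
    prev.columns = [f"{c}_prev" for c in features]
    curr = df[features].add_suffix("_curr")
    out = pd.concat([df[["subscriber_id", "month"]], prev, curr], axis=1)
    return out.dropna()  # drop each subscriber's first month (no predecessor)
```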

Our contributions for incremental learning of nsTCN are:

  • Rules to determine concept drift; these rules are triggered on every batch of incoming data.

  • A new algorithm to incrementally learn nsTCNs from streams of data, triggered only if a concept drift is detected.

More details on the solution are given in the subsequent sections. Section 2 discusses related research, Sect. 3 presents the proposed solution, Sect. 4 discusses the experiments and their results, and Sect. 5 covers future work and conclusion.

2 Related Research

Most causal discovery methods assume that cause-effect relationships are static and try to learn them from the data. Pearl [5] shows how causal inference in statistics can be modelled as a graphical model. We extend this to learn an nsTCN in an incremental fashion.

Zhou et al. [6] modelled causal analysis in a non-stationary setup as Granger causality; we, in contrast, are interested in learning the causal network itself. Huang et al. [7] model a time-dependent causal network in which time is treated as one of the causes of the changing causal influences, and they propose Gaussian Process regression for estimating the causal influence. A learnt Gaussian Process regression model has a memory requirement of \(O(ND+N^2)\), which is quadratic in the number of training samples, leading to a practical limit on the sample count. Our work is different in that (i) we propose an incremental learning algorithm, (ii) we tie re-learning to the identification of concept drift, and (iii) we associate our clusters with transition probabilities, enabling forecasting of concept drift.

3 Proposed Solution

We cannot afford to continuously re-learn from streaming data. Instead we first identify concept drift and, only if drift is detected, incrementally re-learn the nsTCN. Hence the proposed framework for incremental learning of nsTCN has the following major tasks:

  • Rules to determine concept drift

  • Algorithm to incrementally learn nsTCN

3.1 Rules to Identify Concept-Drift

A domain is non-stationary if it is associated with concept drift over time. Concept drift (CD) means that the statistical properties of the random variables have changed over time in unforeseen ways, leading to changes in the cause-effect relationships and their strengths.

Specific to nsTCN, rules are defined to determine the different types of concept drift, so that re-learning can be triggered accordingly.

  • First type of concept drift: for a new batch of data, determine the KL divergence between the batch's record-to-cluster distribution and the previous one; if the KL divergence is beyond a threshold, re-learning of type 1 is required.

  • Second type of concept drift: for the new batch of data, the likelihood of the records under their assigned clusters is determined. If, for any cluster, the likelihood has dropped below a threshold, re-learning of type 2 is required.

    We use these rules in Algorithm 1; the type of re-learning associated with each type of drift is explained in the following section. A minimal sketch of the two checks is given below.
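In the sketch below, the KL threshold of 0.02 is the value reported in Sect. 4, while the likelihood threshold and the exact representation of the record-to-cluster distribution are our own illustrative assumptions.

```python
# Sketch of the two concept-drift rules applied to every incoming batch.
import numpy as np

KL_THRESHOLD = 0.02       # type-1 drift threshold (value used in Sect. 4)
LOGLIK_THRESHOLD = -50.0  # type-2 drift threshold (illustrative)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two cluster-assignment distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def detect_drift(old_cluster_dist, new_cluster_dist, new_loglik_per_cluster):
    """Returns the set of re-learning types triggered by the new batch."""
    triggered = set()
    # Rule 1: has the record-to-cluster distribution shifted too much?
    if kl_divergence(new_cluster_dist, old_cluster_dist) > KL_THRESHOLD:
        triggered.add("type1")
    # Rule 2: has any cluster's average log-likelihood dropped below threshold?
    for cluster_id, loglik in new_loglik_per_cluster.items():
        if loglik < LOGLIK_THRESHOLD:
            triggered.add("type2")
    return triggered
```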

3.2 Algorithm to Incrementally Learn nsTCN

To learn clusters from streaming data we use the work proposed by Huynh et al. [2]. We propose changes to the conditional independence test used in the PC-Stable algorithm (summarized below) to enable incremental learning of the causal network per cluster.

Summarizing the PC-Stable algorithm, with our changes to the partial correlation tests [4]:

1. Learn the skeleton by iteratively testing pairwise conditional independence of variables given a set of observed variables \(\{X^{(r)}; r \in k\}\). In each iteration the size of the conditioning set \(\{X^{(r)}; r \in k\}\) is increased by 1.

2. To test whether \(X^{(i)} \perp X^{(j)} \mid \{X^{(r)}; r \in k\}\) we compute Fisher's Z-transform \(Z(i,j|k)=\frac{1}{2} \log \frac{1+ \rho_{i,j|k}}{1-\rho_{i,j|k}}\).

3. The partial correlation coefficient \(\rho_{i,j|k}\) can be computed from pairwise correlations using dynamic programming, since the recursion has a repetitive sub-problem structure: \(\rho_{i,j|k} = \frac{\rho_{i,j|k \setminus h} - \rho_{i,h|k \setminus h}\, \rho_{j,h|k \setminus h}}{\sqrt{(1-\rho^{2}_{i,h|k \setminus h})(1-\rho^{2}_{j,h|k \setminus h})}}\) for some \(h \in k\).

4. When the conditioning set contains a single variable, \(k = \{h\}\), the recursion bottoms out at \(\rho_{i,j|h} = \frac{\rho_{i,j}- \rho_{i,h}\, \rho_{j,h}}{\sqrt{(1-\rho^{2}_{i,h})(1-\rho^{2}_{j,h})}}\), which involves only pairwise correlations.

5. Equation (1) shows how the \(ci\text{-}suffStat\) collected per batch determines the pairwise correlation without having to revisit the actual batch data:

\(\rho_{i,j} = \frac{N\sum_b X^{(i)}_{b}X^{(j)}_{b} - \left(\sum_b X^{(i)}_{b}\right)\left(\sum_b X^{(j)}_{b}\right)}{\sqrt{N\sum_b X^{(i)^2}_{b} - \left(\sum_b X^{(i)}_{b}\right)^2}\,\sqrt{N\sum_b X^{(j)^2}_{b} - \left(\sum_b X^{(j)}_{b}\right)^2}}\)   (1)

where each sum is accumulated additively over all cached batches \(b\) and \(N\) is the total record count.

The key to designing an incremental algorithm is to identify additive operations and collect the required sufficient statistics: \(ci\text{-}suffStat_{b_c} = (\sum X^{(i)}_{b} X^{(j)}_b, \sum X^{(i)}_{b}, \sum X^{(j)}_{b}, \sum X^{(i)^2}_{b}, \sum X^{(j)^2}_{b})\). Equation (1) shows how, using \(ci\text{-}suffStat_{b_c}\), we determine the correlation between a pair of variables; these correlations are in turn rolled up via steps 2-4 to determine the skeleton. A sketch of this machinery appears below.
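The sketch below (our own, not the paper's code) shows the additive bookkeeping: per-batch sufficient statistics are merged, pairwise correlations are recovered via Eq. (1), and partial correlations and Fisher's Z follow steps 2-4. For compactness the sums of squares ride on the diagonal of the cross-product matrix.

```python
# Sketch: incremental conditional-independence machinery from cached ci-suffStats.
import math
import numpy as np

def batch_suffstat(X):
    """ci-suffStat for one batch: cross-products (squares on the diagonal),
    column sums, and the record count."""
    return {"xy": X.T @ X, "x": X.sum(axis=0), "n": X.shape[0]}

def merge(stats):
    """Sufficient statistics are additive, so merging never revisits raw data."""
    return {k: sum(s[k] for s in stats) for k in ("xy", "x", "n")}

def corr_from_suffstat(s):
    """Pairwise correlation matrix from merged sufficient statistics (Eq. 1)."""
    n, sx, sxy = s["n"], s["x"], s["xy"]
    cov = sxy / n - np.outer(sx, sx) / n**2
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)

def partial_corr(R, i, j, k):
    """rho_{i,j|k} via the recursion of step 3; repeated sub-problems could be
    memoized (dynamic programming), omitted here for brevity."""
    if not k:
        return R[i, j]
    h, rest = k[0], k[1:]
    r_ij = partial_corr(R, i, j, rest)
    r_ih = partial_corr(R, i, h, rest)
    r_jh = partial_corr(R, j, h, rest)
    return (r_ij - r_ih * r_jh) / math.sqrt((1 - r_ih**2) * (1 - r_jh**2))

def fisher_z(rho):
    """Fisher's Z-transform of step 2; |Z| scaled by sqrt(n - |k| - 3) is
    compared against a Gaussian quantile in the CI test."""
    return 0.5 * math.log((1 + rho) / (1 - rho))

# Example: merge two batches, then test X0 against X1 given {X2}.
rng = np.random.default_rng(1)
batches = [batch_suffstat(rng.normal(size=(500, 4))) for _ in range(2)]
R = corr_from_suffstat(merge(batches))
z = fisher_z(partial_corr(R, 0, 1, (2,)))
```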

The overall algorithm, which incrementally learns the complete nsTCN from streaming (batch-wise) data, is presented in Algorithm 1. The algorithm is triggered for every batch of data; it collects and caches \(ci\text{-}suffStat_{b_c}\) for every cluster, determines whether concept drift has occurred, and updates the causal network accordingly. Different types of re-learning are needed because re-learning the clusters is computationally costly, and the type of re-learning is determined by the concept drift rules. The sketch below outlines this per-batch flow.
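The following is a structural sketch of Algorithm 1's per-batch flow. All model methods are hypothetical stand-ins for the paper's routines, and the mapping of drift types to re-learning steps is our reading of Sect. 3.1, not a verbatim transcription.

```python
# Sketch: per-batch control flow of the incremental nsTCN learner.
def on_new_batch(batch, model):
    # Cache additive sufficient statistics per cluster (see sketch above).
    for cluster_id, records in model.assign_to_clusters(batch).items():
        model.suffstat_cache[cluster_id].append(batch_suffstat(records))
    drift_types = model.check_drift(batch)  # rules of Sect. 3.1
    if "type2" in drift_types:
        # Costly path: the cluster model itself no longer fits the data.
        model.relearn_clusters(batch)
        model.relearn_causal_networks()
    elif "type1" in drift_types:
        # Cheaper path: clusters still fit but their proportions shifted;
        # refresh the affected causal networks from cached statistics.
        model.relearn_causal_networks(only_drifted=True)
    # Cluster-level transition probabilities support forecasting of drift.
    model.update_transition_probabilities(batch)
```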

4 Experimentation

For non-stationary causal analysis of ARPU (average revenue per user) in the telecommunication domain, raw data for 0.091 million subscribers was collected over an 8-month period (August 2016 to March 2017) from one of the leading Indian telecommunication service providers' databases. Features were identified and a new dataset was built for the study. The dataset has a total of about 0.7 million records (0.091 million subscribers × 8 months) and around 32 features (one record includes a subscriber's information from 2 successive time instances).

Except for GROSS ARPU and NET ARPU, all features result from a subscriber's transactions. A transaction is one of: a call made/received, an SMS sent, data usage, or a recharge. These transaction values are aggregated month-wise to form the features. GROSS ARPU and NET ARPU are operator-determined values that represent the subscriber's overall monthly revenue. The features are:

  • Decrement: the value deducted per transaction from the subscriber's core balance.

  • MrpOfRechargeDone: market retail price of the recharge done.

  • TotalOgMou & TotalOgRev: the subscriber's outgoing call minutes and the associated revenue generated for the operator.

  • DataUsage, DataRev & DataArpu: data usage by the subscriber and the resultant revenue for the operator.

  • GrossArpu: GROSS average revenue per user.

  • NetArpu: NET average revenue per user.

  • StdOgMou & StdOgRev: national (STD) outgoing call minutes for the subscriber and the resultant revenue for the operator.

  • TotalIcMou, TotalLocalIcMou & TotalStdIcMou: overall, local and STD incoming call minutes for the subscriber.

  • LocalOnnetOgMou & LocalNetOgMou: within-operator local outgoing call minutes for the subscriber.
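To make the feature construction concrete, a hypothetical month-wise aggregation is sketched below; the raw transaction column names (og_minutes, og_revenue, data_mb, recharge_mrp) are our own placeholders, and only the resulting feature names come from the list above.

```python
# Sketch: roll raw transaction logs up into month-wise subscriber features.
import pandas as pd

def aggregate_monthly(raw: pd.DataFrame) -> pd.DataFrame:
    """raw has one row per transaction; returns one row per (subscriber, month).
    Only a few of the 16 per-instance features are shown for brevity."""
    return (raw.groupby(["subscriber_id", "month"])
               .agg(TotalOgMou=("og_minutes", "sum"),
                    TotalOgRev=("og_revenue", "sum"),
                    DataUsage=("data_mb", "sum"),
                    MrpOfRechargeDone=("recharge_mrp", "sum"))
               .reset_index())
```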

As per the proposed framework, data is fed to the algorithm in batches; the algorithm identified the clusters and, per cluster, learnt the associated causal network, which explains how the different usages affect the revenue. The framework also detected concept drift and updated the required causal networks accordingly.

As can be seen from Fig. 2a, the memory requirement of nsTCN is much lower than that of Huang et al.'s [7] GP-regression-based causal analysis. The memory requirement for GP regression is \(O(ND+N^2)\), whereas for nsTCN it is \(O(KBD^2 + KD^2)\), where \(N\) is the number of training records (samples), \(D\) is the number of features/random variables, \(K\) is the number of clusters identified, and \(B\) is the number of batches for which the sufficient statistics are maintained. A back-of-the-envelope comparison at this paper's data scale is shown below.
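In the arithmetic below, \(N\) and \(D\) follow Sect. 4, while \(K\) and \(B\) are illustrative choices rather than values reported by the paper.

```python
# Sketch: compare the two memory bounds at this paper's dataset scale.
N, D = 728_000, 32  # records and features (Sect. 4)
K, B = 30, 8        # assumed cluster count and cached batches

gp_regression = N * D + N**2     # O(ND + N^2)
nstcn = K * B * D**2 + K * D**2  # O(KBD^2 + KD^2)
print(f"GP regression ~{gp_regression:.2e} vs nsTCN ~{nstcn:.2e} units")
# GP regression ~5.30e+11 vs nsTCN ~2.76e+05 units
```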

Fig. 2. Comparisons: (a) memory requirement, (b) cluster density distribution

Now let us discuss the results of the algorithm, which identified concept drift and the associated nsTCN, on the telecommunication subscriber dataset. Figure 2b depicts the cluster density distribution for the top 3 clusters. It can be seen that for the month of December cluster 24 saw an increase of about 30% (from 21139 to 27724) in subscriber records mapped to it, while clusters 17 and 15 saw a drop in the subscribers mapped to them. The KL divergence was 0.09, above the threshold of 0.02.

In cluster 17 the direct causal factor for Gross ARPU is Decrement, and the direct causal factors for Decrement are TotalOgRev, DataRev and MrpOfRechargeDone. Each of the revenues related to calls, SMS and data is influenced by the respective usage. In cluster 24, data revenue did not contribute towards the subscriber's Gross ARPU.

Fig. 3. Non-stationary temporal causal networks associated with the clusters

Figure 3 shows the causal networks for the subscribers who underwent concept drift between the months of November and December. The complete disappearance of the edge suggests that the corresponding revenue dropped to zero.

From the difference in the causal networks we can infer that the factors that caused the concept drift must be related to the subscribers' data usage. This heuristic matches the launch of free services by a competing Indian telecommunication operator in December 2016, whose data-related services were widely adopted by subscribers. Using nsTCN to identify the concept drift and re-learn accordingly reduces the ARPU forecasting root mean square error (RMSE) from 85.6 to 27.1.

5 Future Works and Conclusion

We propose a framework to incrementally learn nsTCN: we define rules to identify concept drift and propose an algorithm to incrementally learn the non-stationary temporal causal networks associated with a domain. We use the proposed framework to model a real-world telecommunication problem, identify the concept drift that occurred, and show how non-stationary causal modelling helps us understand the impact on revenue. The learnt causal networks also provide insights that match well with the dominant market forces.

As part of future work, without any modification to the algorithm, we can add new variables to the dataset that capture seasonality and region, helping us understand their influence on revenue. We can also extend the algorithm to automate the identification of hidden external factors based on the learnt heuristics.