
Advanced Regression Analysis

Essentials of Business Analytics

Part of the book series: International Series in Operations Research & Management Science ((ISOR,volume 264))

Abstract

Three topics are covered in this chapter. In the main body of the chapter, the tools for estimating the parameters of regression models when the response variable is binary or categorical are presented. The appendices cover two other important techniques, namely, maximum likelihood estimate (MLE) and how to deal with missing data.


Notes

  1. R may produce different-looking output for the same chart.

Author information


Corresponding author

Correspondence to Vishnuprasad Nagadevara.


Electronic Supplementary Material

Supplementary Data 8.1

Advanced_Regression_Analysis (R 10 kb)

Supplementary Data 8.2

Employee_attrition (CSV 903 bytes)

Supplementary Data 8.3

Employee_attrition_nvars (CSV 4 kb)

Supplementary Data 8.4

Quality_Index (CSV 1003 kb)

Appendices

Appendix 1: Missing Value Imputation

Missing data is a common issue in almost all analyses. There are a number of ways of handling missing data, but each has its own advantages and disadvantages. This section discusses some common methods of missing value imputation along with their advantages and disadvantages.

Missing values can be classified into the following categories:

  (a) Missing at random: when the probability of non-response to a question depends only on the other items for which responses are complete, the value is categorized as "Missing at Random."

  (b) Missing completely at random: if the probability of a value being missing is the same for all observations, the value is categorized as "Missing Completely at Random."

  (c) Missing value that depends on unobserved predictors: the missing value depends on information that has not been recorded, and that unrecorded information also predicts the missing values. For example, discomfort associated with a particular treatment might lead patients to drop out of the treatment, leading to missing values.

  (d) Missing values depending on the variable itself: this occurs when the probability of a value being missing depends on the variable itself. For example, persons belonging to very-high-income groups may not want to report their income, which leads to missing values in the income variable.

Handling Missing Data by Deletion

Many times, the missing data problem can be handled simply by discarding data. One method is to exclude all observations in which values are missing. For example, in regression analysis, any observation in which the value of the dependent variable or of any independent variable is missing is excluded from the analysis. This method has two disadvantages. The first is that the exclusion of observations may introduce bias, especially if those excluded differ significantly from those included in the analysis. The second is that there may be only a few observations left for analysis after the observations with missing values are deleted.

This method is often referred to as "Complete Case Analysis" or "List-wise Deletion." It is most suitable when there are only a few observations with missing values.

The other way of discarding data is called "Available Case Analysis" or "Pair-wise Deletion." Each analysis is carried out using only those observations for which values of the relevant variable are available. For example, suppose that out of 1000 observations, information on income is available for only 870 observations and information on age is available for 960 observations. The analysis of age is then carried out using 960 observations, whereas the analysis of income uses 870 observations. The disadvantage of this method is that the analyses of different variables are based on different subsets of the data and hence are neither consistent nor comparable.
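The two deletion strategies can be illustrated with a minimal sketch in base R. The data frame and the variable names (age, income) are hypothetical stand-ins for the example above.

# Hypothetical data with missing values in both variables.
df <- data.frame(
  age    = c(34, 41, NA, 29, 52, 47, NA, 38),
  income = c(52, NA, 61, 45, NA, 72, 58, NA)
)

# List-wise deletion (complete case analysis):
# keep only the rows in which no variable is missing.
complete_df <- df[complete.cases(df), ]
nrow(complete_df)

# Pair-wise deletion (available case analysis):
# each statistic uses whatever observations are available for that variable,
# so different analyses are based on different subsets.
mean(df$age, na.rm = TRUE)
mean(df$income, na.rm = TRUE)
cor(df$age, df$income, use = "pairwise.complete.obs")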

Handling Missing Data by Imputation

These methods involve imputing the missing values. The advantage is that the observations with missing values need not be excluded from the analysis. The missing values are replaced by the best possible estimate.

  • Mean Imputation

This method substitutes missing values with the mean of the observed values of the variable. Even though it is the simplest method, mean imputation reduces the variance and pulls the correlations between variables toward zero. Imputation by the median or mode instead of the mean can also be used. In order to retain a certain amount of variation, the missing values can be replaced by a "group mean": the variable with missing values is grouped into different bins, the mean of each group (bin) is calculated, and the missing values in a particular group are replaced by the corresponding group mean instead of the overall mean.
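A minimal base R sketch of mean imputation and group-mean imputation follows; the values and the grouping variable are hypothetical.

x     <- c(12, NA, 15, 20, NA, 18, 25, 30)
group <- c("A", "A", "A", "A", "B", "B", "B", "B")

# Overall mean imputation: every missing value receives the same estimate,
# which shrinks the variance of x.
x_mean_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# Group-mean imputation: each missing value receives the mean of its own
# group (bin), which preserves some of the between-group variation.
group_means     <- ave(x, group, FUN = function(v) mean(v, na.rm = TRUE))
x_group_imputed <- ifelse(is.na(x), group_means, x)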

  • Imputation by Linear Regression

Missing values can also be imputed using linear regression. The first step is to treat the variable with missing values as the dependent variable and to identify several predictor variables in the dataset, for example, by using correlations. These variables are used to build a regression equation, and the missing values of the dependent variable are estimated from it. Sometimes an iterative process is used: all the missing values are first imputed, the regression coefficients are re-estimated using the completed set of observations, the missing values are recalculated, and the process is repeated until the difference in the imputed values between successive iterations falls below a predetermined threshold. While this method provides "good" estimates for the missing values, the disadvantage is that the model tends to fit too well, because the missing values are themselves estimated from the other variables of the same dataset.
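A minimal base R sketch of regression-based imputation for one variable follows; the data frame and the column names (income, age, expenditure) are hypothetical, and the iterative refinement is only indicated in a comment.

df <- data.frame(
  income      = c(52, NA, 61, 45, NA, 72, 58, 49),
  age         = c(34, 41, 46, 29, 52, 47, 40, 33),
  expenditure = c(30, 28, 35, 26, 40, 44, 33, 27)
)
miss <- is.na(df$income)

# Step 1: fit the regression on the complete cases only.
fit <- lm(income ~ age + expenditure, data = df[!miss, ])

# Step 2: replace the missing values by their predictions.
df$income[miss] <- predict(fit, newdata = df[miss, ])

# An iterative variant refits lm() on the completed data, re-predicts the
# originally missing values, and repeats until the imputed values change
# by less than a chosen threshold.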

  • Imputation with Time Series Data

Certain imputation methods are specific to time series data. One method is to carry forward the last observation or to carry backward the next observation. Another is to linearly interpolate the missing value using the adjacent values; this works better when the time series exhibits a trend. Wherever seasonality is involved, linear interpolation can be carried out after adjusting for seasonality.
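A minimal base R sketch of these two time-series imputations follows; the series is hypothetical and starts with an observed value so that carrying forward is well defined.

y <- c(110, 112, NA, 118, NA, 123, 125)   # hypothetical series
t <- seq_along(y)

# Last observation carried forward (LOCF).
y_locf <- y
for (i in which(is.na(y_locf))) y_locf[i] <- y_locf[i - 1]

# Linear interpolation between the adjacent observed values,
# suitable when the series exhibits a trend.
y_interp <- approx(x = t[!is.na(y)], y = y[!is.na(y)], xout = t)$y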

  • Imputation of Categorical Variables

Categorical variables, by their nature, require different imputation methods. Imputation by mode is the simplest, yet it introduces bias in much the same way as imputation by the mean. Missing values of a categorical variable can also be treated as a separate category in their own right. Alternatively, prediction models such as classification trees, k-nearest neighbors, logistic regression, and clustering can be used to estimate the missing values. The disadvantage of these approaches is that they require building predictive models, which can itself be expensive and time consuming.
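Two of the simpler treatments can be sketched in base R; the factor below is hypothetical.

status <- factor(c("Single", "Married", NA, "Married", "Single", NA, "Married"))

# (i) Mode imputation: replace NA with the most frequent observed level.
mode_level <- names(which.max(table(status)))
status_mode <- status
status_mode[is.na(status_mode)] <- mode_level

# (ii) Treat "missing" as a category of its own.
status_with_missing <- addNA(status)                       # NA becomes a level
levels(status_with_missing)[is.na(levels(status_with_missing))] <- "Missing"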

As mentioned earlier, missing data is a major issue in data analysis. While a number of methods are available for imputing missing values, no single method is "best." The method to use depends on the type of dataset, the type of variables, and the type of analysis.

Further Reading

  • Enders, C. K. (2010). Applied missing data analysis. New York: The Guilford Press.

  • Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). Hoboken, NJ: John Wiley & Sons, Inc.

  • Yadav, M. L. (2018). Handling missing values: A study of popular imputation packages in R. Knowledge-Based Systems. Retrieved July 24, 2018, from https://doi.org/10.1016/j.knosys.2018.06.012.

Appendix 2: Maximum Likelihood Estimation

One of the most widely used techniques for estimating the parameters of a mathematical model is least squares estimation, which is commonly used in linear regression. Maximum likelihood estimation (MLE) is another approach, developed for estimating parameters where the least squares method is not applicable, especially when the model is complex and nonlinear. MLE involves an iterative process, and the availability of computing power has made it increasingly popular. Since MLE does not impose any restrictions on the distribution or characteristics of the independent variables, it is becoming a preferred approach for estimation.

Let us consider a scenario where a designer boutique is trying to determine the probability, π, of a purchase being made using a credit card. The boutique is interested in calculating the value p, the maximum likelihood estimate of π. The boutique collected data on 50 purchases and found that 32 of the 50 were credit card purchases and the remaining 18 were cash purchases.

The maximum likelihood estimation process starts with the definition of a likelihood function, L(β), where β is the vector of unknown parameters. The elements of the vector β are the individual parameters β0, β1, β2, …, βk. The likelihood function, L(β), is the joint probability or likelihood of obtaining the data that was observed. The data of the boutique mentioned above can be described by binomial distribution with 32 successes observed out of 50 trials. The likelihood function for this example can be expressed as

$$ P\left(X=32 \mid N=50 \text{ and } \pi\right) = L\left(\pi\right) = K\,{\pi}^{32}{\left(1-\pi \right)}^{18} $$

where K is a constant, N is the number of trials, and π is the probability of success, which is to be estimated. We can take the first derivative of the above function, equate it to zero, and solve for the unknown parameter. Alternatively, we can substitute different values for π (i.e., possible estimates of π; let us call each such candidate p) and calculate the corresponding value of the likelihood function. These values usually turn out to be extremely small, so we take the log of the likelihood function instead. Since there is a one-to-one mapping between the values of the likelihood function and its log, we can pick the value of p which maximizes the log likelihood function. The values of the log likelihood function are plotted against the possible values of p in Fig. 8.5. The log likelihood function is at its maximum when p is equal to 0.64, so 0.64 is the maximum likelihood estimate of π.
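The calculation can be reproduced with a minimal base R sketch: dbinom() with log = TRUE evaluates the log of the binomial likelihood on a grid of candidate values of p, and the maximum occurs at 32/50 = 0.64.

p_grid <- seq(0.01, 0.99, by = 0.01)      # candidate values of p

# Log likelihood of observing 32 successes in 50 trials for each candidate p
# (the constant K is included but does not affect where the maximum lies).
loglik <- dbinom(32, size = 50, prob = p_grid, log = TRUE)

p_grid[which.max(loglik)]                 # 0.64, the MLE of pi
plot(p_grid, loglik, type = "l",
     xlab = "possible values of p", ylab = "log likelihood")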

Fig. 8.5 Log of likelihood function and possible values of p

The above example is deliberately simple; the objective is to introduce the concepts of the likelihood function and the log likelihood function and the idea of maximizing them. Even though most real applications of MLE involve multivariate distributions, the following example deals with a univariate normal distribution. The concepts of MLE with a univariate normal distribution can easily be extended to any other distribution with multiple variables.

Consider the data on income levels presented in Table 8.20. There are 20 observations, and we would like to estimate two parameters, the mean and the standard deviation of the population from which these observations were drawn.

Table 8.20 Sample data

The estimation process is based on the probability density function of the assumed distribution. Assuming that the above observations are drawn from a univariate normal distribution, the density function is

$$ {L}_i=\frac{1}{\sqrt{2\pi}\sigma }{e}^{-\frac{{\left({X}_i-\mu \right)}^2}{2\sigma^2 }} $$

where Xi is the value of the ith observation, μ is the population mean, σ is the population standard deviation, and Li is the likelihood value corresponding to the ith observation. Here, Li is the height of the density function, that is, the value of the density function f(x).

The joint probability of two independent events, Ei and Ej, occurring is the product of their individual probabilities. Even though Li and Lj, the likelihood values associated with observations i and j, are not exactly probabilities, the same rule applies. There are 20 observations in the given sample (Table 8.20); thus, the likelihood of the sample is given by the product of the corresponding likelihood values.

The sample likelihood is given by

$$ L=\prod_{i=1}^{20}\left[\frac{1}{\sqrt{2\pi}\sigma }{e}^{-\frac{{\left({X}_i-\mu \right)}^2}{2\sigma^2 }}\right] $$

The likelihood values of the 20 observations are presented in Table 8.21. These values are calculated with μ = 100 and σ2 = 380 (one candidate pair taken from the set of all possible values).

Table 8.21 Likelihood values of the sample values with μ = 100 and σ2 = 380

To get the sample likelihood, the above likelihood values are multiplied together. Since the individual likelihood values are small, the sample likelihood is extremely small (in this case, 7.5707 × 10−39). It is much more convenient to convert the individual Li values to their log values, because the log likelihood of the sample is then obtained simply by adding them. The log likelihood of the sample is given by

$$ \log L=\sum_{i=1}^{20}\mathit{\log}\left[\frac{1}{\sqrt{2\pi}\sigma }{e}^{-\frac{{\left({X}_i-\mu \right)}^2}{2\sigma^2 }}\right] $$

Table 8.22 presents the values of Li along with their log values. The log likelihood (log L) of the entire sample obtained from Table 8.22 is −87.7765. The log Li values in the table are based on the assumed values μ = 100 and σ2 = 380. The exercise is repeated with different values of μ, and the resulting log L values of the entire sample are plotted against the possible values of μ in Fig. 8.6.
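A minimal base R sketch of this calculation follows. The observations of Table 8.20 are not reproduced here, so the income vector below is a hypothetical stand-in (its log L will therefore differ numerically from −87.7765); dnorm() with log = TRUE gives the log Li values, and their sum is the sample log likelihood.

income <- rnorm(20, mean = 98.5, sd = sqrt(378))   # placeholder for Table 8.20

loglik_normal <- function(mu, sigma2, x) {
  # sum of log L_i, i.e. the log of the product of the individual likelihoods
  sum(dnorm(x, mean = mu, sd = sqrt(sigma2), log = TRUE))
}

loglik_normal(100, 380, income)   # log L at the assumed mu = 100, sigma^2 = 380

# Repeating the calculation over a grid of mu values (sigma^2 held at 380)
# gives the profile plotted in Fig. 8.6.
mu_grid <- seq(90, 110, by = 0.5)
logL    <- sapply(mu_grid, loglik_normal, sigma2 = 380, x = income)
mu_grid[which.max(logL)]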

Table 8.22 Likelihood values along with the log values
Fig. 8.6 Log L values of the entire sample and possible values of μ

It can be seen from Fig. 8.6 that the maximum value of log L is obtained when μ = 98.5. This is the maximum likelihood estimate (\( \widehat{\mu} \)) of the population parameter μ. The entire exercise is then repeated with different possible values of σ2 while keeping \( \widehat{\mu} \) fixed at 98.5. The values of log L corresponding to the different values of σ2 are plotted in Fig. 8.7.

Fig. 8.7 Log L values of the entire sample and possible values of σ2

The maximum likelihood estimate of σ2 based on Fig. 8.7 is 378. It can be concluded that the sample data follow a univariate normal distribution with maximum likelihood estimates \( \widehat{\mu}=98.5 \) and \( {\sigma}^2=378 \).

This appendix provides a brief description of the maximum likelihood estimation method and demonstrates it using a univariate normal distribution. The same technique can be extended to multivariate distribution functions. The solution clearly requires many iterations and considerable computing power. Note also that the values of the likelihood function are very small, so the log likelihood values tend to be negative. It is common practice to work with the negative log likelihood function and to minimize it instead of maximizing the log likelihood.
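A minimal base R sketch of that practice, using the same hypothetical income sample as above and the general-purpose optimizer optim() to minimize the negative log likelihood:

negloglik <- function(par, x) {
  mu     <- par[1]
  sigma2 <- par[2]
  if (sigma2 <= 0) return(Inf)              # keep the variance positive
  -sum(dnorm(x, mean = mu, sd = sqrt(sigma2), log = TRUE))
}

fit <- optim(par = c(mean(income), var(income)),   # starting values
             fn = negloglik, x = income)
fit$par                                     # MLEs of mu and sigma^2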

Further Reading

  • Eliason, S. R. (2015). Maximum likelihood estimation: Logic and practice. Quantitative Applications in the Social Sciences (Vol. 96). Thousand Oaks, CA: Sage Publications.

Appendix 3

We provide the R functions and command syntax used to build the various tables and charts referred to in the chapter; see Table 8.23. They can be helpful for practice.

Table 8.23 Relevant R functions


Copyright information

© 2019 Springer Nature Switzerland AG


Cite this chapter

Nagadevara, V. (2019). Advanced Regression Analysis. In: Pochiraju, B., Seshadri, S. (eds) Essentials of Business Analytics. International Series in Operations Research & Management Science, vol 264. Springer, Cham. https://doi.org/10.1007/978-3-319-68837-4_8
