Binormal Precision–Recall Curves for Optimal Classification of Imbalanced Data

Liu, Zhongkai; Bondell, Howard D.

doi:10.1007/s12561-019-09231-9

Published: 11 February 2019

Statistics in Biosciences, Volume 11, pages 141–161 (2019)

Abstract

Binary classification on imbalanced data, i.e., data with a large skew in the class distribution, is a challenging problem. Evaluation of classifiers via the receiver operating characteristic (ROC) curve is common in binary classification, and techniques to develop classifiers that optimize the area under the ROC curve have been proposed. However, for imbalanced data, the ROC curve tends to give an overly optimistic view. To address this disadvantage, we propose an approach based on the Precision–Recall (PR) curve under the binormal assumption: we choose the classifier that maximizes the area under the binormal PR curve. The asymptotic distribution of the resulting estimator is derived. Simulations, as well as real data results, indicate that the binormal Precision–Recall method outperforms approaches based on the area under the ROC curve.

Author information

Authors and Affiliations

North Carolina State University, Raleigh, NC, USA
Zhongkai Liu
University of Melbourne, Melbourne, Australia
Howard D. Bondell

Corresponding author

Correspondence to Howard D. Bondell.

7 Appendix

7.1 Appendix A: Proof of Proposition 2

The population version of the problem can be described as

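$$\begin{aligned} \max _{b\in \{-1,1\},\,\beta }\quad&\int _0^1 \frac{\pi t}{\pi t+(1-\pi )\Phi \left( \frac{\mu _n-\mu _p}{\sigma _n}+\frac{\sigma _p}{\sigma _n}\Phi ^{-1}(t)\right) }\mathrm{{d}}t \nonumber \\ \text {subject to}\quad&\mu _p=b\mu _{11}+\beta \mu _{12},\quad \mu _n=b\mu _{01}+\beta \mu _{02}, \nonumber \\&\sigma _p^2=\sigma _{11}^2+\beta ^2 \sigma _{12}^2+2b\beta c_1,\quad \sigma _n^2=\sigma _{01}^2+\beta ^2 \sigma _{02}^2+2b\beta c_0. \end{aligned}$$

(7.15)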

We solve the optimization problem (7.15) separately for each $b\in \{-1,1\}$ and take the larger of the two maxima as the overall maximum. For fixed b, we use the method of Lagrange multipliers to maximize the objective subject to the four equality constraints, with the Lagrangian f written as

$$\begin{aligned} f(\mu _p,\mu _n,\sigma _p,\sigma _n,\beta ,{\varvec{\lambda }})&=\int _0^1 \frac{\pi t}{\pi t+(1-\pi )\Phi \left( \frac{\mu _n-\mu _p}{\sigma _n}+\frac{\sigma _p}{\sigma _n}\Phi ^{-1}(t)\right) }\mathrm{{d}}t \nonumber \\&-\lambda _1\left( \mu _p-b\mu _{11}-\beta \mu _{12}\right) -\lambda _2\left( \mu _n-b\mu _{01}-\beta \mu _{02}\right) \nonumber \\&-\lambda _3\left( \sigma _p^2-\sigma _{11}^2-\beta ^2 \sigma _{12}^2-2b\beta c_1\right) \nonumber \\&-\lambda _4\left( \sigma _n^2-\sigma _{01}^2-\beta ^2 \sigma _{02}^2-2b\beta c_0\right) . \end{aligned}$$

(7.16)
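The integral in the first line of (7.16) is the area under the binormal PR curve, with recall t on the horizontal axis. As a minimal numerical sketch (the function name `binormal_auprc` and the example parameter values are illustrative assumptions, not taken from the article), it can be evaluated with SciPy's adaptive quadrature:

```python
from scipy.integrate import quad
from scipy.stats import norm


def binormal_auprc(mu_p, mu_n, sigma_p, sigma_n, pi):
    """Area under the binormal PR curve, integrating precision over recall t.

    `pi` is the proportion of the positive class, not the constant 3.14159...
    """
    a = (mu_n - mu_p) / sigma_n
    s = sigma_p / sigma_n

    def precision(t):
        t = max(t, 1e-12)  # guard against 0/0 if t = 0 is ever evaluated
        fpr = norm.cdf(a + s * norm.ppf(t))  # false positive rate at recall t
        return pi * t / (pi * t + (1.0 - pi) * fpr)

    value, _abs_err = quad(precision, 0.0, 1.0)
    return value


if __name__ == "__main__":
    # Well-separated scores versus identical score distributions.
    print(binormal_auprc(1.5, 0.0, 1.0, 1.0, pi=0.1))
    print(binormal_auprc(0.0, 0.0, 1.0, 1.0, pi=0.1))
```

When the two score distributions coincide, the false positive rate equals the recall, precision is constant at $\pi$, and the area reduces to $\pi$. This gives a convenient sanity check.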

Setting the gradient $\nabla _{{\varvec{\lambda }}} f(\mu _p,\mu _n,\sigma _p,\sigma _n,\beta ,{\varvec{\lambda }})=0$ recovers the constraint equations listed in (7.15), and setting the gradient $\nabla _{\mu _p,\mu _n,\sigma _p,\sigma _n}f=0$ gives the Lagrange multipliers ${\varvec{\lambda }}$ as functions $\lambda _j=\lambda _j(b,\beta ,\mu _p,\mu _n,\sigma _p,\sigma _n)$ for $j=1,\ldots ,4$, as follows.

$$\begin{aligned}&\frac{\partial f}{\partial {\mu _p}}=0 \Longrightarrow \lambda _1=\int _0^1 \frac{\pi (1-\pi )t\phi \left( \frac{\mu _n-\mu _p}{\sigma _n}+\frac{\sigma _p}{\sigma _n} \Phi ^{-1}(t)\right) }{\sigma _n\left[ \pi t+(1-\pi )\Phi \left( \frac{\mu _n-\mu _p}{\sigma _n}+\frac{\sigma _p}{\sigma _n} \Phi ^{-1}(t)\right) \right] ^2}\mathrm{{d}}t, \nonumber \\&\frac{\partial f}{\partial {\mu _n}}=0 \Longrightarrow \lambda _2=-\lambda _1,\nonumber \\&\frac{\partial f}{\partial {\sigma _p}}=0 \Longrightarrow \lambda _3=-\frac{1}{2\sigma _p}\int _0^1 \frac{\pi (1-\pi )t\Phi ^{-1}(t)\phi \left( \frac{\mu _n-\mu _p}{\sigma _n} +\frac{\sigma _p}{\sigma _n}\Phi ^{-1}(t)\right) }{\sigma _n\left[ \pi t+(1-\pi )\Phi \left( \frac{\mu _n-\mu _p}{\sigma _n}+\frac{\sigma _p}{\sigma _n} \Phi ^{-1}(t)\right) \right] ^2}\mathrm{{d}}t, \nonumber \\&\frac{\partial f}{\partial {\sigma _n}}=0 \Longrightarrow \lambda _4=\frac{1}{2\sigma _n}\int _0^1 \frac{\pi (1-\pi )t\left( \mu _n-\mu _p+\sigma _p\Phi ^{-1}(t)\right) \phi \left( \frac{\mu _n-\mu _p}{\sigma _n}+\frac{\sigma _p}{\sigma _n}\Phi ^{-1}(t)\right) }{\sigma _n^2\left[ \pi t+(1-\pi )\Phi \left( \frac{\mu _n-\mu _p}{\sigma _n}+\frac{\sigma _p}{\sigma _n} \Phi ^{-1}(t)\right) \right] ^2}\mathrm{{d}}t, \end{aligned}$$

(7.17)

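where $\phi (x)$ denotes the probability density function of the standard normal distribution, i.e., $\phi (x)=\frac{1}{\sqrt{2\pi }} e^{-x^2/2}$. In this density the symbol $\pi $ is the mathematical constant; everywhere else $\pi $ denotes the proportion of the positive class.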
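The expression for $\lambda _1$ in (7.17) is the partial derivative of the PR-curve integral with respect to $\mu _p$, so it can be checked numerically against a central finite difference. The sketch below does this for one set of illustrative parameter values; the function names and the step size `h` are assumptions made for the example.

```python
from scipy.integrate import quad
from scipy.stats import norm


def _u(t, mu_p, mu_n, sigma_p, sigma_n):
    # Argument of Phi and phi in (7.16) and (7.17).
    return (mu_n - mu_p) / sigma_n + (sigma_p / sigma_n) * norm.ppf(t)


def auprc(mu_p, mu_n, sigma_p, sigma_n, pi):
    def precision(t):
        t = max(t, 1e-12)  # guard against 0/0 if t = 0 is ever evaluated
        fpr = norm.cdf(_u(t, mu_p, mu_n, sigma_p, sigma_n))
        return pi * t / (pi * t + (1.0 - pi) * fpr)

    return quad(precision, 0.0, 1.0)[0]


def lambda_1(mu_p, mu_n, sigma_p, sigma_n, pi):
    """First line of (7.17), evaluated by numerical quadrature."""
    def integrand(t):
        t = max(t, 1e-12)
        u = _u(t, mu_p, mu_n, sigma_p, sigma_n)
        denom = sigma_n * (pi * t + (1.0 - pi) * norm.cdf(u)) ** 2
        return pi * (1.0 - pi) * t * norm.pdf(u) / denom

    return quad(integrand, 0.0, 1.0)[0]


def finite_difference_mu_p(mu_p, mu_n, sigma_p, sigma_n, pi, h=1e-3):
    """Central difference of the PR-curve area with respect to mu_p."""
    up = auprc(mu_p + h, mu_n, sigma_p, sigma_n, pi)
    down = auprc(mu_p - h, mu_n, sigma_p, sigma_n, pi)
    return (up - down) / (2.0 * h)


if __name__ == "__main__":
    params = dict(mu_p=1.5, mu_n=0.0, sigma_p=1.2, sigma_n=1.0, pi=0.1)
    print(lambda_1(**params), finite_difference_mu_p(**params))
```

The two printed numbers should agree up to quadrature and finite-difference error. The same pattern applies to $\lambda _2$, $\lambda _3$, and $\lambda _4$, keeping in mind that $\lambda _3$ and $\lambda _4$ carry the extra factors $1/(2\sigma _p)$ and $1/(2\sigma _n)$ because the constraints are written in terms of variances.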

Setting the gradient $\nabla _{\beta }f=0$ at the maximizer $(b_0,\beta _0)$, and writing $\lambda _{j0}$ for $\lambda _j$ evaluated there, we have

$$\begin{aligned} \lambda _{10} \mu _{12}+\lambda _{20} \mu _{02}+2\lambda _{30}\beta _0\sigma _{12}^2+2\lambda _{30} b_0 c_1+2\lambda _{40}\beta _0\sigma _{02}^2+2\lambda _{40} b_0 c_0=0.\nonumber \\ \end{aligned}$$

(7.18)
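Rather than solving (7.18) analytically, the search over $b\in \{-1,1\}$ and $\beta $ can also be carried out numerically. The following is a rough sketch under stated assumptions: the population moments in `MOMENTS`, the class proportion `PI`, and the search interval are hypothetical values chosen for illustration, and a bounded one-dimensional optimizer is used in place of the Lagrangian system. A bounded scalar search only finds a local optimum in general.

```python
import math

from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.special import ndtr, ndtri  # standard normal CDF and quantile

# Hypothetical population moments (illustrative values, not from the article).
MOMENTS = dict(mu11=1.0, mu12=0.5, mu01=0.0, mu02=0.0,
               s11=1.0, s12=1.0, s01=1.0, s02=1.0, c1=0.0, c0=0.0)
PI = 0.1  # assumed proportion of the positive class


def auprc(b, beta, m=MOMENTS, pi=PI):
    """Objective of (7.15) with the four equality constraints substituted in."""
    mu_p = b * m["mu11"] + beta * m["mu12"]
    mu_n = b * m["mu01"] + beta * m["mu02"]
    sigma_p = math.sqrt(m["s11"] ** 2 + beta ** 2 * m["s12"] ** 2 + 2 * b * beta * m["c1"])
    sigma_n = math.sqrt(m["s01"] ** 2 + beta ** 2 * m["s02"] ** 2 + 2 * b * beta * m["c0"])

    def precision(t):
        t = max(t, 1e-12)  # guard against 0/0 if t = 0 is ever evaluated
        fpr = ndtr((mu_n - mu_p) / sigma_n + (sigma_p / sigma_n) * ndtri(t))
        return pi * t / (pi * t + (1.0 - pi) * fpr)

    return quad(precision, 0.0, 1.0)[0]


def maximize_auprc(bound=10.0):
    """Solve separately for b = -1 and b = 1, then keep the better solution."""
    best = None
    for b in (-1, 1):
        res = minimize_scalar(lambda beta: -auprc(b, beta),
                              bounds=(-bound, bound), method="bounded")
        candidate = (-res.fun, b, float(res.x))
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best  # (area, b, beta)


if __name__ == "__main__":
    print(maximize_auprc())
```

With these particular moments (equal unit variances and zero covariances in both classes), the area depends on $(b,\beta )$ only through the standardized mean difference of the score, so the search should settle on $b=1$ with $\beta $ near 0.5.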

7.2 Appendix B: Proof of Theorem 1

Define the within-class sample moments ${\bar{X}}_{11}=\sum \limits _{i=1}^{n_1}X_{11i}/n_1$, ${\bar{X}}_{12}=\sum \limits _{i=1}^{n_1}X_{12i}/n_1$, ${\bar{X}}_{01}=\sum \limits _{i=1}^{n_0}X_{01i}/n_0$, ${\bar{X}}_{02}=\sum \limits _{i=1}^{n_0}X_{02i}/n_0$, ${\hat{\sigma }}_{11}^2=\frac{1}{n_1}\sum \limits _{i=1}^{n_1}\left( X_{11i}-{\bar{X}}_{11}\right) ^2$, ${\hat{\sigma }}_{12}^2=\frac{1}{n_1}\sum \limits _{i=1}^{n_1}\left( X_{12i}-{\bar{X}}_{12}\right) ^2$, ${\hat{\sigma }}_{01}^2=\frac{1}{n_0}\sum \limits _{i=1}^{n_0}\left( X_{01i}-{\bar{X}}_{01}\right) ^2$, ${\hat{\sigma }}_{02}^2=\frac{1}{n_0}\sum \limits _{i=1}^{n_0}\left( X_{02i}-{\bar{X}}_{02}\right) ^2$, ${\hat{c}}_1=\frac{1}{n_1}\sum \limits _{i=1}^{n_1}\left( X_{12i}-{\bar{X}}_{12}\right) \left( X_{11i}-{\bar{X}}_{11}\right) $, and ${\hat{c}}_0=\frac{1}{n_0}\sum \limits _{i=1}^{n_0}\left( X_{02i}-{\bar{X}}_{02}\right) \left( X_{01i}-{\bar{X}}_{01}\right) $.

Replacing the means, variances, and covariances in (7.15) by these sample versions, let $({\hat{b}},{\hat{\beta }})$ be the corresponding solution. Then ${\hat{\mu }}_p={\hat{b}}{\bar{X}}_{11}+{\hat{\beta }}{\bar{X}}_{12}$, ${\hat{\mu }}_n={\hat{b}}{\bar{X}}_{01}+{\hat{\beta }}{\bar{X}}_{02}$, ${\hat{\sigma }}_p^2=\sum \limits _{i=1}^{n_1}\left( {\hat{b}}X_{11i}-{\hat{b}}{\bar{X}}_{11}+{\hat{\beta }}\left( X_{12i}-{\bar{X}}_{12}\right) \right) ^2/n_1$, and ${\hat{\sigma }}_n^2=\sum \limits _{i=1}^{n_0}\left( {\hat{b}}X_{01i}-{\hat{b}}{\bar{X}}_{01}+{\hat{\beta }}\left( X_{02i}-{\bar{X}}_{02}\right) \right) ^2/n_0$.

Consequently, the sample version of Eq. (7.18) gives the estimating equation ${\hat{g}}$:

$$\begin{aligned} {\hat{g}}= & {} {\hat{\lambda }}_1 {\bar{X}}_{12}+{\hat{\lambda }}_2 {\bar{X}}_{02}+2{\hat{\lambda }}_3 {\hat{\beta }}{\hat{\sigma }}_{12}^2+2{\hat{\lambda }}_3 {\hat{b}}{\hat{c}}_1 +2{\hat{\lambda }}_4{\hat{\beta }}{\hat{\sigma }}_{02}^2+2{\hat{\lambda }}_4 {\hat{b}}{\hat{c}}_0 \nonumber \\= & {} {\hat{\lambda }}_1 {\bar{X}}_{12}+{\hat{\lambda }}_2 {\bar{X}}_{02}+\frac{2{\hat{\lambda }}_3 {\hat{\beta }}}{n_1}\sum \limits _{i=1}^{n_1}\left( X_{12i}-{\bar{X}}_{12}\right) ^2 +\frac{2{\hat{\lambda }}_3 {\hat{b}}}{n_1}\sum \limits _{i=1}^{n_1}(X_{12i}-{\bar{X}}_{12})(X_{11i}-{\bar{X}}_{11}) \nonumber \\+ & {} \frac{2{\hat{\lambda }}_4{\hat{\beta }}}{n_0}\sum \limits _{i=1}^{n_0}( X_{02i}-{\bar{X}}_{02})^2+\frac{2{\hat{\lambda }}_4 {\hat{b}}}{n_0}\sum \limits _{i=1}^{n_0}\left( X_{02i}-{\bar{X}}_{02}\right) \left( X_{01i}-{\bar{X}}_{01}\right) =0, \end{aligned}$$

(7.19)

where ${\hat{\lambda }}_j=\lambda _j({\hat{b}},{\hat{\beta }},{\hat{\mu }}_p,{\hat{\mu }}_n,{\hat{\sigma }}_p,{\hat{\sigma }}_n)$ for $j=1,\ldots ,4$.
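The sample moments at the start of this proof all use the divisor $n_1$ or $n_0$ (not $n_1-1$ or $n_0-1$), which is what makes ${\hat{\sigma }}_p^2$ and ${\hat{\sigma }}_n^2$ exactly equal to the plug-in versions of the variance constraints. A small sketch with simulated data illustrates this; the function names, the simulated distribution, and the values of `b` and `beta` are assumptions made for the example.

```python
import numpy as np


def sample_moments(X):
    """Means, variances (divisor n), and covariance for one class.

    X is an (n, 2) array whose columns are the two predictors.
    """
    means = X.mean(axis=0)
    variances = X.var(axis=0)  # NumPy's default ddof=0 gives the 1/n divisor
    cov = np.mean((X[:, 0] - means[0]) * (X[:, 1] - means[1]))
    return means, variances, cov


def plug_in_score_moments(X, b, beta):
    """Mean and variance of the score b*X1 + beta*X2 from the sample moments."""
    means, variances, cov = sample_moments(X)
    mu = b * means[0] + beta * means[1]
    # b is restricted to {-1, 1}, so b**2 == 1 as in the constraints of (7.15).
    var = b ** 2 * variances[0] + beta ** 2 * variances[1] + 2 * b * beta * cov
    return mu, var


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_pos = rng.normal(loc=[1.0, 0.5], scale=[1.0, 2.0], size=(200, 2))
    b, beta = 1, 0.7
    mu_hat, var_hat = plug_in_score_moments(X_pos, b, beta)
    score = b * X_pos[:, 0] + beta * X_pos[:, 1]
    print(np.isclose(mu_hat, score.mean()), np.isclose(var_hat, score.var()))
```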

Expand the estimating equation ${\hat{g}}$ in a first-order Taylor series around $\beta _0$ as

$$\begin{aligned} 0={\hat{g}} \approx g+g'\cdot ({\hat{\beta }}-\beta _0), \end{aligned}$$

(7.20)

where $g'$ denotes the first derivative of g with respect to $\beta $ and $g(\beta )$ is expressed in Eq. (7.18).

As a result,

$$\begin{aligned} \sqrt{n}({\hat{\beta }}-\beta _0)\approx \frac{\sqrt{n}({\hat{g}}-g)}{g'}. \end{aligned}$$

(7.21)

By the weak law of large numbers (WLLN), and using the sample version of Eq. (7.17) with sample means and variances plugged in for their population counterparts, we have

$$\begin{aligned} g' \xrightarrow {p} B= & {} \lambda _{10}^{'} \mu _{12}+\lambda _{20}^{'} \mu _{02} +2\lambda _{30}^{'} \beta _0 \sigma _{12}^2+2\lambda _{30}\sigma _{12}^2+2\lambda _{40}^{'} \beta _0 \sigma _{02}^2\nonumber \\&+2\lambda _{40}\sigma _{02}^2+2\lambda _{30}^{'} b_0 c_1+2\lambda _{40}^{'} b_0 c_0, \end{aligned}$$

(7.22)

where $\lambda _{j0}^{'}=\frac{\partial {\lambda _{j0}}}{\partial {\beta _0}}$ for $j=1,\ldots ,4$.

Under the independence structure among the predictors, and with $\frac{n_1}{n} \rightarrow \pi ,~\frac{n_0}{n} \rightarrow 1-\pi $, the Central Limit Theorem (CLT) gives

$$\begin{aligned}&\sqrt{n}\left( \left( {\bar{X}}_{11},{\bar{X}}_{12},{\bar{X}}_{01},{\bar{X}}_{02}, {\hat{\sigma }}_{11}^2,{\hat{\sigma }}_{12}^2,{\hat{\sigma }}_{01}^2, {\hat{\sigma }}_{02}^2, {\hat{c}}_1,{\hat{c}}_0\right) ^T\right. \nonumber \\&\quad \left. -\left( \mu _{11},\mu _{12},\mu _{01},\mu _{02},\sigma _{11}^2, \sigma _{12}^2,\sigma _{01}^2,\sigma _{02}^2,c_1,c_0\right) ^T \right) \nonumber \\&\qquad \xrightarrow {d} N(0,\Sigma ), \end{aligned}$$

(7.23)

where $\Sigma $ is a diagonal matrix whose main diagonal is the vector

$$\begin{aligned} \Big (\frac{\sigma _{11}^2}{\pi }, \frac{\sigma _{12}^2}{\pi }, \frac{\sigma _{01}^2}{1-\pi }, \frac{\sigma _{02}^2}{1-\pi }, \frac{\mu _{114}-\sigma _{11}^4}{\pi }, \frac{\mu _{124}-\sigma _{12}^4}{\pi }, \frac{\mu _{014}-\sigma _{01}^4}{1-\pi }, \frac{\mu _{024}-\sigma _{02}^4}{1-\pi }, \frac{\sigma _{11}^2\sigma _{12}^2}{\pi }, \frac{\sigma _{01}^2\sigma _{02}^2}{1-\pi }\Big )^T, \end{aligned}$$

and $\mu _{ij4}=\text {E}\left[ (X_{ij}-\mu _{ij})^4\right] $ with $i \in \{0,1\}$ and $j \in \{1,2\}$.
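For concreteness, the diagonal matrix $\Sigma $ of (7.23) can be assembled directly from the class proportion, the variances, and the fourth central moments. The sketch below is illustrative only: the function name and argument layout are assumptions, and the usage example takes Gaussian predictors, for which the fourth central moment equals $3\sigma ^4$.

```python
import numpy as np


def asymptotic_sigma(pi, var1, var0, mu4_1, mu4_0):
    """Diagonal covariance matrix of (7.23).

    var1  = (sigma11^2, sigma12^2), var0  = (sigma01^2, sigma02^2)
    mu4_1 = (mu_114, mu_124),       mu4_0 = (mu_014, mu_024)
    Entry order: four means, four variances, then c1 and c0.
    """
    var1, var0 = np.asarray(var1, float), np.asarray(var0, float)
    mu4_1, mu4_0 = np.asarray(mu4_1, float), np.asarray(mu4_0, float)
    diag = np.concatenate([
        var1 / pi,                         # X-bar_11, X-bar_12
        var0 / (1.0 - pi),                 # X-bar_01, X-bar_02
        (mu4_1 - var1 ** 2) / pi,          # sigma-hat_11^2, sigma-hat_12^2
        (mu4_0 - var0 ** 2) / (1.0 - pi),  # sigma-hat_01^2, sigma-hat_02^2
        [var1[0] * var1[1] / pi],          # c-hat_1
        [var0[0] * var0[1] / (1.0 - pi)],  # c-hat_0
    ])
    return np.diag(diag)


if __name__ == "__main__":
    # Assumed Gaussian predictors: fourth central moment is 3 * sigma^4.
    var1, var0 = np.array([1.0, 4.0]), np.array([1.0, 1.0])
    Sigma = asymptotic_sigma(0.1, var1, var0, 3 * var1 ** 2, 3 * var0 ** 2)
    print(np.diag(Sigma))
```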

Consequently, by the multivariate delta method, we have

$$\begin{aligned}&\sqrt{n}\left( {\hat{g}}-g\right) \xrightarrow {d} N\left( 0,D\right) ,\quad D = \mathbf{g}'^{T} \Sigma \mathbf{g}',\nonumber \\ \hbox {where }&\mathbf{g}' = \Big (\frac{\partial {g}}{\partial {\mu _{11}}}, \frac{\partial {g}}{\partial {\mu _{12}}}, \frac{\partial {g}}{\partial {\mu _{01}}}, \frac{\partial {g}}{\partial {\mu _{02}}}, \frac{\partial {g}}{\partial {\sigma _{11}^2}}, \frac{\partial {g}}{\partial {\sigma _{12}^2}}, \frac{\partial {g}}{\partial {\sigma _{01}^2}}, \frac{\partial {g}}{\partial {\sigma _{02}^2}}, \frac{\partial {g}}{\partial {c_{1}}}, \frac{\partial {g}}{\partial {c_{0}}}\Big )^T. \end{aligned}$$

(7.24)

Finally, by Eq. (7.21), the asymptotic distribution of ${\hat{\beta }}$ is

$$\begin{aligned} \sqrt{n}({\hat{\beta }}-\beta _0) \xrightarrow {d} N(0,V), \end{aligned}$$

(7.25)

where $V=D/B^2$, with B and D defined in Eqs. (7.22) and (7.24), respectively.
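Once numerical values for B, $\mathbf{g}'$, and $\Sigma $ are available, the variance $V=D/B^2$ is a short linear-algebra computation. The sketch below assumes those inputs are supplied by the user (the numbers in the usage example are placeholders, not results from the article) and, as one possible use of (7.25), forms an approximate Wald-type interval for $\beta _0$.

```python
import numpy as np


def asymptotic_variance(grad_g, Sigma, B):
    """V = D / B^2 with D = g'^T Sigma g', following (7.22), (7.24), (7.25)."""
    grad_g = np.asarray(grad_g, dtype=float)
    D = float(grad_g @ np.asarray(Sigma, dtype=float) @ grad_g)
    return D / B ** 2


def wald_interval(beta_hat, V, n, z=1.959963984540054):
    """Approximate 95% interval based on sqrt(n)(beta_hat - beta_0) -> N(0, V)."""
    half_width = z * np.sqrt(V / n)
    return beta_hat - half_width, beta_hat + half_width


if __name__ == "__main__":
    # Placeholder inputs for illustration only.
    grad_g = np.ones(10)
    Sigma = np.eye(10)
    V = asymptotic_variance(grad_g, Sigma, B=2.0)
    print(V, wald_interval(beta_hat=0.5, V=V, n=1000))
```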

About this article

Cite this article

Liu, Z., Bondell, H.D. Binormal Precision–Recall Curves for Optimal Classification of Imbalanced Data. Stat Biosci 11, 141–161 (2019). https://doi.org/10.1007/s12561-019-09231-9

Received: 07 September 2017
Revised: 14 January 2019
Accepted: 05 February 2019
Published: 11 February 2019
Issue Date: 15 April 2019
DOI: https://doi.org/10.1007/s12561-019-09231-9
