Mining outlying aspects on numeric data

Abstract

When we are investigating an object in a data set, which itself may or may not be an outlier, can we identify unusual (i.e., outlying) aspects of the object? In this paper, we identify the novel problem of mining outlying aspects on numeric data. Given a query object \(o\) in a multidimensional numeric data set \(O\), in which subspace is \(o\) most outlying? Technically, we use the rank of the probability density of an object in a subspace to measure the outlyingness of the object in the subspace. A minimal subspace where the query object is ranked the best is an outlying aspect. Computing the outlying aspects of a query object is far from trivial. A naïve method has to calculate the probability densities of all objects and rank them in every subspace, which is very costly when the dimensionality is high. We systematically develop a heuristic method that is capable of searching data sets with tens of dimensions efficiently. Our empirical study using both real data and synthetic data demonstrates that our method is effective and efficient.
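To make the problem statement concrete, the following is a minimal brute-force sketch in Python of the density-rank formulation described above: it enumerates every subspace up to a small dimensionality, ranks the query by a product-Gaussian quasi-density in each subspace, and keeps the minimal subspaces achieving the best rank. It only illustrates the problem definition, not the OAMiner heuristic developed in the paper; all function and variable names are illustrative assumptions.

```python
from itertools import combinations
import numpy as np

def bandwidth(col):
    # Silverman's rule of thumb for one dimension (see Appendix 1).
    iqr = np.subtract(*np.percentile(col, [75, 25]))
    return 1.06 * min(col.std(), iqr / 1.34) * len(col) ** (-0.2)

def density_rank(O, q, dims):
    # Rank of object q by quasi-density in subspace dims; rank 1 = sparsest.
    X = O[:, dims]
    h = np.array([bandwidth(X[:, j]) for j in range(X.shape[1])])
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2 / (2 * h ** 2)).sum(axis=2)
    dens = np.exp(-d2).sum(axis=1)          # product Gaussian kernel, constants dropped
    return int((dens < dens[q]).sum()) + 1  # objects with strictly smaller density, plus one

def outlying_aspects(O, q, max_dim=3):
    # Exhaustive search over subspaces; the paper's contribution is avoiding exactly this.
    best, answers = len(O) + 1, []
    for k in range(1, max_dim + 1):
        for dims in combinations(range(O.shape[1]), k):
            r = density_rank(O, q, list(dims))
            if r < best:
                best, answers = r, [dims]
            elif r == best and not any(set(a) <= set(dims) for a in answers):
                answers.append(dims)         # keep only minimal subspaces
    return best, answers
```

For data with tens of dimensions this exhaustive enumeration is exponential in the dimensionality, which is the cost the heuristic search developed in the paper is designed to avoid.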

Notes

  1. http://sports.yahoo.com/nba/stats.

  2. The object id and dimension id in Tables 7 and 8 are consistent with the original data sets in Keller et al. (2012).

References

  • Aggarwal CC (2013) An introduction to outlier analysis. Springer, New York

  • Aggarwal CC, Yu PS (2001) Outlier detection for high dimensional data. In: Proceedings of the 2001 ACM SIGMOD international conference on management of data, SIGMOD ’01, pp 37–46

  • Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large data bases, VLDB ’94, pp 487–499

  • Angiulli F, Fassetti F, Palopoli L (2009) Detecting outlying properties of exceptional objects. ACM Trans Database Syst 34(1):7:1–7:62

  • Angiulli F, Fassetti F, Palopoli L, Manco G (2013) Outlying property detection with numerical attributes. CoRR abs/1306.3558

  • Bache K, Lichman M (2013) UCI machine learning repository

  • Bhaduri K, Matthews BL, Giannella CR (2011) Algorithms for speeding up distance-based outlier detection. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’11, pp 859–867

  • Böhm K, Keller F, Müller E, Nguyen HV, Vreeken J (2013) CMI: an information-theoretic contrast measure for enhancing subspace cluster and outlier detection. In: Proceedings of the 13th SIAM international conference on data mining, SDM ’13, pp 198–206

  • Breunig MM, Kriegel HP, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD international conference on management of data, SIGMOD ’00, pp 93–104

  • Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(3):15:1–15:58

  • Han J, Kamber M, Pei J (2011) Data mining: concepts and techniques, 3rd edn. Morgan Kaufmann Publishers Inc., San Francisco

  • Härdle W (1990) Smoothing techniques: with implementations in S. Springer, New York

  • Härdle W, Werwatz A, Müller M, Sperlich S (2004) Nonparametric and semiparametric models. Springer Series in Statistics. Springer, Berlin

  • He Z, Xu X, Huang ZJ, Deng S (2005) FP-outlier: frequent pattern based outlier detection. Comput Sci Inf Syst/ComSIS 2(1):103–118

  • Keller F, Müller E, Böhm K (2012) HiCS: high contrast subspaces for density-based outlier ranking. In: Proceedings of the 28th international conference on data engineering, ICDE ’12, pp 1037–1048

  • Knorr EM, Ng RT (1999) Finding intensional knowledge of distance-based outliers. In: Proceedings of the 25th international conference on very large data bases, VLDB ’99, pp 211–222

  • Kriegel HP, Schubert M, Zimek A (2008) Angle-based outlier detection in high-dimensional data. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’08, pp 444–452

  • Kriegel HP, Kröger P, Schubert E, Zimek A (2009) Outlier detection in axis-parallel subspaces of high dimensional data. In: Proceedings of the 13th Pacific-Asia conference on advances in knowledge discovery and data mining, PAKDD ’09, pp 831–838

  • Müller E, Schiffer M, Seidl T (2011) Statistical selection of relevant subspace projections for outlier ranking. In: Proceedings of the 27th IEEE international conference on data engineering, ICDE ’11, pp 434–445

  • Müller E, Assent I, Iglesias P, Mülle Y, Böhm K (2012a) Outlier ranking via subspace analysis in multiple views of the data. In: Proceedings of the 12th IEEE international conference on data mining, ICDM ’12, pp 529–538

  • Müller E, Keller F, Blanc S, Böhm K (2012b) OutRules: a framework for outlier descriptions in multiple context spaces. In: ECML/PKDD (2), pp 828–832

  • Paravastu R, Kumar H, Pudi V (2008) Uniqueness mining. In: Proceedings of the 13th international conference on database systems for advanced applications, DASFAA ’08, pp 84–94

  • Ramaswamy S, Rastogi R, Shim K (2000) Efficient algorithms for mining outliers from large data sets. In: Proceedings of the 2000 ACM SIGMOD international conference on management of data, SIGMOD ’00, pp 427–438

  • Rymon R (1992) Search through systematic set enumeration. In: Proceedings of the 3rd international conference on principle of knowledge representation and reasoning, KR ’92, pp 539–550

  • Scott DW (1992) Multivariate density estimation: theory, practice, and visualization. Wiley Series in Probability and Statistics. Wiley, New York

  • Silverman BW (1986) Density estimation for statistics and data analysis. Chapman and Hall/CRC, London

  • Tang G, Bailey J, Pei J, Dong G (2013) Mining multidimensional contextual outliers from categorical relational data. In: Proceedings of the 25th international conference on scientific and statistical database management, SSDBM ’13, pp 43:1–43:4

  • Zimek A, Schubert E, Kriegel HP (2012) A survey on unsupervised outlier detection in high-dimensional numerical data. Stat Anal Data Min 5(5):363–387

Acknowledgments

The authors thank the editor and the anonymous reviewers for their invaluable comments, which helped improve this paper. Lei Duan’s research is supported in part by the Natural Science Foundation of China (Grant No. 61103042) and the China Postdoctoral Science Foundation (Grant No. 2014M552371). Work by Lei Duan at Simon Fraser University was supported in part by an Ebco/Eppich visiting professorship. Jian Pei’s and Guanting Tang’s research is supported in part by an NSERC Discovery grant and a BCIC NRAS Team Project. James Bailey’s work is supported by an ARC Future Fellowship (FT110100112). All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.

Author information

Corresponding author

Correspondence to Lei Duan.

Additional information

Responsible editors: Toon Calders, Floriana Esposito, Eyke Hüllermeier, Rosa Meo.

Appendices

Appendix 1: Proof of Proposition 1

Proof

For any dimension \(D_i \in S\,(1 \le i \le d)\), the mean of \(\{o.D_i \mid o \in O\}\), denoted by \(\mu _i\), is \(\frac{1}{|O|}\sum \limits _{o \in O}o.D_i\), the standard deviation of \(\{o.D_i \mid o \in O\}\), denoted by \(\sigma _i\), is \(\sqrt{\frac{1}{|O|}\sum \limits _{o \in O}(o.D_i - \mu _i)^2}\), and the bandwidth of \(D_i\), denoted by \(h_i\), is \(1.06\min \{\sigma _i, \frac{R}{1.34}\}|O|^{-\frac{1}{5}}\), where \(R\) is the interquartile range, that is, the difference between the third and the first quartiles, of \(O\) in \(D_i\).

We perform the linear transformation \(g(o).D_i = a_io.D_i + b_i\) for any \(o \in O\). Then, the mean value of \(\{g(o).D_i \mid o \in O\}\) is \(\frac{1}{|O|}\sum \limits _{o \in O}(a_i o.D_i + b_i) = a_i \mu _i + b_i\), and the standard deviation of \(\{g(o).D_i \mid o \in O\}\) is \(\sqrt{\frac{1}{|O|}\sum \limits _{o \in O}(a_i o.D_i + b_i - a_i \mu _i - b_i)^2} = a_i \sqrt{\frac{1}{|O|}\sum \limits _{o \in O}(o.D_i - \mu _i)^2} = a_i \sigma _i\).

Correspondingly, after the linear transformation, the bandwidth of \(D_i\) is \(1.06\min \{a_i\sigma _i, \frac{a_iR}{1.34}\}|O|^{-\frac{1}{5}} = a_i h_i\). As the distance between two objects in \(D_i\) is also scaled by \(a_i\), the quasi-density calculated by Eq. 7 remains unchanged. Thus, the ranking is invariant under linear transformation. \(\square \)
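The invariance can be checked numerically. The sketch below assumes the quasi-density of Eq. 7 with the Silverman bandwidth defined above and uses positive scaling factors \(a_i\); the helper names are illustrative, not the paper's implementation.

```python
import numpy as np

def quasi_density(X):
    # Product Gaussian quasi-density with per-dimension Silverman bandwidths.
    n, d = X.shape
    h = np.empty(d)
    for j in range(d):
        iqr = np.subtract(*np.percentile(X[:, j], [75, 25]))
        h[j] = 1.06 * min(X[:, j].std(), iqr / 1.34) * n ** (-0.2)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2 / (2 * h ** 2)).sum(axis=2)
    return np.exp(-d2).sum(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
a = np.array([2.0, 5.0, 0.5])                 # a_i > 0
b = np.array([-1.0, 3.0, 10.0])
ranks_before = quasi_density(X).argsort().argsort()
ranks_after = quasi_density(a * X + b).argsort().argsort()
print((ranks_before == ranks_after).all())    # True, up to floating-point ties
```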

Appendix 2: Proof of Theorem 1

Proof

  (i)

    Given an object \(o' \in TN_S^{\epsilon ,o}\), for any dimension \(D_i \in S\), \(\min \limits _{o'' \in O}\{|o.D_i - o''.D_i|\} \le |o.D_i - o'.D_i| \le \epsilon _{D_i}\). Thus,

    $$\begin{aligned} e^{- \sum \limits _{D_i \in S} \frac{\epsilon _{D_i}^2}{2h_{D_i}^2}} \le e^{- \sum \limits _{D_i \in S} \frac{|o.D_i - o'.D_i|^2}{2h_{D_i}^2}} \le e^{- \sum \limits _{D_i \in S} \frac{\min \limits _{o'' \in O}\left\{ |o.D_i - o''.D_i|\right\} ^2}{2h_{D_i}^2}}. \end{aligned}$$

    That is, \(dc_S^\epsilon \le dc_S(o, o') \le dc^{max}_S(o)\).

  (ii)

    Given an object \(o' \in LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}\), for any dimension \(D_i \in S\), \(\min \limits _{o'' \in O}\{|o.D_i - o''.D_i|\} \le |o.D_i - o'.D_i| \le \max \limits _{o'' \in O}\{|o.D_i - o''.D_i|\}\). Thus,

    $$\begin{aligned} e^{- \sum \limits _{D_i \in S} \frac{\max \limits _{o'' \in O}\left\{ |o.D_i - o''.D_i|\right\} ^2}{2h_{D_i}^2}} \le e^{- \sum \limits _{D_i \in S} \frac{|o.D_i - o'.D_i|^2}{2h_{D_i}^2}} \le e^{- \sum \limits _{D_i \in S} \frac{\min \limits _{o'' \in O}\left\{ |o.D_i - o''.D_i|\right\} ^2}{2h_{D_i}^2}}. \end{aligned}$$

    That is, \(dc^{min}_S(o) \le dc_S(o, o') \le dc^{max}_S(o)\).

  (iii)

    Given an object \(o' \in O \setminus LN_S^{\epsilon ,o}\), for any dimension \(D_i \in S\), \(\epsilon _{D_i} < |o.D_i - o'.D_i| \le \max \limits _{o'' \in O}\{|o.D_i - o''.D_i|\}\). Thus,

    $$\begin{aligned} e^{- \sum \limits _{D_i \in S} \frac{\max \limits _{o'' \in O}\{|o.D_i - o''.D_i|\}^2}{2h_{D_i}^2}} \le e^{- \sum \limits _{D_i \in S} \frac{|o.D_i - o'.D_i|^2}{2h_{D_i}^2}} < e^{- \sum \limits _{D_i \in S} \frac{\epsilon _{D_i}^2}{2h_{D_i}^2}}. \end{aligned}$$

    That is, \(dc^{min}_S(o) \le dc_S(o, o') < dc_S^{\epsilon }\).

\(\square \)
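The three bounds can be illustrated numerically. In the sketch below, the density contribution is taken to be \(dc_S(o, o') = e^{-\sum _{D_i \in S} |o.D_i - o'.D_i|^2 / (2h_{D_i}^2)}\) as in the proof, and \(TN_S^{\epsilon ,o}\) is taken to be the set of objects within \(\epsilon _{D_i}\) of \(o\) in every dimension of \(S\), which is how the case analysis reads; both are assumptions about notation defined in the body of the paper, and the variable names are ours.

```python
import numpy as np

def contribution_bounds(X, q, h, eps):
    # dc_S(o, o') for every o', plus the three bounds used in Theorem 1.
    diff = np.abs(X - X[q])                                          # |o.D_i - o'.D_i|
    dc = np.exp(-(diff ** 2 / (2 * h ** 2)).sum(axis=1))
    dc_max = np.exp(-(diff.min(axis=0) ** 2 / (2 * h ** 2)).sum())   # dc_S^max(o)
    dc_min = np.exp(-(diff.max(axis=0) ** 2 / (2 * h ** 2)).sum())   # dc_S^min(o)
    dc_eps = np.exp(-(eps ** 2 / (2 * h ** 2)).sum())                # dc_S^eps
    tn = (diff <= eps).all(axis=1)                                   # assumed tight neighbourhood
    assert ((dc_min <= dc) & (dc <= dc_max)).all()                   # case (ii) for every object
    assert (dc[tn] >= dc_eps).all()                                  # case (i) for objects in TN
    return dc, dc_eps, dc_min, dc_max

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
contribution_bounds(X, q=0, h=np.array([0.3, 0.3]), eps=np.array([0.5, 0.5]))
```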

Appendix 3: Proof of Corollary 1

Proof

We divide \(O\) into three disjoint subsets \(TN_S^{\epsilon ,o}\), \(LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}\) and \(O \setminus LN_S^{\epsilon ,o}\). By Theorem 1, for objects belonging to \(TN_S^{\epsilon ,o}\), we have

$$\begin{aligned} |TN_S^{\epsilon ,o}| \ dc_S^{\epsilon } \le \sum \limits _{o' \in TN_S^{\epsilon ,o}}dc_S\left( o, o'\right) \le |TN_S^{\epsilon ,o}| \ dc_S^{max}(o) \end{aligned}$$

For objects belonging to \(LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}\), we have

$$\begin{aligned}&\left( |LN_S^{\epsilon ,o}|-|TN_S^{\epsilon ,o}|\right) \ dc_S^{min}(o) \le \sum \limits _{o' \in LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}}dc_S\left( o, o'\right) \\&\quad \le \left( |LN_S^{\epsilon ,o}|-|TN_S^{\epsilon ,o}|\right) \ dc_S^{max}(o) \end{aligned}$$

For objects belonging to \(O \setminus LN_S^{\epsilon ,o}\), we have

$$\begin{aligned} \left( |O|-|LN_S^{\epsilon ,o}|\right) \ dc_S^{min}(o) \le \sum \limits _{o' \in O \setminus LN_S^{\epsilon ,o}}dc_S\left( o, o'\right) < (|O|-|LN_S^{\epsilon ,o}|)\ dc_S^{\epsilon } \end{aligned}$$

As

$$\begin{aligned} \tilde{f}_S(o)&= \sum \limits _{o' \in O}dc_S\left( o, o'\right) = \sum \limits _{o' \in TN_S^{\epsilon ,o}}dc_S\left( o, o'\right) + \sum \limits _{o' \in LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}} dc_S\left( o, o'\right) \\&+ \sum \limits _{o' \in O \setminus LN_S^{\epsilon ,o}}dc_S\left( o, o'\right) , \end{aligned}$$

we have

$$\begin{aligned} \tilde{f}_S(o)&\ge |TN_S^{\epsilon ,o}| \ dc_S^{\epsilon } + \left( |LN_S^{\epsilon ,o}|-|TN_S^{\epsilon ,o}|\right) \ dc_S^{min}(o) + \left( |O|-|LN_S^{\epsilon ,o}|\right) \ dc_S^{min}(o) \\&= |TN_S^{\epsilon ,o}| \ dc_S^{\epsilon } + \left( |O|-|TN_S^{\epsilon ,o}|\right) \ dc_S^{min}(o)\\ \tilde{f}_S(o)&\le |TN_S^{\epsilon ,o}| \ dc_S^{max}(o) + \left( |LN_S^{\epsilon ,o}|-|TN_S^{\epsilon ,o}|\right) \ dc_S^{max}(o) + \left( |O|-|LN_S^{\epsilon ,o}|\right) \ dc_S^{\epsilon } \\&= |LN_S^{\epsilon ,o}| \ dc_S^{max}(o) + \left( |O|-|LN_S^{\epsilon ,o}|\right) \ dc_S^{\epsilon } \end{aligned}$$

Moreover, if \(LN_S^{\epsilon ,o} \subset O\), i.e. \(O \setminus LN_S^{\epsilon ,o} \ne \emptyset \), then

$$\begin{aligned} \tilde{f}_S(o) < |LN_S^{\epsilon ,o}| \ dc_S^{max}(o) + \left( |O|-|LN_S^{\epsilon ,o}|\right) \ dc_S^{\epsilon } \end{aligned}$$

\(\square \)
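Continuing the sketch above, the corollary lets one bracket \(\tilde{f}_S(o)\) from the neighbourhood sizes alone. Here \(LN_S^{\epsilon ,o}\) is assumed to be the set of objects within \(\epsilon _{D_i}\) of \(o\) in at least one dimension of \(S\), so that objects outside it exceed \(\epsilon _{D_i}\) in every dimension as case (iii) requires; again an assumption about notation defined in the body of the paper.

```python
import numpy as np

def density_interval(X, q, h, eps):
    # Lower and upper bounds on the quasi-density of X[q] (Corollary 1),
    # computed from |TN|, |LN| and the three per-object bounds only.
    diff = np.abs(X - X[q])
    dc_max = np.exp(-(diff.min(axis=0) ** 2 / (2 * h ** 2)).sum())
    dc_min = np.exp(-(diff.max(axis=0) ** 2 / (2 * h ** 2)).sum())
    dc_eps = np.exp(-(eps ** 2 / (2 * h ** 2)).sum())
    n_tn = int((diff <= eps).all(axis=1).sum())   # |TN_S^{eps,o}|
    n_ln = int((diff <= eps).any(axis=1).sum())   # |LN_S^{eps,o}| (assumed definition)
    n = len(X)
    lower = n_tn * dc_eps + (n - n_tn) * dc_min
    upper = n_ln * dc_max + (n - n_ln) * dc_eps
    return lower, upper
```

The interval encloses the exact sum \(\sum _{o' \in O} dc_S(o, o')\), so a full pass over \(O\) can be avoided whenever the interval alone already decides how an object compares with the query.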

Appendix 4: Proof of Corollary 2

Proof

Since \(O' \subseteq TN_S^{\epsilon ,o}\), we divide the objects in \(O \setminus O'\) into \(TN_S^{\epsilon ,o} \setminus O'\), \(LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}\) and \(O \setminus LN_S^{\epsilon ,o}\). Then

$$\begin{aligned} \tilde{f}_S(o)&= \tilde{f}^{O'}_S(o) + \sum \limits _{o' \in TN_S^{\epsilon ,o} \setminus O'}dc_S\left( o, o'\right) \\&+\, \sum \limits _{o' \in LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}} dc_S\left( o, o'\right) + \sum \limits _{o' \in O \setminus LN_S^{\epsilon ,o}}dc_S\left( o, o'\right) , \end{aligned}$$

By Theorem 1, for objects belonging to \(TN_S^{\epsilon ,o} \setminus O'\), we have

$$\begin{aligned} \left( |TN_S^{\epsilon ,o}|-|O'|\right) \ dc_S^{\epsilon } \le \sum \limits _{o' \in TN_S^{\epsilon ,o} \setminus O'}dc_S\left( o, o'\right) \le (|TN_S^{\epsilon ,o}|-|O'|) \ dc_S^{max}(o) \end{aligned}$$

For objects belonging to \(LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}\), we have

$$\begin{aligned}&\left( |LN_S^{\epsilon ,o}|-|TN_S^{\epsilon ,o}|\right) \ dc_S^{min}(o) \le \sum \limits _{o' \in LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}}dc_S\left( o, o'\right) \\&\quad \le \left( |LN_S^{\epsilon ,o}|-|TN_S^{\epsilon ,o}|\right) \ dc_S^{max}(o) \end{aligned}$$

For objects belonging to \(O {\setminus } LN_S^{\epsilon ,o}\), we have

$$\begin{aligned} (|O|-|LN_S^{\epsilon ,o}|) \ dc_S^{min}(o) \le \sum \limits _{o' \in O \setminus LN_S^{\epsilon ,o}}dc_S\left( o, o'\right) < (|O|-|LN_S^{\epsilon ,o}|) \ dc_S^{\epsilon } \end{aligned}$$

Thus,

$$\begin{aligned} \tilde{f}_S(o)&\ge \tilde{f}^{O'}_S(o) + \left( |TN_S^{\epsilon ,o}| - |O'|\right) \ dc_S^{\epsilon } + \left( |LN_S^{\epsilon ,o}|-|TN_S^{\epsilon ,o}|\right) \ dc_S^{min}(o)\\&+ \left( |O|-|LN_S^{\epsilon ,o}|\right) \ dc_S^{min}(o)\\&= \tilde{f}_{S}^{O'}(o)+ \left( |TN_S^{\epsilon ,o}| - |O'|\right) \ dc_S^\epsilon + \left( |O| - |TN_S^{\epsilon ,o}|\right) \ dc_S^{min}(o)\\ \tilde{f}_S(o)&\le \tilde{f}^{O'}_S(o) + \left( |TN_S^{\epsilon ,o}|-|O'|\right) \ dc_S^{max}(o) + \left( |LN_S^{\epsilon ,o}|-|TN_S^{\epsilon ,o}|\right) \ dc_S^{max}(o)\\&+\left( |O|-|LN_S^{\epsilon ,o}|\right) \ dc_S^{\epsilon } \\&= \tilde{f}^{O'}_S(o) + \left( |LN_S^{\epsilon ,o}|-|O'|\right) \ dc_S^{max}(o) + \left( |O|-|LN_S^{\epsilon ,o}|\right) \ dc_S^{\epsilon } \end{aligned}$$

Moreover, if \(LN_S^{\epsilon ,o} \subset O\), i.e. \(O \setminus LN_S^{\epsilon ,o} \ne \emptyset \), then

$$\begin{aligned} \tilde{f}_S(o) < \tilde{f}^{O'}_S(o) + \left( |LN_S^{\epsilon ,o}|-|O'|\right) \ dc_S^{max}(o) + \left( |O|-|LN_S^{\epsilon ,o}|\right) \ dc_S^{\epsilon } \end{aligned}$$

\(\square \)

Appendix 5: Proof of Corollary 3

Proof

Since \(TN_S^{\epsilon ,o} \subset O' \subseteq LN_S^{\epsilon ,o}\), we divide the objects in \(O \setminus O'\) into \(LN_S^{\epsilon ,o} \setminus O'\) and \(O \setminus LN_S^{\epsilon ,o}\). Then

$$\begin{aligned} \tilde{f}_S(o) = \tilde{f}^{O'}_S(o) + \sum \limits _{o' \in LN_S^{\epsilon ,o} \setminus O'} dc_S\left( o, o'\right) + \sum \limits _{o' \in O \setminus LN_S^{\epsilon ,o}}dc_S\left( o, o'\right) , \end{aligned}$$

By Theorem 1, for objects belonging to \(LN_S^{\epsilon ,o} {\setminus } O'\), we have

$$\begin{aligned} \left( |LN_S^{\epsilon ,o}|-|O'|\right) \ dc_S^{min}(o) \le \sum \limits _{o' \in LN_S^{\epsilon ,o} \setminus O'}dc_S\left( o, o'\right) \le \left( |LN_S^{\epsilon ,o}|-|O'|\right) \ dc_S^{max}(o) \end{aligned}$$

For objects belonging to \(O {\setminus } LN_S^{\epsilon ,o}\), we have

$$\begin{aligned} \left( |O|-|LN_S^{\epsilon ,o}|\right) \ dc_S^{min}(o) \le \sum \limits _{o' \in O \setminus LN_S^{\epsilon ,o}}dc_S\left( o, o'\right) < \left( |O|-|LN_S^{\epsilon ,o}|\right) \ dc_S^{\epsilon } \end{aligned}$$

Thus,

$$\begin{aligned} \tilde{f}_S(o)&\ge \tilde{f}^{O'}_S(o) + \left( |LN_S^{\epsilon ,o}|-|O'|\right) \ dc_S^{min}(o) + \left( |O|-|LN_S^{\epsilon ,o}|\right) \ dc_S^{min}(o) \\&= \tilde{f}_{S}^{O'}(o) + (|O| - |O'|) \ dc_S^{min}(o) \\ \tilde{f}_S(o)&\le \tilde{f}^{O'}_S(o) + \left( |LN_S^{\epsilon ,o}|-|O'|\right) \ dc_S^{max}(o) + \left( |O|-|LN_S^{\epsilon ,o}|\right) \ dc_S^{\epsilon } \end{aligned}$$

Moreover, if \(LN_S^{\epsilon ,o} \subset O\), i.e. \(O {\setminus } LN_S^{\epsilon ,o} \ne \emptyset \), then

$$\begin{aligned} \tilde{f}_S(o) < \tilde{f}^{O'}_S(o) + (|LN_S^{\epsilon ,o}|-|O'|) \ dc_S^{max}(o) + (|O|-|LN_S^{\epsilon ,o}|) \ dc_S^{\epsilon } \end{aligned}$$

\(\square \)

Appendix 6: Proof of Corollary 4

Proof

Since \(LN_S^{\epsilon ,o} \subset O' \subseteq O\), every object in \(O \setminus O'\) lies outside \(LN_S^{\epsilon ,o}\). Then

$$\begin{aligned} \tilde{f}_S(o) = \tilde{f}^{O'}_S(o) + \sum \limits _{o' \in O \setminus O'} dc_S\left( o, o'\right) , \end{aligned}$$

By Theorem 1, for objects belonging to \(O \setminus O'\), we have

$$\begin{aligned} \left( |O|-|O'|\right) \ dc_S^{min}(o) \le \sum \limits _{o' \in O \setminus O'} dc_S\left( o, o'\right) \le \left( |O|-|O'|\right) \ dc_S^{\epsilon } \end{aligned}$$

Thus,

$$\begin{aligned} \tilde{f}_S(o)&\ge \tilde{f}^{O'}_S(o) + \left( |O|-|O'|\right) \ dc_S^{min}(o)\\ \tilde{f}_S(o)&\le \tilde{f}^{O'}_S(o) + \left( |O|-|O'|\right) \ dc_S^{\epsilon } \end{aligned}$$

\(\square \)

Appendix 7: Proof of Theorem 2

Proof

We prove by contradiction.

Given a set of objects \(O\), a subspace \(S\), and two neighborhood distances \(\epsilon _1\) and \(\epsilon _2\), let \(q \in O\) be the query object. For an object \(o \in O\), denote by \(L_{\epsilon _1}\) the lower bound of \(\tilde{f}_S(o)\) estimated using \(\epsilon _1\), and by \(U_{\epsilon _2}\) the upper bound of \(\tilde{f}_S(o)\) estimated using \(\epsilon _2\).

Assume, to the contrary, that \(\tilde{f}_S(q) < L_{\epsilon _1}\) and \(\tilde{f}_S(q) > U_{\epsilon _2}\).

As \(L_{\epsilon _1}\) is a lower bound of \(\tilde{f}_S(o)\) and \(U_{\epsilon _2}\) is an upper bound of \(\tilde{f}_S(o)\), we have \(L_{\epsilon _1} \le \tilde{f}_S(o) \le U_{\epsilon _2}\). Then \(\tilde{f}_S(q) < L_{\epsilon _1} \le \tilde{f}_S(o)\) and \(\tilde{f}_S(o) \le U_{\epsilon _2} < \tilde{f}_S(q)\). Consequently, \(\tilde{f}_S(o) < \tilde{f}_S(q) < \tilde{f}_S(o)\), a contradiction.

Thus, \(rank^{\epsilon _1}_S(q) = |\{o \in O \mid \tilde{f}_S(o) < \tilde{f}_S(q)\}|+1 =rank^{\epsilon _2}_S(q)\). \(\square \)
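The rank used in the last line is simply the number of objects whose quasi-density is strictly smaller than the query's, plus one; the theorem states that this rank does not depend on which neighborhood distance is used to derive the bounds. A small sketch of the rank computation (illustrative naming):

```python
import numpy as np

def density_rank(densities, q):
    # rank_S(q) = |{o in O : f_S(o) < f_S(q)}| + 1
    return int((np.asarray(densities) < densities[q]).sum()) + 1
```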

Appendix 8: Proof of Theorem 3

Proof

We prove by contradiction.

Let \(Ans\) be the set of minimal outlying subspaces of \(q\) found by OAMiner, and \(r_{best}\) the best rank found. Assume, to the contrary, that some subspace \(S \notin Ans\) with \(S \subseteq D\) and \(0 < |S| \le \ell \) is a minimal outlying subspace of \(q\).

Recall that OAMiner searches subspaces by traversing the subspace enumeration tree in a depth-first manner. As \(S \notin Ans\), \(S\) is pruned by Pruning Rule 1 or Pruning Rule 2.

In the case that \(S\) is pruned by Pruning Rule 1, \(S\) is not minimal, a contradiction.

In the case that \(S\) is pruned by Pruning Rule 2, there exists a subspace \(S'\) such that \(S'\) is a parent of \(S\) in the subspace enumeration tree and \(|Comp_{S'}(q)| \ge r_{best}\). By the property of competitors, we have \(Comp_{S'}(q) \subseteq Comp_S(q)\). Correspondingly, \(rank_S(q) \ge |Comp_S(q)| \ge |Comp_{S'}(q)| \ge r_{best}\), a contradiction. \(\square \)
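For reference, the depth-first traversal of the set-enumeration tree (Rymon 1992) that the proof relies on can be sketched as below. The two pruning rules are left as a placeholder predicate, since their exact tests (minimality and the competitor-count bound \(|Comp_{S'}(q)| \ge r_{best}\)) are defined in the body of the paper; pruning a node cuts its entire subtree, which is why a pruned subspace is never reported.

```python
def traverse(dims, max_len, visit, prune):
    # Depth-first set-enumeration: children of S extend S with dimensions that
    # come after S's last dimension, so every subspace is visited exactly once.
    def expand(prefix, start):
        for i in range(start, len(dims)):
            S = prefix + [dims[i]]
            if len(S) > max_len or prune(S):
                continue          # skip S and the whole subtree below it
            visit(S)
            expand(S, i + 1)
    expand([], 0)

# Example: all subspaces of {A, B, C} with at most 2 dimensions, no pruning.
traverse(['A', 'B', 'C'], 2, visit=print, prune=lambda S: False)
```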

Cite this article

Duan, L., Tang, G., Pei, J. et al. Mining outlying aspects on numeric data. Data Min Knowl Disc 29, 1116–1151 (2015). https://doi.org/10.1007/s10618-014-0398-2
