Determining the familial risk distribution of colorectal cancer: a data mining approach

  • Original Article
Familial Cancer

Abstract

This study aimed to characterize the distribution of colorectal cancer risk using family history of cancers by data mining. Family histories for 10,066 colorectal cancer cases recruited to population cancer registries of the Colon Cancer Family Registry were analyzed using a data mining framework. A novel index was developed to quantify familial cancer aggregation. An artificial neural network was used to identify distinct categories of familial risk. Standardized incidence ratios (SIRs) and corresponding 95 % confidence intervals (CIs) of colorectal cancer were calculated for each category. We identified five major and 66 minor categories of familial risk for developing colorectal cancer. The distribution of the major risk categories was: (1) 7 % of families (SIR = 7.11; 95 % CI 6.65–7.59) had a strong family history of colorectal cancer; (2) 13 % of families (SIR = 2.94; 95 % CI 2.78–3.10) had a moderate family history of colorectal cancer; (3) 11 % of families (SIR = 1.23; 95 % CI 1.12–1.36) had a strong family history of breast cancer and a weak family history of colorectal cancer; (4) 9 % of families (SIR = 1.06; 95 % CI 0.96–1.18) had a strong family history of prostate cancer and a weak family history of colorectal cancer; and (5) 60 % of families (SIR = 0.61; 95 % CI 0.57–0.65) had a weak family history of all cancers. There is wide variation in colorectal cancer risk that can be categorized by family history of cancer, with a strong gradient of risk between the highest and lowest risk categories. The risk of colorectal cancer for people in the highest risk category of family history (7 % of the population) was 12 times that for people in the lowest risk category (60 % of the population). Data mining proved to be an effective approach for gaining insight into the underlying cancer aggregation patterns and for categorizing familial risk of colorectal cancer.

References

  1. Taylor DP, Burt RW, Williams MS, Haug PJ, Cannon-Albright LA (2010) Population-based family history-specific risks for colorectal cancer: a constellation approach. Gastroenterology 138(3):877–885

  2. Baglietto L, Jenkins MA, Severi G et al (2006) Measures of familial aggregation depend on definition of family history: meta-analysis for colorectal cancer. J Clin Epidemiol 59(2):114–124

  3. Al-Sukhni W, Aronson M, Gallinger S (2008) Hereditary colorectal cancer syndromes: familial adenomatous polyposis and Lynch syndrome. Surg Clin North Am 88(4):819–844

  4. Fain PR, Goldgar DE (1986) A nonparametric test of heterogeneity of family risk. Genet Epidemiol Suppl 1:61–66

  5. Negri E, Braga C, La Vecchia C et al (1998) Family history of cancer and risk of colorectal cancer in Italy. Br J Cancer 77(1):174–179

  6. Johns LE, Houlston RS (2001) A systematic review and meta-analysis of familial colorectal cancer risk. Am J Gastroenterol 96(10):2992–3003

  7. Fuchs CS, Giovannucci EL, Colditz GA, Hunter DJ, Speizer FE, Willett WC (1994) A prospective study of family history and the risk of colorectal cancer. N Engl J Med 331(25):1669–1674

  8. Goldgar DE, Easton DF, Cannon-Albright LA, Skolnick MH (1994) Systematic population-based assessment of cancer risk in first-degree relatives of cancer probands. J Natl Cancer Inst 86(21):1600–1608

  9. Ahsan H, Neugut AI, Garbowski GC et al (1998) Family history of colorectal adenomatous polyps and increased risk for colorectal cancer. Ann Intern Med 128(11):900–905

  10. Winawer SJ, Zauber AG, Gerdes H et al (1996) Risk of colorectal cancer in the families of patients with adenomatous polyps. National Polyp Study Workgroup. N Engl J Med 334(2):82–87

  11. Slattery ML, Kerber RA (1994) Family history of cancer and colon cancer risk: the Utah population database. J Natl Cancer Inst 86(21):1618–1626

  12. Newcomb PA, Baron J, Cotterchio M et al (2007) Colon Cancer Family Registry: an international resource for studies of the genetic epidemiology of colon cancer. Cancer Epidemiol Biomark Prev 16(11):2331–2343

  13. Win AK, Lindor NM, Young JP et al (2012) Risks of primary extracolonic cancers following colorectal cancer in Lynch syndrome. J Natl Cancer Inst 104(18):1363–1372

  14. Kohonen T (2001) Self-organizing maps, 3rd edn. Springer, Berlin

  15. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323

  16. Hastie T, Tibshirani R, Friedman JH (2009) The elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer, New York

  17. Haykin SS (2009) Neural networks and learning machines, 3rd edn. Prentice Hall, New York

  18. Vesanto J, Himberg J, Alhoniemi E, Parhankangas J (2000) SOM toolbox for Matlab. Tech Rep Laboratory of Computer and Information Science, Helsinki University of Technology

  19. The MathWorks Inc (2010) MATLAB version 7.10.0. Natick, Massachusetts

  20. Breslow NE, Day NE (1987) Statistical methods in cancer research. Volume II: the design and analysis of cohort studies. IARC Sci Publ 82:1–406

  21. Parkin DM, Whelan SL, Ferlay J, Raymond L, Young J (1997) Cancer incidence in five continents, vol VII. International Agency for Research on Cancer, Lyon

  22. Gould W (1995) Jackknife estimation. Stata Tech Bull 4:25–29

  23. Ries L, Eisner M, Kosary C et al (2003) SEER cancer statistics review, 1975–2000. National Cancer Institute, Bethesda

  24. StataCorp (2009) Stata statistical software: release 11. StataCorp LP, College Station, TX

  25. Kerber RA, O’Brien E (2005) A cohort study of cancer risk in relation to family histories of cancer in the Utah population database. Cancer 103(9):1906–1915

  26. Teerlink CC, Albright FS, Lins L, Cannon-Albright LA (2012) A comprehensive survey of cancer risks in extended families. Genet Med 14(1):107–114

  27. Andrieu N, Launoy G, Guillois R, Ory-Paoletti C, Gignoux M (2004) Estimation of the familial relative risk of cancer by site from a French population based family study on colorectal cancer (CCREF study). Gut 53(9):1322–1328

  28. Hopper JL (2011) Disease-specific prospective family study cohorts enriched for familial risk. Epidemiol Perspect Innov 8(1):2

  29. Win AK, Ait Ouakrim D, Jenkins MA (2014) Risk profiling: familial colorectal cancer. Cancer Forum 38(1):15–25

  30. Hopper JL, Carlin JB (1992) Familial aggregation of a disease consequent upon correlation between relatives in a risk factor measured on a continuous scale. Am J Epidemiol 136(9):1138–1147

Acknowledgments

The authors thank all study participants of the Colon Cancer Family Registry and staff for their contributions to this project.

Funding

This work was supported by Grant UM1 CA167551 from the National Cancer Institute, National Institutes of Health (NIH) and through cooperative agreements with the following Colon Cancer Family Registry (CCFR) centers: Australasian Colorectal Cancer Family Registry (U01/U24 CA097735), Mayo Clinic Cooperative Family Registry for Colon Cancer Studies (U01/U24 CA074800), Ontario Familial Colorectal Cancer Registry (U01/U24 CA074783), Seattle Colorectal Cancer Family Registry (U01/U24 CA074794), and USC Consortium Colorectal Cancer Family Registry (U01/U24 CA074799). Seattle CCFR research was also supported by the Cancer Surveillance System of the Fred Hutchinson Cancer Research Center, which was funded by Contract Nos. N01-CN-67009 (1996–2003) and N01-PC-35142 (2003–2010) and Contract No. HHSN2612013000121 (2010–2017) from the Surveillance, Epidemiology and End Results (SEER) Program of the National Cancer Institute with additional support from the Fred Hutchinson Cancer Research Center. The collection of cancer incidence data used in this study was supported by the California Department of Public Health as part of the statewide cancer reporting program mandated by California Health and Safety Code Section 103885; the National Cancer Institute’s Surveillance, Epidemiology and End Results Program under contract HHSN261201000035C awarded to the University of Southern California, and contract HHSN261201000034C awarded to the Public Health Institute; and the Centers for Disease Control and Prevention’s National Program of Cancer Registries, under agreement U58DP003862-01 awarded to the California Department of Public Health. The ideas and opinions expressed herein are those of the author(s) and endorsement by the State of California, Department of Public Health, the National Cancer Institute, and the Centers for Disease Control and Prevention or their Contractors and Subcontractors is not intended nor should be inferred.
This work is also supported by Centre for Research Excellence grant APP1042021 and Program grant APP1074383 from the National Health and Medical Research Council (NHMRC), Australia. AKW is an NHMRC Early Career Fellow. MAJ is an NHMRC Senior Research Fellow. JLH is an NHMRC Senior Principal Research Fellow. DDB is a University of Melbourne Research at Melbourne Accelerator Program (R@MAP) Senior Research Fellow.

Author information

Corresponding author

Correspondence to Aung Ko Win.

Ethics declarations

Conflict of interest

The authors have no conflict of interest to declare with respect to this manuscript.

Additional information

Disclaimer: The content of this manuscript does not necessarily reflect the views or policies of the National Cancer Institute or any of the collaborating centers in the CFRs, nor does mention of trade names, commercial products, or organizations imply endorsement by the US Government or the CFR.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOCX 35 kb)

Supplementary material 2 (PPTX 722 kb)

Appendices

Appendix 1: Clustering families using self-organizing map

Let \(\mathbf{x}_{i} \in \mathbb{R}^{N}\) \((1 \le i \le M)\) be the familial aggregation vector of the ith family in the dataset, where N is the number of cancer categories included in the analysis and M is the total number of families. The self-organizing map consists of a regular grid of G nodes, each associated with an N-dimensional codebook vector. Let \(\mathbf{m}_{j} = [m_{jn} \mid 1 \le n \le N]\) \((1 \le j \le G)\) be the codebook vector of the jth node on the map. The training algorithm for forming the familial aggregation space is as follows:

  1. Present an input vector \(\mathbf{x}_{i}\) for training at random.

  2. Find the winning node s on the map with the vector \(\mathbf{m}_{s}\) which is closest to \(\mathbf{x}_{i}\) such that

    $$\left\| \mathbf{x}_{i} - \mathbf{m}_{s} \right\| = \min_{j} \left\| \mathbf{x}_{i} - \mathbf{m}_{j} \right\|$$

  3. After the winning node s is selected, update the weight of every node in the neighbourhood of node s by

    $$\mathbf{m}_{t}^{new} = \mathbf{m}_{t}^{old} + \alpha (t)\left( \mathbf{x}_{i} - \mathbf{m}_{t}^{old} \right)$$

    where \(\alpha (t)\) is the gain term at time t (\(0 \le \alpha (t) \le 1\)) that decreases in time and converges to 0.

  4. Increase the time stamp t and repeat the training process until it converges.

After the training process was completed, each input vector (i.e. family) was mapped to the grid node closest to it on the self-organizing map. A familial aggregation space was thus formed. This process corresponded to a projection of the multi-dimensional input vectors onto an ordered two-dimensional space in which the proximity of the input vectors was preserved as faithfully as possible. Consequently, familial similarities, in terms of both the types of extracolonic cancers and the strength of the CRC aggregation, were explicitly revealed by their locations and neighbourhood relationships on the map. For all families mapped to a node, a familial risk category, based on family history of cancer, was then revealed by retrieving the codebook vector corresponding to that node on the self-organizing map.
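The training and mapping steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation (the reference list points to the SOM toolbox for Matlab): the function names, the rectangular grid, the linearly decaying gain and neighbourhood radius, the fixed iteration count used in place of a convergence test, and the plain Euclidean norm are all choices made here for illustration. The toy input vectors are placeholders, not study data.

```python
import math
import random


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def winning_node(x, codebook):
    """Step 2: index of the node whose codebook vector is closest to x."""
    return min(range(len(codebook)), key=lambda j: euclidean(x, codebook[j]))


def train_som(data, rows, cols, n_iter=2000, alpha0=0.5, seed=0):
    """Train a rows x cols map; returns one codebook vector per node."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Assumption: initialise codebook vectors from randomly chosen inputs.
    codebook = [list(rng.choice(data)) for _ in grid]
    radius0 = max(rows, cols) / 2.0
    for t in range(n_iter):
        frac = 1.0 - t / n_iter            # shrinks from 1 towards 0
        alpha = alpha0 * frac              # gain term alpha(t), decreasing in time
        radius = max(radius0 * frac, 0.5)  # neighbourhood shrinks to the winner only
        x = rng.choice(data)               # step 1: random input vector
        s = winning_node(x, codebook)      # step 2: winning node
        for j, pos in enumerate(grid):     # step 3: update the neighbourhood of s
            if euclidean(pos, grid[s]) <= radius:
                codebook[j] = [m + alpha * (xi - m)
                               for m, xi in zip(codebook[j], x)]
    return codebook


# Toy familial aggregation vectors (hypothetical values).
families = [[0.0, 0.1, 0.1, 0.0], [0.1, 0.0, 0.0, 0.1], [0.1, 0.2, 0.2, 0.1]]
codebook = train_som(families, rows=2, cols=2, n_iter=500)
# Mapping step: each family is assigned to its closest node on the trained map.
assignments = [winning_node(x, codebook) for x in families]
```

The codebook vector at `codebook[assignments[i]]` then plays the role of the node-level familial risk category for family `i`.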

Appendix 2: Partitioning the self-organizing map using k-means

The k-means algorithm run on the familial aggregation space was as follows:

  1. Select k nodes from the self-organizing map as initial cluster centers.

  2. Form k clusters by assigning each node to its closest cluster center.

  3. Re-compute each cluster center as the mean of all its cluster members.

  4. Repeat the process from step 2 until the cluster centers no longer change.
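The four steps above can be sketched as follows. This is an illustrative sketch only: the function names, the random choice of initial centers, the iteration cap, and the use of squared Euclidean distance are assumptions made here, and the input vectors are hypothetical stand-ins for codebook vectors of map nodes.

```python
import random


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(nodes, k, max_iter=100, seed=0):
    """Partition codebook vectors into k clusters; returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct nodes as the initial cluster centers.
    centers = [list(v) for v in rng.sample(nodes, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each node to its closest cluster center.
        new_labels = [min(range(k), key=lambda c: sq_dist(v, centers[c]))
                      for v in nodes]
        # Step 4: unchanged assignments imply unchanged centers, so stop.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each center as the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(nodes, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Hypothetical codebook vectors forming two well-separated groups.
nodes = [[0.0, 0.0], [0.0, 0.1], [1.0, 1.0], [1.0, 0.9]]
centers, labels = kmeans(nodes, k=2)
```

On this toy input the first two nodes end up in one cluster and the last two in the other, and each center is the mean of its two members.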

K-means was run for different values of k, and we chose the optimal partition of the self-organizing map, validated by the Davies–Bouldin index [1], so that distances within clusters were minimized and distances between clusters were maximized. The optimal partition minimizes the Davies–Bouldin index:

$$\frac{1}{C}\sum_{i = 1}^{C} \max_{j \ne i} \left( \frac{S_{i} + S_{j}}{M_{ij}} \right)$$

where C is the number of clusters, \(S_{i}\) is the dispersion of cluster i, defined as the mean squared distance from the cluster center, and \(M_{ij}\) is the distance between the centers of clusters i and j [2]. Thus, under the optimal partition, which groups families by similarity of family history, a family is more similar to the families in its own cluster than to families in any other cluster.
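As a rough illustration of how the index compares candidate partitions, the sketch below computes it for two labelings of the same points. The function name and the toy data are hypothetical; the dispersion follows the definition given in the text (mean squared distance from the cluster center), the center-to-center distance is taken to be Euclidean, and the code assumes that no cluster is empty and that no two centers coincide.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def davies_bouldin(points, labels, centers):
    """Davies-Bouldin index of a partition; lower values are better."""
    k = len(centers)
    # S[i]: mean squared distance of cluster i's members from its center.
    S = []
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        S.append(sum(sq_dist(p, centers[c]) for p in members) / len(members))
    total = 0.0
    for i in range(k):
        # Worst-case ratio of within-cluster spread to between-center distance.
        total += max((S[i] + S[j]) / math.sqrt(sq_dist(centers[i], centers[j]))
                     for j in range(k) if j != i)
    return total / k


points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
# Compact, well-separated clusters: left pair versus right pair.
good = davies_bouldin(points, [0, 0, 1, 1], [[0.0, 1.0], [10.0, 1.0]])
# Poor partition: each cluster mixes a left point and a right point.
bad = davies_bouldin(points, [0, 1, 0, 1], [[5.0, 0.0], [5.0, 2.0]])
```

Here `good` is smaller than `bad`, so a selection rule that minimizes the index would prefer the first partition.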

Finally, k cluster-wide familial risk categories were revealed by finding the prototype vectors corresponding to the k cluster centers. A cluster-wide familial risk category characterizes each family of a cluster by summarizing the global characteristics of cancer aggregation of all families in that cluster. It is essentially the mean vector of all codebook vectors associated with a cluster.

Appendix 3: Distance measure for similarity of familial aggregation

Central to every clustering algorithm is a metric for measuring distance (or similarity) between objects. Euclidean distance

$$d\left( \mathbf{x}_{1}, \mathbf{x}_{2} \right) = \sqrt{\left\| \mathbf{x}_{1} \right\|^{2} + \left\| \mathbf{x}_{2} \right\|^{2} - 2\mathbf{x}_{1}^{\prime} \mathbf{x}_{2}}$$

is the default distance measure for most clustering algorithms, including the self-organizing map and k-means. One limitation of the Euclidean distance is that it does not discriminate features which are present in one vector but absent in the other [3], making it incapable of recognizing similarity of familial aggregation in an epidemiological sense. For example, suppose we have 3 families (a, b and c), each represented by a 4-dimensional familial aggregation vector featuring 4 cancers:

$$\begin{aligned} a & = \left( {0, 0.1, 0.1, 0} \right) \\ b & = \left( {0.1, 0, 0, 0.1} \right) \\ c & = \left( {0.1, 0.2, 0.2, 0.1} \right) \\ \end{aligned}$$

There is no common aggregating cancer between family a and family b, but there are two common aggregating cancers between family a and family c. In terms of familial aggregation, families sharing no common aggregating cancers should not be considered similar. Therefore, family a should be more similar to family c than to family b, but the Euclidean distance delivers a counter-intuitive result, d(a,b) = 0.2 and d(a,c) = 0.2, indicating that family a is equally similar to family b and family c.
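The two Euclidean distances quoted above can be checked directly from the expanded form of the formula. The function name below is an illustrative choice; the vectors are the three example families from the text.

```python
import math

a = (0.0, 0.1, 0.1, 0.0)
b = (0.1, 0.0, 0.0, 0.1)
c = (0.1, 0.2, 0.2, 0.1)


def euclidean_distance(x1, x2):
    """sqrt(||x1||^2 + ||x2||^2 - 2 x1'x2), the expanded Euclidean distance."""
    dot = sum(u * v for u, v in zip(x1, x2))
    sq = sum(u * u for u in x1) + sum(v * v for v in x2) - 2.0 * dot
    return math.sqrt(max(sq, 0.0))  # guard against tiny negative rounding error


d_ab = euclidean_distance(a, b)
d_ac = euclidean_distance(a, c)
print(round(d_ab, 6), round(d_ac, 6))
```

Both values come out at 0.2 up to floating-point rounding, even though a and b share no aggregating cancer while a and c share two.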

To overcome this limitation of the Euclidean distance, we adopted the extended Jaccard distance [3] as an alternative:

$$d\left( \mathbf{x}_{1}, \mathbf{x}_{2} \right) = 1 - \frac{\mathbf{x}_{1}^{\prime} \mathbf{x}_{2}}{\left\| \mathbf{x}_{1} \right\|^{2} + \left\| \mathbf{x}_{2} \right\|^{2} - \mathbf{x}_{1}^{\prime} \mathbf{x}_{2}}$$

It is bounded between 0 and 1, with 0 representing a perfect match and 1 representing no similarity at all. The extended Jaccard distance overcomes the limitation of the Euclidean distance by comparing features shared by both vectors against features present in only one of the two vectors. As such, it measures similarity of familial aggregation in a more epidemiologically sensible manner, by comparing the weights of aggregating cancers shared by two families against the weights of cancers aggregating in only one of the two families. For the example above, it gives d(a,b) = 1 and d(a,c) = 0.5, suggesting that family a is more similar to family c than to family b.
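A short sketch of the extended Jaccard distance, applied to the same three example families, reproduces these values. The function name is an illustrative choice, and the code assumes that the two vectors are not both all-zero (the denominator would then be 0).

```python
a = (0.0, 0.1, 0.1, 0.0)
b = (0.1, 0.0, 0.0, 0.1)
c = (0.1, 0.2, 0.2, 0.1)


def extended_jaccard_distance(x1, x2):
    """1 - x1'x2 / (||x1||^2 + ||x2||^2 - x1'x2)."""
    dot = sum(u * v for u, v in zip(x1, x2))
    denom = sum(u * u for u in x1) + sum(v * v for v in x2) - dot
    return 1.0 - dot / denom


j_ab = extended_jaccard_distance(a, b)  # no shared aggregating cancers
j_ac = extended_jaccard_distance(a, c)  # two shared aggregating cancers
print(round(j_ab, 6), round(j_ac, 6))
```

For a and c the inner product is 0.04 and the denominator is 0.02 + 0.10 − 0.04 = 0.08, which gives 1 − 0.5 = 0.5; for a and b the inner product is 0, so the distance is 1.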

Cite this article

Chau, R., Jenkins, M.A., Buchanan, D.D. et al. Determining the familial risk distribution of colorectal cancer: a data mining approach. Familial Cancer 15, 241–251 (2016). https://doi.org/10.1007/s10689-015-9860-6

  • DOI: https://doi.org/10.1007/s10689-015-9860-6