Abstract

Elderly people are an asset to the country, and the government can help ensure that they lead peaceful and healthy lives. Life expectancy has increased with technological advancement, and surveys indicate that the elderly population will double by the year 2030. Noninfectious cognitive dysfunction is among the most important risks for elderly people because of the decline in their physiological function. Alzheimer's disease, vascular dementia, and other dementias are the key causes of cognitive disability. These diseases require personal assistance, which is difficult to provide in a fast-moving world. Prevention and early detection are therefore the sensible approach. Diabetes and hypertension are considered the main risk factors associated with Alzheimer's disease. Our proposed work applies a two-stage classification technique to improve prediction accuracy. In the first stage, we train a Support Vector Machine (SVM) and a Random Forest algorithm to analyze the influence of diabetes and high blood pressure on cognitive decline. In the second stage, the cognitive function of persons with possible dementia is assessed using a neuropsychological test called the Cognitive Ability Test (CAT). A Multinomial Logistic Regression algorithm is applied to the CAT results to predict the possibility of cognitive decline in late life. We classify the risk using the operational definitions “No Alzheimer’s,” “Uncertain Alzheimer’s,” and “Definite Alzheimer’s.” In stage 1, the SVM classifier achieves an accuracy of 0.86 and the Random Forest an accuracy of 0.71. In stage 2, the Multinomial Logistic Regression classifier achieves an accuracy of 0.89. The proposed work enables early prediction of persons at risk of Alzheimer's disease using clinical data.

1. Introduction

Physical health and mental health carry equal importance in human life. Elderly people are commonly affected by cardiovascular disease, cancer, diabetes, arthritis, depression, kidney disease, pulmonary disease, dementia, and Alzheimer’s disease. Dementia is a decline in cognitive ability that severely affects daily life. A person suffering from dementia constantly needs someone to help with everyday activities, since the disease affects cognitive function in multiple domains. Alzheimer's Disease (AD) is one of the most common neurodegenerative cortical dementias. This incurable neurodegenerative disorder primarily affects the elderly population. It gradually progresses from mild cognitive impairment to Alzheimer's and other kinds of dementia. The projections are particularly high in Asian countries such as India and China. The rise in AD is proportionate to the growth of the elderly population, and it is estimated that 5% to 7% of elders are affected by dementia. By 2050, 1 in 5 persons in low- and middle-income countries will be above 60 years of age, which may increase the number of people with the disease [1].

Dementia will be an inevitable result of demographic transition, and it involves damage to brain cells. The stages of dementia span from no cognitive decline to severe decline. The different types of dementia include Alzheimer’s Disease (AD), Vascular Dementia (VaD), and Frontotemporal Dementia. The impairment affects the capacity of synapses to communicate with one another, which in turn affects a person’s thinking, emotion, and behavior. Each type of dementia is associated with a specific pattern of brain cell decay in particular brain regions. Elevated levels of specific proteins inside and outside synapses make it difficult for brain cells to remain healthy and to connect with others. The first region to be affected is the hippocampus, which is the center of learning and memory in the brain. This is why cognitive decline is often the initial indication of Alzheimer's Disease. There is no effective treatment available for the disease. The feasible option is to educate the population about the related risk factors and protective factors.

The number of people affected by diabetes is growing rapidly, and it is expected that 640 million people will be affected by the year 2040 [2,3]. According to the World Alzheimer Report 2014, people who had hypertension in midlife (around 40–64 years of age) were more likely to develop vascular dementia in later life [4]. Blocked or reduced blood flow to the brain is a basic characteristic of dementia. Many people with diabetes have brain changes that are a hallmark of Alzheimer’s disease. Hypertension occurs when the force of blood pushing against the walls of the blood vessels is too high, and it damages the heart and blood vessels. This causes the cells to work harder, which makes them less effective. A recent study in the journal Neurology shows that elderly people with higher average BP are more likely to develop tangles and plaques in the brain. There is some evidence for a relation of systolic blood pressure (SBP) with AD, specifically with tangles [5,6].

Multifactor analysis predicts Alzheimer’s disease more precisely by extracting the heterogeneous information present in health records. It is feasible to predict AD using administrative and clinical information rather than images. Machine learning algorithms are well suited to large volumes of health data [7]. The focus of our proposed work is twofold: (i) predicting people likely to develop Alzheimer's in late life through careful analysis of the various risk factors associated with the disease, and (ii) conducting a neuropsychological test called the Cognitive Ability Test (CAT) to assess a person's cognitive decline [8]. The proposed work considers general health data available in the “Data World” repository. We apply a two-stage classifier. In the first stage, a Support Vector Machine and a Random Forest algorithm are used to identify the risk associated with each individual. In the second stage, to enhance prediction accuracy, the CAT is conducted among the people identified by the stage 1 classifier. The CAT consists of simple yes-or-no questions and yields a score ranging from 0 to 30. The CAT results are passed to Multinomial Logistic Regression to classify the severity of the disease. A score between 25 and 30 is classified as “No Dementia,” between 13 and 24 as “Uncertain Dementia,” and less than 13 as “Severe Dementia.” The proposed work combines multiple factors associated with Alzheimer's to predict the possibility of disease more accurately.

The paper is organized as follows. Related work on dementia and Alzheimer’s disease is reviewed in Section 2. Section 3 describes the proposed work, including the relation of Alzheimer's to type 2 diabetes and hypertension in the data set, which supports our claim. The application of multinomial logistic regression to CAT results to enhance the prediction process is also explored there. Section 4 presents the results and relevant discussion. The conclusion of the present work and possible extensions are given in Section 5.

2. Literature Survey

Mild Cognitive Impairment (MCI) can lead to Alzheimer's and various kinds of dementia in later life. The severe intellectual disability of patients with Alzheimer’s disease places a heavy burden on family members and the public. It has physical, psychological, social, and economic impacts. To find the research gap, a careful review was conducted covering aspects such as the cause of disease, the tests applied, clinical diagnosis procedures, statistical techniques, and AI/machine learning techniques. The research findings are tabulated in Table 1.

The summary points to a correlation between cognitive impairment and diseases such as diabetes, hypertension, and depression. A few drugs are also identified as causes of Alzheimer's and related dementia. Various statistical techniques such as ANOVA, the t-test, Kaplan–Meier estimates (a survival estimation function), and QUADAS-2 (assessment of a diagnostic test against a reference standard) are used to analyze the data. The main issue to be addressed when using a statistical tool is the sampling error present in the data set, which can lead to wrong conclusions. Applying a suitable machine learning algorithm can provide a better solution. The extensive review shows the importance of analyzing the cognitive level of the patient and the role of machine learning algorithms in prediction and classification. Our proposed research work considers the diabetes and hypertension details of patients and applies suitable machine learning algorithms and a cognitive ability test to predict the risk of Alzheimer's in a person’s late life.

3. Proposed Work

The proposed model aims at early prediction of cognitive decline using cognitive, clinical, and physical data from a person's history. Dementia has a lengthy preclinical period during which there are no perceptible cognitive impairments but neurodegenerative changes are occurring. Therefore, it is essential to identify individuals at high risk of dementia at an early stage to protect them from the disease in late life [16]. A population-based study of a precise age group supports our understanding with less potential bias. Studies of middle-aged people with chronic diseases are particularly helpful since their chances of developing dementia are higher. Midlife hypertension increases the risk of lacunar infarcts and stroke, which in turn increases the risk of VaD.

Existing systems require imaging data or fluid collection, which delays early detection. The large volume of Electronic Health Records available in structured and unstructured form supports timely diagnosis and decisions. Collecting administrative and electronic medical data requires less time. Making effective use of information and attaining precise outcomes is a major challenge in many fields, particularly in the medical field. Machine learning is used in almost all fields, such as image processing, language automation, computer vision, and e-business. Predictive machine learning models can be applied to these valuable digitized health records for early risk prediction of VaD and AD.

Chronic conditions such as diabetes, high blood pressure, heart problems, and kidney infection are increasing worldwide. Diabetes and blood pressure have been observed to have a strong relation with cognitive decline in elderly people [24,25]. Poor diabetic control and poor adherence to physician instructions are the primary reasons for the elevated risk of AD or dementia in late life [26, 27]. Early detection helps to prevent AD through proper diabetic control, drugs, cognitive training, and so forth. Reference [28] reports a direct connection between glucose dysregulation and neurodegeneration. Diabetes is viewed as a key risk factor for cognitive impairment, and a few investigations show that cognitive dysfunction affects both older and younger persons with diabetes [29, 30]. Patients with type 2 diabetes ought to be routinely assessed for cognitive capacity, since the duration of the disease could be related to a decline in cognition.

Physicians do not screen the cognitive capacity of a patient until they receive a complaint from the patient or the patient’s family. Patients consult a doctor only after the initial symptoms appear. By that time, dementia has progressed to a moderately advanced stage. Timely diagnosis would help patients to slow disease progression. The Cognitive Ability Test (CAT) is a brief neuropsychological screening test that provides an outline of cognitive function. The CAT helps the physician to assess the cognitive function of the patient at an early stage [13]. The proposed work applies a support vector machine algorithm to identify people with a high risk of cognitive impairment in late life, and these people then undergo CAT screening [31, 32]. The test results are analyzed with Multinomial Logistic Regression to classify them as “Severe Dementia,” “Uncertain Dementia,” or “No Dementia.”

3.1. Data and Methodology
3.1.1. Data Set

The primary focus of the proposed work is to provide health care service to the elderly population residing in resource-poor areas. People aged between 40 and 65 years are considered middle-aged in our case study. The proposed work considers hypertension and diabetes as the most common risk factors for cognitive decline. An appropriate data set available in the “Data World” repository is taken and filtered for the required features. A cross-sectional examination provides a practical technique for analyzing the association between diabetes, blood pressure (BP), and the occurrence of dementia. To enhance the analysis, we consider two age classes, namely midlife (<65 years) and late life (>65 years).

The emphasis on midlife is especially relevant for dementia prevention for two reasons. (i) Midlife is sufficiently early to establish an association between risk factors and Alzheimer's before the onset of neurodegeneration. (ii) Several studies have shown a connection between raised BP in midlife (age 40–64 years) and the onset of dementia and AD in late life. This study considers general health data available in “Data World,” the world’s largest collaborative data community. The database consists of 19 features and 2361 patient records, a snippet of which is shown in Table 2.

3.1.2. Feature Selection

The data set contains 19 features describing age, cholesterol, glucose, etc. Since Hemoglobin A1c (HbA1c) is an important measure of long-term glucose control in the body, it was the main feature considered in the early identification of AD [30]. Along with HbA1c, the patient's systolic and diastolic BP were examined. An HbA1c value greater than 6.5 is considered diabetes positive. The data set is separated into two distinct sets by age to assess midlife attributes and their association with AD. Exploratory data analysis is performed to summarize the main characteristics of the data and to find important features with the help of visual aids. Multivariate analysis is performed to understand the different features and their interactions. The list of features and their descriptions is given in Table 3.

The correlation coefficient computed for each pair of features helps to extract the relevant attributes for the study. The highly influential factors, namely age, gender, HbA1c, glucose, systolic pressure, diastolic pressure, and cholesterol, are considered in our case study.
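As an illustration of pairwise correlation screening, the following minimal sketch computes Pearson's correlation coefficient for every pair of features. The column names and toy values are hypothetical and are not taken from the study's data set; the paper does not state which correlation measure or cutoff was used, so Pearson's coefficient is an assumption here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear relationship
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Pairwise correlations for a dict mapping feature name -> list of values."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

# Hypothetical toy values for illustration only.
toy = {
    "glyhb":   [4.6, 5.1, 6.9, 7.8, 9.2],
    "glucose": [82, 97, 150, 180, 250],
    "bp_sys":  [118, 125, 140, 138, 160],
}
matrix = correlation_matrix(toy)
```

Features whose coefficients with the outcome of interest are close to zero would be candidates for removal; where to draw that line is a design choice the text leaves open.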

3.2. Process Flow

The process flow of the proposed model to predict the level of cognitive decline is shown in Figure 1. The diabetes and blood pressure data set collected from “Data World” is preprocessed to remove redundant information and handle missing values. The most influential features are extracted with the help of correlation values. We apply a two-stage classification model to determine cognitive decline more accurately. In the first stage, the selected features are passed to the classifier to identify the associated risk among the population. Support Vector Machine and Random Forest algorithms are used for risk classification. In the second stage, we administer the CAT to the people identified in stage 1. The Multinomial Logistic Regression algorithm examines the CAT results, and medical care is provided for the people predicted as “Severe Alzheimer.”

3.2.1. Algorithm

Step 1: Input: Patients with Blood Pressure, Diabetes dataset.
Filter 40–65 age group data.
Handle inconsistent and missing data.
Output: Preprocessed data.
Step 2: Identify correlation between features of the dataset using multivariate analysis.
Step 3: Do initial classification for Alzheimer's Disease using the stage 1 classifier.
Step 4: If AD is possible, perform the CAT test on those patients.
Step 5: Perform second-level classification to find AD Dementia, Uncertain Dementia, or No Dementia using the stage 2 classifier.
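The control flow of these steps can be sketched as follows. This is a minimal skeleton under stated assumptions, not the authors' implementation: the three callables (`stage1_is_at_risk`, `administer_cat`, `stage2_classify`), the record fields, and the "Not referred" label are hypothetical placeholders, and the correlation analysis of Step 2 is assumed to have been done beforehand when choosing the features.

```python
def two_stage_screening(records, stage1_is_at_risk, administer_cat, stage2_classify):
    """Run the two-stage flow over a list of record dicts.

    All three callables are hypothetical stand-ins for trained models
    and for the questionnaire; they are supplied by the caller.
    """
    outcomes = {}
    for record in records:
        # Step 1: skip incomplete records and keep the 40-65 age group.
        if any(value is None for value in record.values()):
            continue
        if not 40 <= record["age"] <= 65:
            continue
        # Step 3: initial risk classification (stage 1).
        if stage1_is_at_risk(record):
            # Step 4: administer the CAT only to persons flagged in stage 1.
            cat_score = administer_cat(record)
            # Step 5: second-level classification on the CAT score.
            outcomes[record["id"]] = stage2_classify(cat_score)
        else:
            outcomes[record["id"]] = "Not referred"  # placeholder label
    return outcomes
```

Passing the models in as functions keeps the flow independent of whichever classifier is eventually used in each stage.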

3.2.2. Flowchart

The process flow of the proposed model to predict the level of cognitive decline is represented as a flowchart in Figure 2.

3.2.3. Support Vector Machine

SVM is a commonly used supervised classifier that separates data in N-dimensional space using a hyperplane. It has been applied in numerous healthcare applications to predict diseases from structured data [33]. Figure 3 shows the classification graph of SVM.

The line function y = ax + b separates linearly separable data in two dimensions. The SVM generalizes this line equation to a hyperplane, which is then used in the prediction process. The model seeks a good balance between bias and variance across the training and test data sets. Reviews of the literature indicate that the support vector machine performs well in big data and healthcare applications.
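To make the role of the hyperplane concrete, the sketch below shows how a linear SVM that has already been trained assigns a class: it evaluates w·x + b and takes the sign. The weights, bias, and feature order are hypothetical values chosen for illustration; they are not learned from the study's data, and the paper does not state which kernel was used.

```python
def svm_decision(weights, bias, features):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def svm_predict(weights, bias, features):
    """Return +1 (at risk) or -1 (not at risk) from the sign of the score."""
    return 1 if svm_decision(weights, bias, features) >= 0 else -1

# Hypothetical parameters for the features (HbA1c, systolic BP).
weights = [1.0, 0.05]
bias = -13.0
high_risk = svm_predict(weights, bias, [7.2, 150])  # score is about 1.7
low_risk = svm_predict(weights, bias, [5.0, 118])   # score is about -2.1
```

Training consists of choosing the weights and bias so that the margin between the two classes is as wide as possible; that optimization is not shown here.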

3.2.4. Random Forest

The Random Forest algorithm trains n different decision trees on different data subsets and with different tuning parameters [34]. It combines the outputs of all n trees through a voting mechanism and is therefore an ensemble learning method. The working principle of the Random Forest algorithm is depicted in Figure 4.
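The two ingredients named above, training each tree on a different data subset and combining the trees by voting, can be sketched as follows. This is a minimal illustration that assumes bootstrap sampling (drawing with replacement) for the subsets; the tree training itself is omitted, and the labels are example values.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as used to train one tree."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(tree_predictions):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = bootstrap_sample(list(range(10)), rng)
forest_label = majority_vote(["CRD", "NDP", "CRD", "HDP", "CRD"])
```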

3.3. Mental Ability Test

The cognitive decay of a person ranges from mild to severe. The primary causes include medications, blood vessel disorders, depression, and dementia. Dementia represents a severe loss of mental functioning, and its most common type is Alzheimer's. Cognition comprises a blend of brain processes involved in all facets of a person's life, including memory, thinking, language, and the ability to learn new things. A cognitive ability test is performed to examine a person's cognitive impairment. Based on a detailed review of methods for screening cognitive function, we framed multiple questions to check for decline in mental function; the test questionnaire is given in Table 4 [35]. CAT scores from 25 to 30 are considered normal [36]. The CAT items address orientation, memory, attention, recall, naming objects, responding to verbal and written commands, writing a sentence, and copying a figure. An informant accompanied each patient, and the questions were administered to the informant so as not to unduly alarm the patient.

The maximum CAT score is 30 points. A score of 25 to 30 suggests no cognitive decline, 13 to 24 suggests moderate decline, and less than 13 indicates severe cognitive decline. The CAT score of a person with Alzheimer's disease declines by about two to four points per year on average. A snippet of the CAT data set is shown in Table 5.
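The score bands can be written as a small lookup. The sketch below follows the bands stated in the Introduction (25 to 30, 13 to 24, and below 13) and uses the category names from there; the function name is hypothetical.

```python
def classify_cat_score(score):
    """Map a CAT score (0-30) to the severity band used in the text."""
    if not 0 <= score <= 30:
        raise ValueError("CAT score must lie between 0 and 30")
    if score >= 25:
        return "No Dementia"
    if score >= 13:
        return "Uncertain Dementia"
    return "Severe Dementia"
```

For example, a score of 24 falls in the “Uncertain Dementia” band and a score of 12 in the “Severe Dementia” band.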

3.3.1. Multinomial Logistic Regression on CAT

The multinomial logistic regression model is applied to predict the severity of illness, where the dependent variable takes the values “Severe Dementia,” “Uncertain Dementia,” and “No Dementia.” Multinomial logistic regression is applicable to classification problems with more than two outcomes. Our proposed model has three outcomes. For N different outcomes, N − 1 models are developed as a set of independent binary regressions. One outcome is referred to as the pivot class, and the others are regressed against this reference class.

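The probabilities for the N categories are estimated from the explanatory variables. With the Nth category taken as the pivot class,

$$P(Y = k \mid X) = \frac{e^{\beta_k \cdot X}}{1 + \sum_{j=1}^{N-1} e^{\beta_j \cdot X}}, \qquad k = 1, \ldots, N-1,$$

$$P(Y = N \mid X) = \frac{1}{1 + \sum_{j=1}^{N-1} e^{\beta_j \cdot X}},$$

where Y is the dependent variable, X is the set of explanatory variables, and $\beta_k$ is the regression coefficient vector for the kth category of Y.

Based on the estimated probabilities, the algorithm assigns the output category with reference to a threshold.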
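The pivot-class formulation can be evaluated directly once the coefficients are known. The following minimal sketch computes the N category probabilities from N − 1 coefficient vectors, with the pivot class last. The coefficient values are hypothetical and are not fitted to the CAT data, and picking the most probable category is an assumption standing in for the thresholding rule, which the text does not specify.

```python
import math

def multinomial_probabilities(coefficients, x):
    """Probabilities for N classes from N-1 coefficient vectors.

    Each vector is [intercept, b1, b2, ...]; the pivot class has an
    implicit all-zero vector and its probability is returned last.
    """
    features = [1.0] + list(x)
    scores = [sum(b * f for b, f in zip(beta, features)) for beta in coefficients]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / denom for s in scores]
    probs.append(1.0 / denom)  # pivot class
    return probs

def predict_category(coefficients, x, labels):
    """Return the label with the highest estimated probability."""
    probs = multinomial_probabilities(coefficients, x)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best]

# Hypothetical coefficients with the CAT score as the only explanatory variable.
labels = ["Severe Dementia", "Uncertain Dementia", "No Dementia"]  # pivot last
coefficients = [[10.0, -0.6], [6.0, -0.25]]
```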

4. Experiments and Results

4.1. Stage 1 Classifier

In stage 1, we train SVM and Random Forest algorithms to diagnose chronic disease and identify the associated risk. The data set contains 2361 records of middle-aged people. Glycated hemoglobin (glyhb) values between 4% and 5.6% are considered normal, and values between 5.7% and 6.4% indicate a higher chance of developing diabetes. Values above 6.5% indicate diabetes. The patient’s systolic and diastolic pressure are the other important factors considered in the development of Alzheimer's. Systolic pressure below 120 mm Hg and diastolic pressure below 80 mm Hg are considered normal, and the ranges 120–139 mm Hg systolic and 80–89 mm Hg diastolic are considered prehypertension. Persons with systolic pressure >140 mm Hg and diastolic pressure >90 mm Hg are considered to have hypertension. The chosen data set contains 526 records of persons with neither diabetes nor high blood pressure, 1187 records of persons with either diabetes or high blood pressure, and 648 records of patients with both. Since the presence of either condition increases the chance of dementia, all 1835 patients having diabetes, high blood pressure, or both are given the CAT to assess their cognitive ability. The details of the records are visually represented in Figure 5.
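The grouping of records by these cutoffs can be sketched as below. The group labels NDP, CRD, and HDP follow the abbreviations used in Section 4.2 (neither condition, either condition, both conditions). Two points are assumptions of this sketch rather than statements from the text: a person is treated as hypertensive when either the systolic or the diastolic value exceeds its cutoff, and the boundary values themselves (6.5%, 140, 90) are treated as not exceeding the cutoff.

```python
def has_diabetes(glyhb):
    """Glycated hemoglobin above 6.5% is treated as diabetes positive."""
    return glyhb > 6.5

def has_hypertension(bp_sys, bp_dia):
    """Assumption: either reading above its cutoff counts as hypertension."""
    return bp_sys > 140 or bp_dia > 90

def risk_group(glyhb, bp_sys, bp_dia):
    """Return 'NDP' (neither), 'CRD' (either), or 'HDP' (both)."""
    diabetic = has_diabetes(glyhb)
    hypertensive = has_hypertension(bp_sys, bp_dia)
    if diabetic and hypertensive:
        return "HDP"
    if diabetic or hypertensive:
        return "CRD"
    return "NDP"

def referred_to_cat(group):
    """Persons with either or both conditions are given the CAT."""
    return group != "NDP"
```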

The performance of the proposed classifier model is analyzed using the confusion matrix. The total accuracy, sensitivity, and specificity are calculated using the following formulas.

(i) Total Accuracy (TC)

$$\mathrm{TC} = \frac{TP + TN}{TP + TN + FP + FN}$$

(ii) Sensitivity (SN)

Sensitivity, or True Positive Rate, reveals how effectively the model identifies the actual positives present in the data set:

$$\mathrm{SN} = \frac{TP}{TP + FN}$$

(iii) Specificity

Specificity, or True Negative Rate, reveals how effectively the model identifies the actual negatives:

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

The terms TN, TP, FP, and FN denote True negative (person with no chance of Alzheimer is identified as ‘No Alzheimer’), True positive (person subjected to Alzheimer is predicted as ‘Alzheimer’), False Positive (healthy person is detected as ‘Alzheimer’), and False Negative (person suffering from Alzheimer is identified as healthy), respectively.
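Given the four counts defined above, the three measures follow directly from their standard definitions. The sketch below is a minimal illustration; the counts in the example are hypothetical and are not results from the study.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

# Hypothetical counts for illustration only.
example = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```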

4.2. Performance Analysis of Stage 1 Classifier

The proposed SVM classifier performs best, with an AUC of 0.90, while the Random Forest achieves an AUC of 0.74. The probabilistic classifier shows the tradeoff between sensitivity and specificity. NDP represents patients with No Diabetes and No Pressure, CRD represents patients having either diabetes or pressure, and HDP represents patients Having both Diabetes and Pressure. The performance comparison of the SVM and Random Forest algorithms for each measure is given in Table 6 and is visually represented in Figures 6–8.

In the Random Forest algorithm, it is important to consider the subsampling of data points in the tree construction process. Too much subsampling or no subsampling gives inconsistent results. The accuracy of the Random Forest algorithm can be improved by tuning its parameters. Because a sufficiently large sample data set was unavailable, it was not possible to fine-tune the RF parameters in our case study.
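For reference, the AUC of a binary classifier can be computed as the fraction of positive-negative pairs in which the positive example receives the higher score, with ties counted as half. The sketch below shows this pairwise form for the two-class case; how the study extended the AUC to its three risk groups is not stated, and the scores and labels here are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (labels are 0 or 1)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical classifier scores: three of the four pairs are ranked correctly.
toy_auc = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```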

4.3. Stage 2 Classifier

The CAT result data set covers ages from a minimum of 47 years to a maximum of 96 years. The Clinical Dementia Rating (CDR) is a numeric scale used to quantify the severity of dementia symptoms; its score ranges from 0 (none) to 3 (severe). A summary of the CAT data set is provided in Table 7 and is visually represented in Figure 9.

The confusion matrix of the multinomial logistic regression is considered for analysis. The True Positive Rates (TPR) for the three output classes are $\mathrm{TPR}_{\text{No Alzheimer}} = 98\%$, $\mathrm{TPR}_{\text{Uncertain Alzheimer}} = 85\%$, and $\mathrm{TPR}_{\text{Alzheimer}} = 81\%$. The True Negative Rates (TNR) are $\mathrm{TNR}_{\text{No Alzheimer}} = 86\%$, $\mathrm{TNR}_{\text{Uncertain Alzheimer}} = 73\%$, and $\mathrm{TNR}_{\text{Alzheimer}} = 75\%$. The results show that the model can predict with improved accuracy provided that an ample amount of training data is available.
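Per-class rates of this kind are usually obtained from a multi-class confusion matrix by treating each class in turn as positive and the remaining classes as negative. The sketch below assumes that one-vs-rest convention, which the text does not state explicitly; the 3×3 matrix is hypothetical and does not reproduce the study's results.

```python
def per_class_rates(matrix, labels):
    """One-vs-rest TPR and TNR; matrix[i][j] counts true class i predicted as j."""
    total = sum(sum(row) for row in matrix)
    rates = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                 # true class i, predicted otherwise
        fp = sum(row[i] for row in matrix) - tp  # other classes predicted as i
        tn = total - tp - fn - fp
        rates[label] = {"tpr": tp / (tp + fn), "tnr": tn / (tn + fp)}
    return rates

# Hypothetical confusion matrix for illustration only.
labels = ["No Alzheimer", "Uncertain Alzheimer", "Alzheimer"]
toy_matrix = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
toy_rates = per_class_rates(toy_matrix, labels)
```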

4.4. Analysis with Bench Mark Models

The study of neurodegenerative diseases caused by the ageing of brain systems necessitates brain banking. The Brazilian Aging Brain Study Group’s Brain Bank collects a large number of elderly brains and their related disorders. It encourages researchers to look at a variety of aspects of ageing brain processes and related neurodegenerative illnesses.

Table 8 compares the performance of our proposed model with existing benchmark models. The table details the data sets used by different authors and the machine learning algorithms employed. Reference [13] considers people above 65 years of age from 2002 to 2010. The main features considered in that study include International Classification of Diseases (ICD-10) codes, laboratory results, medication codes, sociodemographics, and the illnesses of a person and his family [37]. The authors trained and tested random forest, logistic regression, and SVM models to predict Alzheimer's incidence in 1, 2, 3, and 4 subsequent years. For comparison, the average over the 4 years is taken. The Alzheimer's Disease Prediction Of Longitudinal Evolution (TADPOLE) Challenge is organized in association with the Alzheimer’s Disease Neuroimaging Initiative (ADNI) to find people at risk of Alzheimer's. Historical measurements of the participants are used to predict future outcomes. The TADPOLE challenge facilitates early identification of Alzheimer's disease with the help of appropriate algorithms. Reference [14] used data from the TADPOLE grand challenge and compared its result with a benchmark SVM, which produces an AUC of 62% and a classification accuracy of 52%. Reference [15] collected records from the Brain Bank of the Brazilian Aging Brain Study Group between 2004 and 2015. Among 1,037 subjects, diabetes was present in 279 participants (27%). Based on multivariate analysis, they found no association between diabetes and dementia (OR = 1.22; 95% CI = 0.81–1.82).

5. Conclusion

Automated healthcare techniques support physicians in making correct decisions on patient care in resource-poor rural areas. Timely identification of risk factors with the help of AI-based models helps safeguard a person from late-life Alzheimer's. Obtaining an appropriate data set with relevant attributes is a cumbersome part of developing a more accurate model. The proposed method supports the statistically significant diagnosis of persons at risk for Alzheimer's disease based simply on administrative health records. It allows earlier and more accurate screening for further clinical testing. Our proposed work analyzes the influence of hypertension and diabetes on Alzheimer's disease. The Support Vector Machine algorithm is more suitable when the data set is not continually distributed. The performance of SVM is relatively good due to its convex optimization nature. A survey of cognitive assessment conducted among the population with chronic disease indicates the degree of cognitive decline in the community. The CAT results are analyzed with multinomial logistic regression to identify the possibility of Alzheimer's in a patient’s late life. To achieve optimum model accuracy, a large sample size is essential. In the future, the proposed work may be extended with more classifiers by accumulating a larger volume of samples and an increased number of CAT surveys. A time-series survey of CAT results among the population will further improve the precision of prediction.

Data Availability

The datasets used and/or analyzed during the current study are available in the following repository: https://staff.pubhealth.ku.dk/~tag/Teaching/share/data/Diabetes.html.

Conflicts of Interest

The authors declare that they have no conflicts of interest to report regarding the present study.

Authors’ Contributions

A. Revathi was responsible for conceptualization, data curation, formal analysis, methodology, software, and writing–original draft; R. Kala Devi was responsible for supervision, writing–review and editing, project administration, and visualization; Kadiyala Ramana was responsible for software, validation, writing–original draft, methodology, and supervision; Rutvij H. Jhaveri was responsible for supervision, writing–review and editing, and visualization; Madapuri Rudra Kumar was responsible for data curation, investigation, resources, and software; M. Sankara Prasanna Kumar was responsible for visualization, investigation, formal analysis, and software.