3.1 Cluster analysis
Previous studies recommend that, rather than relying on a single clustering method, the appropriate number of clusters first be estimated with a hierarchical method and then finalized with a non-hierarchical method [22, 30]. A non-hierarchical method is difficult to apply when the initial number of clusters is unknown, so hierarchical clustering was run first to find a plausible number of clusters [31], and the elderly sports participants were then divided using a non-hierarchical method. The demographic characteristics and exercise practice behavior of elderly sports participants were selected as the reference variables for clustering. Before the analysis, the demographic variables (gender, age, educational background, marital status, number of household members, children, income) and sports practice variables (exercise frequency, health status recognition, sports facility awareness, sport for all course training experience, exercise prescription service, accompanying participants, club membership, and activity) were converted to standard scores (Z scores). For the hierarchical cluster analysis, the distances and averages among clusters were examined in the dendrogram, and a range of 4 to 6 clusters was judged appropriate.
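The standardization and hierarchical step described above can be illustrated with a minimal sketch. It assumes average linkage and Euclidean distance, neither of which is stated in the text, and it uses a made-up six-person sample with only two variables; the function names (`z_scores`, `average_linkage_heights`, `suggest_k`) are hypothetical. The "largest jump in merge height" rule is one common way of reading a dendrogram, not necessarily the criterion used in this study.

```python
import math
from statistics import mean, pstdev

def z_scores(column):
    """Standardize one variable to mean 0 and standard deviation 1."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]

def average_linkage_heights(points):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    Returns the distance (dendrogram height) at which each merge happened.
    Average linkage is an assumption made for this sketch.
    """
    clusters = [[i] for i in range(len(points))]
    heights = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = mean(math.dist(points[a], points[b])
                         for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        heights.append(d)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return heights

def suggest_k(heights):
    """Cut the dendrogram at the largest jump between successive merges."""
    gaps = [heights[m + 1] - heights[m] for m in range(len(heights) - 1)]
    m = max(range(len(gaps)), key=gaps.__getitem__)
    # n points give n - 1 merges; after merge m (0-based), n - (m + 1) clusters remain.
    return len(heights) - m

# Hypothetical toy data: age and weekly exercise frequency for six people.
age = [62, 64, 71, 73, 80, 82]
frequency = [1, 1, 3, 3, 5, 5]
points = list(zip(z_scores(age), z_scores(frequency)))

heights = average_linkage_heights(points)
k = suggest_k(heights)
print("merge heights:", [round(h, 2) for h in heights])
print("suggested number of clusters:", k)
```

With real data, the same idea yields a range of plausible cluster counts (here, 4 to 6 in the study) rather than a single value, which is why a second, non-hierarchical step follows.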
Next, K-means cluster analysis, a non-hierarchical method, was conducted over the identified range. K-means makes it relatively easy to process large-scale data because the researcher designates the reference variables and the number of clusters in advance [32-34]; in this study, the number of clusters was set to 4, 5, and 6 based on the results of the hierarchical cluster analysis. With four clusters, the difference among clusters in sports facility awareness was not significant (F=2.274, p>.05), so the four-cluster solution was not appropriate. With five clusters, the differences were significant for all items, but the number of cases per cluster was uneven (cluster 1: 172, cluster 2: 138, cluster 3: 161, cluster 4: 709, cluster 5: 590), so a six-cluster solution was also examined. Comparing the two solutions, the distances between cluster centers were more stable with five clusters, and five clusters were therefore adopted as the final solution (see Table 2).
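To make the K-means step concrete, the following is a minimal sketch of Lloyd's algorithm together with the two diagnostics discussed above, cluster sizes and distances between cluster centers. The data points and the starting centers are invented for illustration, and the function name `kmeans` and its parameters are hypothetical; the study's actual software and initialization are not specified in the text.

```python
import itertools
import math

def kmeans(points, centers, iterations=100):
    """Lloyd's algorithm: assign each point to its nearest center, then
    move each center to the mean of its members, until nothing changes."""
    labels = []
    for _ in range(iterations):
        labels = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(centers[c])  # leave an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Hypothetical standardized data (two variables) and hand-picked starting centers.
points = [(-1.2, -1.0), (-1.0, -1.1), (-1.1, -0.9),
          (0.0, 0.1), (0.1, 0.0),
          (1.1, 1.0), (1.0, 1.2), (1.2, 1.1)]
seeds = [points[0], points[3], points[5]]

labels, centers = kmeans(points, seeds)
sizes = [labels.count(c) for c in range(len(centers))]
center_distances = [math.dist(a, b) for a, b in itertools.combinations(centers, 2)]

print("cluster sizes:", sizes)
print("distances between centers:", [round(d, 2) for d in center_distances])
```

Running the same routine for each candidate number of clusters and comparing the resulting sizes and center distances mirrors the comparison of the 4-, 5-, and 6-cluster solutions described in the text.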
3.2 Artificial neural network model
The artificial neural network model was applied as follows. First, the algorithm applied an equation for prediction. Second, for parameter estimation, the data were split into a 70% training set and a 30% test set. Third, the sigmoid function was used for training. It is an activation function, commonly used in artificial neural networks to model non-linear relationships, that collects the signal strengths from multiple neurons and converts them into a value close to 1 as the combined signal rises above 0 and close to 0 as it falls below 0 [35]; the weights were set to .9 to limit the demand for infinitely large weight values [36]. Fourth, the learning rate (eta) controls how much the weights are adjusted each time the network, iterating repeatedly, moves in the direction that better fits the target variable; it was fixed at .3, the most commonly used value [37]. Fifth, the number of neurons in the hidden layer was determined by comparing the results obtained with 1, 2, 3, 4, 8, 16, and 32 hidden-layer nodes. In general, three rules are used to determine the number of hidden-layer neurons. First, “the number of hidden layer neurons is 2/3 of the size of the input layer” [38]. Second, “The number of neurons in the hidden layer must be less than twice the number of neurons in the input layer” [22]. Third, “The size of the hidden layer neuron is between the input layer size and the output layer size” [39]. Given that the input layer had fourteen nodes and the output layer had two, three hidden-layer neurons were identified as the most suitable, and this setting was used for all clusters. These steps were applied to analyze the artificial neural network model for each cluster.
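The sigmoid activation, the role of the learning rate, and the rules of thumb for hidden-layer size can be sketched in a few lines. This is an illustration only: the single-neuron update with a squared-error loss is an assumption chosen for brevity, not the study's training procedure, and the names `hidden_size_guidelines` and `update_weights` are hypothetical. Treating "between" in the third rule as inclusive is also an assumption.

```python
import math

def sigmoid(x):
    """Squash a summed input signal into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def hidden_size_guidelines(n_inputs, n_outputs):
    """Express the three quoted rules of thumb as numbers."""
    return {
        "two_thirds_of_input": n_inputs * 2 / 3,
        "upper_bound_exclusive": 2 * n_inputs,
        "between": (min(n_inputs, n_outputs), max(n_inputs, n_outputs)),
    }

def update_weights(weights, inputs, target, eta=0.3):
    """One gradient-descent step for a single sigmoid neuron (squared error).

    eta is the learning rate: it scales how far the weights move per step.
    """
    output = sigmoid(sum(w * x for w, x in zip(weights, inputs)))
    delta = (target - output) * output * (1.0 - output)
    return [w + eta * delta * x for w, x in zip(weights, inputs)]

# Fourteen input nodes and two output nodes, as reported in the text.
rules = hidden_size_guidelines(14, 2)
candidates = [1, 2, 3, 4, 8, 16, 32]
low, high = rules["between"]
within = [h for h in candidates
          if h < rules["upper_bound_exclusive"] and low <= h <= high]

print("sigmoid(-4), sigmoid(0), sigmoid(4):",
      round(sigmoid(-4), 3), sigmoid(0), round(sigmoid(4), 3))
print("guidelines:", rules)
print("candidates within the second and third rules:", within)

# Hypothetical example: one update with eta = .3 moves the output toward the target.
new_weights = update_weights([0.0, 0.0], [1.0, 0.5], target=1.0)
print("updated weights:", [round(w, 5) for w in new_weights])
```

A larger eta would move the weights further on each step, which can speed up learning but may overshoot; a fixed value such as .3 is a simple compromise.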
The resulting possibilities of medical cost reduction were 60.45% for cluster 1, 79.1% for cluster 2, 66.8% for cluster 3, 68.3% for cluster 4, and 61.3% for cluster 5, with cluster 2 the highest (see Table 3).
3.3 Application of logistic regression analysis
Logistic regression analysis was performed alongside the artificial neural network model to analyze the classification accuracy rate for medical cost reduction in each cluster. Because the medical cost reduction effect was coded as a binary variable (high group=1, low group=2), it follows a binomial distribution rather than the normal distribution assumed in general regression analysis. Like the artificial neural network model, logistic regression does not directly predict whether the medical cost reduction effect is negative or positive; rather, it gives the probability with which cases are classified into the low and high groups. Model suitability was evaluated with the -2 log-likelihood (the lower, the better), Cox and Snell (the closer to 0, the better), the standard error (the lower, the better), and the Hosmer and Lemeshow test (a non-significant result indicating a better model). The final classification accuracy rate was then analyzed.
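As a rough illustration of how a fitted logistic model leads to a -2 log-likelihood and a classification accuracy rate, consider the sketch below. The intercept, coefficients, and four cases are invented, the 0.5 cutoff is an assumption, and the low group (coded 2 in the study) is recoded to 0 for convenience; the function names are hypothetical.

```python
import math

def logistic_probability(intercept, coefficients, features):
    """Probability of belonging to the high group for one case."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

def minus_two_log_likelihood(probabilities, outcomes):
    """-2LL for binary outcomes (1 = high group, 0 = low group); lower is better."""
    log_lik = sum(math.log(p) if y == 1 else math.log(1.0 - p)
                  for p, y in zip(probabilities, outcomes))
    return -2.0 * log_lik

def classification_accuracy(probabilities, outcomes, cutoff=0.5):
    """Share of cases whose predicted group matches the observed group."""
    predicted = [1 if p >= cutoff else 0 for p in probabilities]
    return sum(int(a == b) for a, b in zip(predicted, outcomes)) / len(outcomes)

# Hypothetical fitted model and standardized cases (not taken from the study).
intercept, coefficients = -0.5, [1.2, 0.8]
cases = [(1.0, 0.5), (-1.0, -0.5), (0.5, 1.0), (-0.5, 0.2)]
groups = [1, 2, 1, 1]                           # study coding: high = 1, low = 2
outcomes = [1 if g == 1 else 0 for g in groups]  # recoded: high = 1, low = 0

probabilities = [logistic_probability(intercept, coefficients, c) for c in cases]
neg2ll = minus_two_log_likelihood(probabilities, outcomes)
accuracy = classification_accuracy(probabilities, outcomes)

print("probabilities:", [round(p, 3) for p in probabilities])
print("-2 log-likelihood:", round(neg2ll, 3))
print("classification accuracy:", accuracy)
```

In this toy example three of the four cases are classified correctly; the cluster-specific accuracy rates in the study are the same quantity computed on each cluster's cases.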
Cluster-specific classification accuracy rates for medical cost reduction were as follows: 64.0% for cluster 1, 74.6% for cluster 2, 70.2% for cluster 3, 67.4% for cluster 4, and 59% for cluster 5. Both models identified cluster 2 as the group with the highest possibility of reducing medical expenses (see Table 4).
3.4 Understanding cluster characteristics
To analyze the characteristics of cluster 2, which had the highest possibility of medical cost reduction, it was compared with the other clusters using chi-square tests and one-way ANOVA. There were significant differences in the demographic and exercise practice variables (p<.001). In cluster 2, 61.6% were women, 39.1% were in their 60s, and 54.3% were high school graduates. Further, 87.7% were married, 57.2% lived in a two-person household, and 57.2% had two children. For 35.5%, income was between 2.8 and 3.6 thousand dollars. Further, 30.4% exercised more than thrice a week, 52.9% considered themselves healthy, and 97.8% were aware of the surrounding sports facilities. In addition, 81.9% had experience of sports courses, and 91.3% had experience using exercise prescription services. Finally, 36.2% exercised alone, and 42.8% had joined clubs (see Table 5).
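The chi-square comparison between clusters on a categorical variable can be sketched as follows. The contingency table is a made-up 2 x 2 example (two groups by two categories), not the study's data, and the function name `chi_square_independence` is hypothetical. Only the test statistic and degrees of freedom are computed; the p-value would come from a chi-square table or a statistics package.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    Rows are groups (e.g. clusters) and columns are categories (e.g. gender).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if group and category were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Hypothetical counts: rows = two groups, columns = two categories.
table = [[30, 10],
         [20, 40]]
statistic, dof = chi_square_independence(table)
print("chi-square:", round(statistic, 3), "df:", dof)
# With 1 degree of freedom, a statistic above 3.841 is significant at the .05 level.
```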
Next, one-way ANOVA was conducted to understand the differences in demographic characteristics and exercise practice behavior among the clusters, and significant differences were identified (p<.05). In cluster 2, experience of sports courses, use of exercise prescription services, and club membership and activities were significantly higher than in the other clusters. Based on a comparison of these results with the demographic characteristics and exercise behavior variables, cluster 2 was named “A group of married women in their 60s who actively participated in sports.”
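The F statistic behind a one-way ANOVA can be worked through in a short sketch. The three groups below are invented scores standing in for one variable measured in three clusters, and the function name `one_way_anova_f` is hypothetical; the sketch stops at the F value and its degrees of freedom and does not compute a p-value.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic and degrees of freedom for a one-way ANOVA.

    F is the ratio of between-group variance to within-group variance.
    """
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical scores for one variable in three clusters.
groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat, df_between, df_within = one_way_anova_f(groups)
print("F =", f_stat, "with df =", (df_between, df_within))
```

A large F relative to the critical value for the given degrees of freedom indicates that at least one cluster mean differs from the others, which is the sense in which the clusters are said to differ significantly.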
Cluster 1 was a group of women with low income who lived alone. It was named “A group of women in their 70s, living alone.” Cluster 3 was a group of married men in their 60s with high income who participated in sports less than once a week. It was named “A group of married men in their 60s with insufficient exercise.” Cluster 4 was a group of married women in their 60s who exercised more than thrice a week. It was named “A group of married women in their 60s who exercised regularly.” Cluster 5 was a group of married women in their 70s who exercised more than thrice a week. It was named “A group of married women in their 70s who exercised regularly.”