Abstract

Gender prediction has been extensively studied in recent years because it is applied in many fields. Several modalities have been investigated for determining whether a person is male or female, including facial images, voice, gait, and fingerprints. In this study, we present a machine learning approach for gender determination based on Vietnamese names. We propose a model based on N-gram features of the full name, combined with a dedicated middle-name feature that reflects the specificity of the Vietnamese language. The approach is evaluated on the GenderVN1.0 dataset (3 million Vietnamese names), where it achieves an accuracy of 90.9%.

1. Introduction

Gender prediction is one of the most important problems in machine learning, with various applications in marketing, advertising, e-commerce, security, and human behavior analysis [1, 2]. Many studies address gender identification based on facial images [3, 4, 9], gait [5], social media [6–8], ear images [10], and text [11]. In recent years, gender identification based on people’s names has received considerable attention [12–14].

Gender identification based on names is a subtopic of natural language processing and text mining research. It can be applied in many areas such as contextual advertising, question answering systems, chatbots, and machine translation [15, 16]. In marketing, identifying the gender of a customer correctly allows products to be proposed to the right audience. It can also reduce the time users spend filling in their information in a system. Gender prediction further helps systems such as customer management systems, e-commerce platforms, and social websites to detect and prevent fraudulent declarations. Several online API services have been proposed to predict gender from English names, such as Gender-API (https://gender-api.com/) and Genderize (https://genderize.io/). For question answering systems, chatbots, and machine translation, knowing the gender of the user makes the interaction with customers more natural and human-like.

Gender determination based on text was first investigated for authorship identification. For example, Cheng et al. [17] predicted the gender of the author of content-free text by proposing 545 psycho-linguistic and gender-based features. These features are then fed to different classifiers (AdaBoost decision tree, Naïve Bayes, and SVM) for gender prediction. Gender determination based on the full name by machine learning methods has been considered for different languages, such as Russian [18, 19], Indonesian [14], Chinese [12], Arabic [20], English [21–23], Kannada [24], Brazilian [25], Thai [26], and Bengali [27].

Several approaches to gender identification have been proposed, such as using a dictionary of names [18], rule-based methods [28], and deep learning [14]. Some studies use machine learning models [7, 24] together with features extracted from the full name. Since languages differ, the extraction of these features also depends on the characteristics of each language. In addition to using a dictionary of full names, Panchenko and Teterin [18] incorporate word-ending features and character N-grams for the Russian language. The results were quite satisfactory, with an accuracy of 96%. Tang et al. [29] introduce an approach for gender inference and behavior on Facebook using usernames. They investigated nearly 1.7 million users in New York City by combining various attribute pairs, and their experiments achieve an accuracy of 95.2%. Mueller and Stumme [13] use machine learning methods for gender identification based on statistics of name properties. The authors built a classification model called NamChar based on selected features such as number of syllables, number of consonants, number of vowels, vowel brightness, and ending character. The study shows that the NamChar model is more efficient than the use of the gender score with an accuracy of 70.9%, and that this model is particularly efficient for unknown names. Jia and Zhao [12] focus on simple Chinese features combined with phonetic information (Pinyin and Hanzi). These features are then combined with Chinese word embeddings based on the pretrained BERT model. This work achieved an accuracy of 93.5%.

Feature extraction using N-grams is common in text mining and natural language processing (NLP) tasks. For gender identification, N-grams have also been applied in many works [14, 30, 31], whose authors use N-gram features as the basic features of the machine learning model. In this work, we also focus on using N-grams for feature extraction, combined with term frequency (TF) features for the Vietnamese middle name.

The rest of this paper is organized as follows. Section 2 presents the methodology for feature extraction from Vietnamese names. Section 3 describes the dataset and our experiments in detail. Finally, Section 4 concludes the paper.

2. Methodology

2.1. Brief introduction of Vietnamese Name

Vietnam has 54 ethnic groups in total, of which the Kinh are the majority with nearly 86% of the population [32]. The personal names of each ethnic group differ because each group has its own language [33]. In this work, we focus on the full names of the Kinh people. The Vietnamese language is tonal. Therefore, names with the same spelling but different tones have different meanings. This can confuse people when the accent marks are dropped [34]. Vietnamese personal names generally consist of three parts as follows:
(1) A family name or last name
(2) One or more middle name(s) (one of which may be taken from the mother’s family name)
(3) A given name or first name

Most Vietnamese have one middle name, but it is quite possible to have two or more of them or to have no middle name at all. A Vietnamese full name must be arranged in that order. This rule is officially used for administration and daily life. Moreover, a woman’s name is not changed after marriage, in contrast with other countries such as the UK, US, and China. Let us take an example of a full name: Ngô Đăng Hưng. In this case, Ngô is the family name, or what we call the last name, Đăng is the middle name, and Hưng is the given or first name. The given name, which appears last, is the name used to address someone, preceded by the appropriate title. In formal usage, Ngô Đăng Hưng, for example, is referred to by his given name (“Mr. Hưng”), not by his family name (“Mr. Ngô”). To better illustrate this structure, we present and analyze several names in Table 1.
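The three-part structure described above can be sketched in a few lines of Python. This is only an illustration, not the authors' preprocessing: the `VietnameseName` type and the `split_name` helper are hypothetical names, and the sketch assumes that the family name is always the single first word, which does not hold for compound family names.

```python
from typing import List, NamedTuple


class VietnameseName(NamedTuple):
    """Hypothetical container for the three parts of a full name."""
    family: str
    middle: List[str]  # zero, one, or several middle names
    given: str


def split_name(full_name: str) -> VietnameseName:
    """Split a name written in family-middle-given order.

    Assumption: the first word is the family name and the last word
    is the given name; everything in between is a middle name.
    """
    words = full_name.split()
    if len(words) < 2:
        raise ValueError("expected at least a family name and a given name")
    return VietnameseName(words[0], words[1:-1], words[-1])


parts = split_name("Ngô Đăng Hưng")
print(parts.family, parts.middle, parts.given)
```

A name without a middle name, such as "Phạm Cúc", yields an empty `middle` list under the same assumption.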

It is estimated that around 100 family names are in common use, but some are far more common than others. The name Nguyễn is estimated to be used by almost 40% of the Vietnamese population [35]. Vietnamese naming is also rich and varied. Names can carry deep connotations, such as Nguyễn Hòa Bình and Trần Hạnh Phúc (expressing peace and happiness), or they can simply rhyme with a parent’s name or be taken from a flower (Phạm Cúc (daisy), Xuân Lan (orchid), Thu Hồng (rose), ...). Although Vietnamese names are not restricted, several conventions are usually observed, such as avoiding the names of relatives (in both the previous and the next generation), not giving male names to women and vice versa so that gender is easy to distinguish, and not giving names that are unpleasant, superstitious, fanatical, or superficial, such as champion and hero.

Additionally, some given names are used for only one gender, and in these cases the gender is clearly identified. However, a number of given names are used for both males and females, so the given name alone is sometimes not enough to distinguish the gender. When such names are combined with middle names, the gender can be identified more easily. For example, the name “An” can denote both genders, but combined with a middle name, “Thành An” and “Thiên Ân” indicate the female gender, whereas “Bình An” and “Mạnh Ân” indicate the male gender. In addition, some middle names themselves indicate gender, so gender can also be identified through the middle name. For example, when the word “thị” or “thúy” appears in the middle name, the gender is female. When the middle name is a word related to “literary” or “strong,” the gender is male. However, the naming formula “văn for men” and “thị for women” seems to have changed a little nowadays.

2.2. Classifiers

Previous works [36–39] presented comparative studies of classifiers for NLP tasks. Following them, we apply three well-known classifiers, namely, logistic regression, Naïve Bayes, and random forest.
(i) The Naïve Bayes classifier is based on Bayes’s theorem from probability theory. It relies on probability and statistics to make predictions or classifications and is often used for text classification, spam filtering, and emotion recognition [40, 41].
(ii) A decision tree is a hierarchical structure used to classify objects based on a series of rules; its output is determined by the answers to a sequence of questions. It can be applied to both regression and classification problems. ID3, C4.5, J48, and CART (classification and regression trees) are algorithms extended from the decision tree algorithm [42]. The random forest algorithm is an extension of bagging approaches that combines many decision trees into a single model. Each decision tree of the forest is built from a random subset of features and has access only to a random set of the training data points.
(iii) Logistic regression analyzes the relationship between the dependent variable and one or more independent variables by estimating probabilities with the logistic (sigmoid) function. Like the linear regression model, it estimates coefficients from the training dataset to minimize the error between the real output and the predicted output. This classifier has also been applied to predict gender from English names in [23].
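To make the Naïve Bayes idea concrete, the following is a minimal sketch of a multinomial Naïve Bayes classifier over name tokens, written with the Python standard library only. It is not the implementation used in this study: the function names `train_nb` and `predict_nb`, the add-one (Laplace) smoothing, and the four toy samples are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(samples):
    """Collect the counts needed for prediction.

    samples: list of (tokens, label) pairs.
    """
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in samples:
        label_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, token_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest posterior log-probability."""
    label_counts, token_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # Log prior plus log likelihood of each token, with add-one smoothing.
        score = math.log(count / total)
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokens:
            score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration only.
toy = [
    (["thị", "tuyết"], "female"),
    (["thị", "lan"], "female"),
    (["văn", "hưng"], "male"),
    (["đăng", "hưng"], "male"),
]
model = train_nb(toy)
print(predict_nb(model, ["thị", "hồng"]))
```

The "naïve" part is the independence assumption: the score of a name is simply the sum of the per-token log likelihoods, so a strongly gendered token such as a middle name can dominate the decision.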

2.3. Feature Extraction

Each name is tokenized into individual words so that it can be characterized by a feature vector. The term frequency (TF) of a word is the number of times the word appears in a document divided by the total number of words in that document [40]:

\[ \mathrm{TF}(w, d) = \frac{f_{w,d}}{N_d} \]

where $w$ is a word in a document $d$, $f_{w,d}$ is the frequency of word $w$ in that document, and $N_d$ is the total number of words in the document.
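The term frequency defined above can be computed directly from word counts. The sketch below treats one full name as one document; the function name `term_frequency` is a hypothetical helper, and whitespace tokenization is an assumption.

```python
from collections import Counter


def term_frequency(words):
    """Map each word to (occurrences of the word) / (total number of words)."""
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


# One name treated as one document, split on whitespace.
tf = term_frequency("Nguyễn Thị Bạch Tuyết".split())
print(tf)
```

Because every word of a typical name occurs once, the TF values within a single name are usually equal; the feature becomes informative once the vectors of many names are compared by a classifier.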

In the computational linguistics domain, an N-gram is a contiguous sequence of N items from a given sequence of text. These items can be syllables, letters, words, or base pairs, according to the application. N-grams are typically extracted from a text corpus. Different values of N give N-grams of different sizes: size 1 is referred to as a “unigram,” size 2 as a “bigram,” and size 3 as a “trigram.” Larger sizes are sometimes referred to by the value of N, e.g., “four-gram” and “five-gram.”
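As an illustration of this definition, a generic N-gram extractor can be written as a sliding window over any sequence of items. The helper name `ngrams` is hypothetical, and the choice of words or characters as items is left to the caller.

```python
def ngrams(items, n):
    """Return all contiguous runs of n items, in order of appearance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # The window starts at every index that leaves room for n items.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


words = "Nguyễn Hòa Bình".split()
print(ngrams(words, 1))        # unigrams over words
print(ngrams(words, 2))        # bigrams over words
print(ngrams(list("Lan"), 2))  # bigrams over characters
```

When the sequence is shorter than N, the function returns an empty list, so short names simply contribute no features of that size.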

Based on the above characteristics, we extract features from the Vietnamese full name with the N-gram method. Figure 1 illustrates the scheme of gender determination based on Vietnamese names. The feature extraction stage applies TF extraction with different strategies for the middle name and the given name. The three classifiers are then used to predict gender.

2.4. Evaluation Metric

To evaluate the effectiveness of the proposed approach, we use the accuracy metric, which is the ratio of correctly predicted samples to the total number of samples in the test data. It depends on four quantities, TP (True Positive), TN (True Negative), FP (False Positive), and FN (False Negative), and is calculated by the following formula:

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
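The accuracy metric can be computed from the four counts as in the sketch below. Treating "female" as the positive class is an arbitrary assumption made for the example (accuracy does not depend on that choice), and the labels are toy values, not results from the study.

```python
def accuracy(y_true, y_pred, positive="female"):
    """Share of correct predictions, computed from TP, TN, FP, and FN."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if truth == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return (tp + tn) / (tp + tn + fp + fn)


# Toy labels for illustration only.
y_true = ["female", "male", "female", "male"]
y_pred = ["female", "male", "male", "male"]
print(accuracy(y_true, y_pred))
```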

3. Experiment and Results

We built a Vietnamese full-name dataset for gender determination tasks, namely, GenderVN1.0. By collecting lists of students from high school to undergraduate level, we obtained nearly 3 million names covering both genders. To our knowledge, this is the first large-scale dataset for this task. A data cleaning process is applied to remove duplicate names with the same gender annotation. The characteristics of the GenderVN1.0 dataset are illustrated in Table 2. The dataset generated during this study is available from the corresponding author upon request.

We split the GenderVN1.0 dataset into two disjoint subsets, a training set and a testing set, with a ratio of 60:40. Features are extracted from the training data and fed into the classifiers to build models. A typical Vietnamese name consists of three words: one for the family name, one for the middle name, and one for the first name. Here, we apply two strategies to extract features from Vietnamese names: using the full name, and using the name without the family name. Vietnamese names usually consist of 3 words for males and 4 words for females. Three values of N are considered to extract N-gram features. For each type of feature, we apply the three classifiers to predict gender independently. N-grams are basically sets of co-occurring words within a given window, computed by moving the window one word forward at a time. For example, for the common female name “Nguyễn Thị Bạch Tuyết,” extracting features with bigrams (N = 2) yields the following N-grams:
(i) Nguyễn Thị
(ii) Thị Bạch
(iii) Bạch Tuyết
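The 60:40 split and the two feature strategies can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the helper names `split_60_40`, `name_words`, and `bigrams` are hypothetical, the shuffle before splitting and the fixed seed are assumptions (the text does not say how records were assigned to each subset), and the family name is assumed to be the first word.

```python
import random


def split_60_40(records, seed=0):
    """Shuffle a copy of the records and cut it into 60% / 40% parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * 6) // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def name_words(full_name, use_family_name):
    """Strategy 1 keeps the full name; strategy 2 drops the first word."""
    words = full_name.split()
    return words if use_family_name else words[1:]


def bigrams(words):
    """Join each pair of adjacent words, moving one word forward each time."""
    return [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]


train, test = split_60_40(range(10))
print(len(train), len(test))

name = "Nguyễn Thị Bạch Tuyết"
print(bigrams(name_words(name, use_family_name=True)))
print(bigrams(name_words(name, use_family_name=False)))
```

With the full name, the bigrams match the three items listed above; without the family name, only the last two remain.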

Table 3 presents the prediction results on the testing set. We observe that, for both strategies, the best accuracy is obtained by using 1-gram extraction with logistic regression. The family name does not help to predict the gender of a person: the best accuracy, 90.9%, is achieved when the middle name and first name are used.

As mentioned in Section 2.1, the middle name of Vietnamese people plays a major role in gender determination. Table 4 presents the prediction results obtained by using only the middle name for feature extraction. The accuracy is around 76.0% for all considered classifiers and feature extraction methods. This result confirms that the middle name of Vietnamese people carries substantial information for predicting their gender.

4. Conclusion

In this paper, we presented a method for gender prediction from Vietnamese names. We provided the first large-scale dataset for this task, GenderVN1.0, with about 3 million Vietnamese names, each annotated with a gender. The experimental results show the effectiveness of the proposed approach, which achieves an accuracy of 90.9% for gender prediction on the GenderVN1.0 dataset. The experiments also demonstrated that the middle name alone plays a major role in gender prediction, with an accuracy of 76.1%. However, the proposed approach has several limitations. First, it recognizes well only the gender of names of Kinh people. Second, it cannot predict the gender of Vietnamese names combined with foreign words.

As future work, we first plan to improve the proposed approach by incorporating feature selection to remove irrelevant features and by combining it with deep features. A second perspective is to design a compact feature descriptor for Vietnamese names.

Data Availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by Ho Chi Minh City Open University, Vietnam.