1 Introduction

The populations of many primate species are declining rapidly due to habitat loss caused by deforestation. Many species of chimpanzees and gorillas have recently been added to the list of endangered and critically endangered animals in the International Union for Conservation of Nature (IUCN) Red List [1]. As deforestation shows no sign of slowing in the near future, there is an enormous risk that more primate species will be added to the list or that its current members will go extinct. To counter these catastrophic consequences, conservation of primates is the need of the hour.

Conversely, there are several areas of the world where the current population of primates has become a nuisance to urban dwellers. The ease of obtaining food and the relatively low competition in cities have resulted in a population explosion of such species (monkeys in particular). The growing population of monkeys is a health, safety and sanitation hazard for the local population. Monkeys are known to become aggressive when a human unknowingly invades their territory, and mostly respond by scratching or biting. As monkeys are known carriers of the virus that causes rabies, a lethal disease, it is essential to minimize such incidents. Monkeys have also reportedly been seen vandalizing important electric and telephone connections, endangering human lives. Hence, in this scenario, it is important to control their population.

Both scenarios require keeping effective track of individual primates and identifying those that require extra care or need to be controlled using effective mitigation techniques. Hence, it is important to construct a system that can accurately identify any primate in the wild. Current animal tracking systems are mostly invasive. For example, researchers have used GPS collars [3], which have to be strapped around the neck of the animal and can be monitored from a remote location. Another tracking approach, proposed by Kim et al. [13], employs both RFID and GPS tags embedded in the bodies of animals to detect whether they have escaped their cages in a zoo. Such arrangements are costly, unreliable and require significant human intervention in the lives of wild animals. In these approaches, animals first have to be drugged before any device can be attached to their bodies. Sometimes these collars or tags cause pain to the animals and hinder their daily activities. A few less intrusive methods are also employed, such as using a Wireless Sensor Network (WSN) to detect the movements and activities of turtles at the Wildlife Institute of India (WII) [12]. As the investment required for an invasive technique is high, in addition to the problems the animals have to face, shifting to non-obtrusive and non-invasive methods for recognition is a requirement.

A step in this direction has been taken by Mason et al. [18], who use the stripe patterns of tigers as a biometric modality to identify individuals. A similar approach has been applied to zebras, recognizing them from their stripe patterns [14]. In the process, several new biometric modalities specific to particular types of animals have been discovered and implemented, for example recognizing cattle from their nose-prints [2]. Non-obtrusive methods are relatively cheap to set up, as they only involve automatic camera traps that trigger photographs. Humans are also required to go into dangerous environments less frequently, saving time, resources and money.

However, building such a system poses a different set of challenges. For instance, face recognition of tigers or monkeys may not have prior literature or a database as a starting point. A major challenge in face detection is the high amount of background clutter in the image (due to heavy cover of trees, shrubs and mud), which interferes with the detection methods. Primates are not a very docile family of mammals, so most images contain side-profile or partially occluded faces. Hence, capturing images in a controlled environment is not feasible and images in the wild have to be considered. Another major challenge is that the structure of the eyes and nose of a primate is considerably different from that of a human; hence, traditional eye and nose detection methods do not work very well on them. As recognizing primates from their faces is a relatively new field, most of the research dealing with this problem has either manually cropped faces for the recognition step [4] or used traditional face detection algorithms inspired by the one developed by Viola and Jones [23]. A few research papers have used proprietary software [5] built on the same concept as existing face detectors. Furthermore, only a handful of them have considered the problem of straightening or normalizing faces after detection [24]. However, none of the research so far has tried adapting deep learning techniques for detection. Along with the challenge of detection, to the best of our knowledge, no work has tried to identify the various kinds of bias affecting primate detection in the wild.

In automated face recognition, both deep learning and non-deep learning based approaches have been tried. Researchers have used a variety of techniques to extract features from the detected faces, for example PCA [16], Fisherfaces [16], LBP features [4], and SURF [15]. More recent research has used deep learning methods to recognize primates [9]. As with detection, recognition performance has not been well studied with respect to the various biases.

Human face detection and recognition suffer from various kinds of biases arising from the race [20], age [26], ethnicity [22], etc. of the subjects, or from physical properties of the image such as quality [17] or orientation [25]. We conjecture that detection and recognition of primate faces also suffer from various kinds of biases arising from both intrinsic and extrinsic properties. It is worth exploring both intrinsic biases, such as the species of the primate to be detected, and extrinsic biases, such as the quality of the image, the orientation of the primate and the amount of noise in the image. Since the domain of primate faces is relatively new and not well studied, we investigate it further. In this paper, we present a dataset of primates, a deep learning based pipeline and experimental results that help us better understand the various biases found in primate face detection and recognition.

Fig. 1. Sample images of the primates in the dataset.

2 Dataset

The experiments are performed on two databases. The first was provided by the Wildlife Institute of India (WII), Dehradun, India. It contains manually captured candid images of a group of Rhesus Macaques (Macaca mulatta). The images are high-resolution shots \((3648 \times 2736)\) taken with a DSLR camera. They are a mix of front and side profile images, with a few back profile images as well. A few images contain more than one subject, especially mothers and their newborn babies. The images were accompanied by manual ground truth annotations around the subject’s face in XML format. However, the number of images per identity is non-uniform, ranging from a minimum of 4 to a maximum of 50. All images were manually inspected and those with total occlusion were removed. The cleaned dataset contains a total of 56 identities.

The second dataset was acquired from the Leipzig Zoo [9] and contains images of resident common chimpanzees (Pan troglodytes) and western gorillas (Gorilla gorilla). Again, the images are high-resolution candid shots \((1936 \times 1296)\), and are a mix of front and side profiles, with a few photos containing groups of chimpanzees and gorillas. However, there are no manual ground truth facial markings as in the previous collection. The number of images per identity is somewhat more uniform, ranging between 10–20. There are a total of 18 identities of chimpanzees and 6 identities of gorillas.

The two collections (WII Database and Leipzig Zoo Database) are combined to form a unified dataset with a total of 927 images spread over 80 identities. This dataset is used in the experiments that follow. Some example images from the dataset are shown in Fig. 1.

3 Primate Face Detection and Recognition Framework

The method to analyze various biases in detection and recognition of primates consists of two modules: detection-normalization and recognition. The details of both the components are summarized in Fig. 2 and explained in the following two subsections.

Fig. 2. The composite pipeline used for detection, normalization and recognition.

3.1 Detection and Normalization

Currently available face detectors are primarily trained for human face detection, yet they also detect other kinds of faces, including those of animals, sketches, and cartoons. However, in this research, we are only interested in detecting animal faces and all other faces are considered false positives. Tiny Faces [10] is a state-of-the-art face detector, which utilizes image resolution, spatial context and object scale information to detect faces. Using these novel descriptors, it fine-tunes pre-trained ImageNet models on existing deep learning architectures and achieves its best results using the ResNet101 architecture. The pre-trained Tiny Faces model was used on the primates dataset.

To analyze the biases due to extrinsic properties on primate detection, all images in the dataset were classified based on two properties - quality of the image and orientation of the subject primate in the image. Images with blurring, occlusion, camera shake, etc. were adjudged as bad quality images and all others with good face clarity were adjudged as good quality images. Similarly, images in which both eyes of the subject primate were clearly visible were adjudged as good orientation images, and all others, where one or both eyes were not visible due to overexposure, side profile or growth of fur, were adjudged as bad orientation images. Out of the total 927 images, 77 images were found to be of bad quality and 210 images were found to have bad orientation. The images in the dataset were also segregated by species - monkeys, gorillas and chimpanzees. Figure 3 shows sample output of the detector. Two challenges are observed in the results:

  • Occurrence of false positives - The false positives were mostly small patches of grass, leaves, and primate skin. This is because Tiny Faces particularly focuses on finding smaller faces and in the process ends up detecting more false positives.

  • No distinction between human and primate faces - This is due to the fact that Tiny Faces is primarily fine-tuned for human faces. Although the datasets considered in this research consist only of primate images, with no humans among them, this property is undesirable, as one would not want a human face to be detected as a primate. Both issues can be seen in Fig. 3.

Fig. 3. A few examples of false positives produced by the Tiny Faces detector. It can be seen that clutter in the background causes false positives. The detector also cannot differentiate between human and primate faces. Note that the top two images are for illustration purposes only and are not part of the dataset.

Due to the above two factors, the performance of the primate face detector is lacking. Therefore, we propose to further process the outputs of Tiny Faces through the following steps:

  1. An eye detector is trained and applied to the positive and negative output images to filter out the false positives. The dataset for this is prepared by manually extracting the regions containing the eyes from the detected primate faces. Negatives for the eye detector are manually cropped primate noses (to distinguish between eyes and nose) and random patches from the detected faces excluding the eyes.

  2. A CNN architecture comprising three convolutional layers, each with 16 filters and a ReLU activation function, is constructed for binary classification between eye and non-eye regions. The collection of about 700 cropped eye images and 1500 negatives is split randomly into train and test sets (70%–30%). The train set is used to train the CNN and the classification accuracy is computed on the test set. Finally, all the train images are combined to retrain the CNN. A sliding window of (\(60 \times 60\)) pixels is run over each output of Tiny Faces, resized to (\(300 \times 300\)) pixels, with the window step size set to 16 pixels both vertically and horizontally. The trained CNN model is applied to each window to filter out the false positives. Among overlapping windows detected as eyes, only the one with the highest score is kept. Finally, the two windows with the highest scores are chosen for each image and identified as eyes. A few samples of the training images can be seen in Fig. 4.

  3. Once the eyes are detected, the distance between the mid points of the eyes is calculated. The nose point (or pivot point) is computed using heuristics −0.7 times the distance between eyes in this case. The position of the nose point is measured from the mid point of the line segment joining the mid points of the eyes. Subsequently, the rotation angle is determined as the angle formed with the normal by the line joining the nose point and this mid point. The results of normalization can be seen in Fig. 6.

  4. The model constructed may still not be able to distinguish between human and primate faces. Hence, a model is trained using Histogram of Oriented Gradients (HOG) features and AdaBoost as the classifier. The training data include about 350 primate images (obtained from the training dataset) and about 400 images of human faces (randomly chosen from the Labelled Faces in the Wild (LFW) dataset [11]). The model is cross-validated over 5 folds and then applied to the images that pass the CNN-based filter.
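The sliding-window eye selection in step 2 can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: `score_fn` is a hypothetical stand-in for the trained CNN, the decision threshold of 0.5 is an assumption, and treating any shared area as an overlap is one possible reading of "overlapping windows".

```python
from typing import Callable, List, Tuple

Window = Tuple[int, int, float]  # (x, y, score); x, y is the window's top-left corner


def window_origins(image_size: int = 300, win: int = 60, step: int = 16) -> List[int]:
    """Top-left coordinates along one axis for a window that stays inside the image."""
    return list(range(0, image_size - win + 1, step))


def overlaps(a: Window, b: Window, win: int = 60) -> bool:
    """True if two equally sized square windows share any area (assumed overlap rule)."""
    return abs(a[0] - b[0]) < win and abs(a[1] - b[1]) < win


def pick_eyes(score_fn: Callable[[int, int], float], image_size: int = 300,
              win: int = 60, step: int = 16, threshold: float = 0.5,
              max_eyes: int = 2) -> List[Window]:
    """Score every window, then greedily keep the best non-overlapping ones."""
    origins = window_origins(image_size, win, step)
    candidates = [(x, y, score_fn(x, y)) for y in origins for x in origins]
    # Keep only windows the (hypothetical) classifier labels as "eye".
    candidates = [c for c in candidates if c[2] >= threshold]
    candidates.sort(key=lambda c: c[2], reverse=True)
    kept: List[Window] = []
    for cand in candidates:
        # Among overlapping windows only the highest-scoring one survives.
        if all(not overlaps(cand, k, win) for k in kept):
            kept.append(cand)
        if len(kept) == max_eyes:
            break
    return kept
```

With the stated settings, a 300-pixel side, a 60-pixel window and a 16-pixel step give 16 window positions per axis, i.e. 256 windows per detected face.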

Fig. 4. Samples of the eyes and negative patches (including noses) used for training the CNN. The top five images are those of eyes and the bottom five of negatives (first 3 are noses).
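The geometry of the normalization step (step 3 in Sect. 3.1) can be illustrated with a short sketch. Several details here are assumptions and not taken from the experiments: image coordinates with y growing downwards, a nose point placed 0.7 times the inter-eye distance from the eye mid point along the perpendicular towards the lower face, and the vertical image axis taken as the "normal". The function name is hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def nose_point_and_angle(eye_a: Point, eye_b: Point,
                         ratio: float = 0.7) -> Tuple[Point, float]:
    """Return the heuristic nose (pivot) point and the in-plane rotation angle in degrees.

    Assumes image coordinates: x grows to the right, y grows downwards.
    """
    left, right = sorted([eye_a, eye_b])  # order the eye centres by x coordinate
    dx, dy = right[0] - left[0], right[1] - left[1]
    dist = math.hypot(dx, dy)
    if dist == 0:
        raise ValueError("eye centres coincide")
    mid = ((left[0] + right[0]) / 2.0, (left[1] + right[1]) / 2.0)
    # Unit vector perpendicular to the eye line, pointing towards the lower face.
    perp = (-dy / dist, dx / dist)
    nose = (mid[0] + ratio * dist * perp[0], mid[1] + ratio * dist * perp[1])
    # Angle between the mid-point-to-nose line and the vertical image axis.
    ux, uy = nose[0] - mid[0], nose[1] - mid[1]
    angle = math.degrees(math.atan2(-ux, uy))
    return nose, angle
```

For level eyes at (100, 100) and (200, 100) the sketch places the nose point at (150, 170) with a rotation of 0 degrees. The face crop would then be rotated about the nose point by the returned angle; the sign convention for that rotation depends on the image library used.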

3.2 Recognition

To study bias in recognition, we used the following existing face recognition algorithms and tested them independently on different training and test sets.

  • Principal Component Analysis (PCA) [21] - PCA is applied on the training database to compute the eigenvectors. 115 principal components, corresponding to 95% eigen-energy conservation, are utilized. For testing, gallery and probe images are transformed to the trained vector space and Euclidean distance is used to find the nearest match in the gallery for each probe image.

  • Linear Discriminant Analysis (LDA) [7] - 28 components are extracted, corresponding to 95% eigen-energy conservation. Again, for testing, gallery and probe images are transformed to the trained vector space and Euclidean distance is used to find the nearest match in the gallery for each probe image.

  • VGG-Face [19] - It is an adapted VGG-16 architecture for the recognition task, trained on 2622 human identities, and is one of the best models for human facial recognition. For testing, the pre-trained weights of the VGG-Face model are used and two different sets of features are extracted for the gallery and probe images of the test set:

    1. Last Fully Connected Layer (fc8)

    2. Second Last Fully Connected Layer (fc7) along with last MaxPool layer (pool5).

    Here the cosine distance between gallery and probe features is used to compute the nearest match.

  • VGG-Face with Finetuning - All the fully-connected layers of the VGG-Face model are finetuned using the training data. For testing, these finetuned weights are used. As with VGG-Face, two different sets of features are extracted and cosine distance is used to compute the nearest match.
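Two operations recur across the matchers above: choosing the number of components that conserve 95% of the eigen-energy, and finding the nearest gallery match under a distance measure. The following is a minimal sketch of both, assuming the eigenvalues and feature vectors have already been computed; the function names are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def components_for_energy(eigenvalues: Sequence[float], energy: float = 0.95) -> int:
    """Smallest number of leading components whose eigenvalues reach `energy` of the total."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running >= energy * total:
            return k
    return len(ordered)


def cosine_distance(a: Vector, b: Vector) -> float:
    """One minus the cosine similarity of two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_identity(probe: Vector, gallery: Dict[str, List[Vector]],
                     distance: Callable[[Vector, Vector], float] = cosine_distance) -> str:
    """Identity whose closest gallery feature has the smallest distance to the probe."""
    return min(gallery,
               key=lambda identity: min(distance(probe, g) for g in gallery[identity]))
```

Passing `distance=math.dist` (available from Python 3.8) gives the Euclidean matching described for PCA and LDA, while the default corresponds to the cosine matching described for the VGG-Face features.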

Fig. 5. ROC curve for primate face detection using Tiny Faces face detector.

4 Experiment Protocol and Results

The dataset contains 927 primate images spread over 80 identities. The experiments are performed with random cross-validation repeated four times, with 50–50% train-test partitioning. The train and test partitions in each fold have no overlap in terms of identities.
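An identity-disjoint split of this kind can be sketched as below. This is an illustration under assumptions: the dictionary layout (identity mapped to image file names), the use of a per-fold random seed, and the function names are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

ImageIndex = Dict[str, List[str]]  # identity -> image file names


def split_by_identity(images: ImageIndex, train_fraction: float = 0.5,
                      seed: int = 0) -> Tuple[ImageIndex, ImageIndex]:
    """Split whole identities, so that no identity appears in both partitions."""
    identities = sorted(images)  # fixed order, so the seed alone determines the split
    random.Random(seed).shuffle(identities)
    n_train = round(len(identities) * train_fraction)
    train = {i: images[i] for i in identities[:n_train]}
    test = {i: images[i] for i in identities[n_train:]}
    return train, test


def cross_validation_folds(images: ImageIndex, n_folds: int = 4,
                           train_fraction: float = 0.5) -> List[Tuple[ImageIndex, ImageIndex]]:
    """Repeat the random identity split once per fold, each with its own seed."""
    return [split_by_identity(images, train_fraction, seed=fold) for fold in range(n_folds)]
```

With 80 identities and a 50–50% partition, each fold has 40 training and 40 test identities.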

4.1 Detection and Normalization

For the first experiment, the Tiny Faces detector is run over all the images in the dataset with a confidence threshold of 0.5. Out of the total 927 images, the detector returned 920 true positive results, along with 352 false positives. The results for the different biases are depicted as ROC curves in Figs. 11 and 12. Any detection smaller than \((10 \times 10)\) pixels is discarded, assuming a reasonable minimum face size. Once the faces are detected, the true positives and the false positives are cropped out. Figure 5 shows the overall detection ROC curve. The identities are then split into folds as described above.
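The two filters applied to the raw detections, a confidence threshold of 0.5 and a minimum size of \((10 \times 10)\) pixels, can be sketched as follows. The box format is hypothetical, and requiring both sides to be at least 10 pixels is an assumed reading of the size rule.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    """Axis-aligned box with a confidence score (hypothetical format)."""
    x1: float
    y1: float
    x2: float
    y2: float
    score: float


def filter_detections(detections: List[Detection], min_score: float = 0.5,
                      min_side: float = 10.0) -> List[Detection]:
    """Keep boxes that are confident enough and at least min_side pixels on each side."""
    return [d for d in detections
            if d.score >= min_score
            and (d.x2 - d.x1) >= min_side
            and (d.y2 - d.y1) >= min_side]
```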

Fig. 6. Primate face detection and normalization pipeline: original image, face detection, cropped faces, eye detection, and normalization using samples from the dataset.

All cropped true positives are collected together and a CNN (as described in Sect. 3.1) is trained with a 70–30 split. The negatives are populated by random patches and cropped nose images (Fig. 4). The accuracy obtained on the test set is 85.58%. Once the trained model is applied on the test set, the images classified as primates are kept and the rest are discarded.

The true positive images are again split 70–30 and an AdaBoost classifier [8] using HOG features [6] is trained. As mentioned in Sect. 3.1, the classifier is trained with the 70% partition as positives and random images from the LFW dataset as negatives. The classification accuracy on the 30% partition was found to be 99.08%. The images in the test set are filtered twice, first by the CNN model and then by the AdaBoost classifier. It can be safely concluded that the images in the test set only contain primate faces. A few examples of normalized images are shown in Fig. 6.

Fig. 7. Top 5 matches for a few sample probe images.

4.2 Recognition

The recognition experiments are performed with two different gallery sizes: 2 and 5. The recognition performance is computed for ranks ranging from 1 to the gallery size. A better rate at the smaller gallery size would mean that the trained model can learn the features of new identities well from fewer images. On the other hand, a better rate at the larger gallery size would mean that, given a sufficient number of images, the trained model can learn the features of new identities well. The results for the recognition baselines can be seen in Figs. 8 and 9. The results obtained on different train and test sets are shown in Table 1.
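The rank-based accuracies reported in the CMC curves can be computed along the following lines. This is a minimal sketch: it assumes each probe already has a list of gallery identities ordered from best to worst match, and it counts a probe as correct at rank k if its true identity is among the first k distinct identities. The function name is hypothetical.

```python
from typing import List, Sequence


def cmc(rankings: Sequence[Sequence[str]], true_ids: Sequence[str],
        max_rank: int) -> List[float]:
    """Cumulative match characteristic: entry k-1 is the rank-k identification rate."""
    hits = [0] * max_rank
    for ranking, truth in zip(rankings, true_ids):
        # Collapse repeated gallery images of one identity, keeping the best position.
        distinct: List[str] = []
        for identity in ranking:
            if identity not in distinct:
                distinct.append(identity)
        if truth in distinct[:max_rank]:
            first = distinct.index(truth)
            for rank in range(first, max_rank):
                hits[rank] += 1
    return [h / len(true_ids) for h in hits]
```

The rank-1 entry is the usual identification accuracy, and the curve is non-decreasing in the rank by construction.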

Fig. 8. CMC curve for recognition using different methods when gallery size is fixed at maximum 2 images from each identity.

Fig. 9. CMC curve for recognition using different methods when gallery size is fixed at maximum 5 images from each identity.

4.3 Bias Analysis

Detection: The results for the extrinsic bias of quality can be seen in Fig. 11 and those for orientation in Fig. 12. As expected, the performance of the face detector is better on good quality images than on bad quality images. Similarly, detection results on good orientation images are better than those on bad orientation images. Hence, we can conclude that images of poor quality, i.e. those with blur, pixelation, camera shake, overexposure, occlusion, etc., are prone to detection errors such as partial or wrong detection. The results for the intrinsic bias of species can be seen in Fig. 10. As Tiny Faces is predominantly trained on human faces, the species whose facial features most closely resemble those of humans are detected better.
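A per-group comparison of this kind can be sketched as below. How the ROC curves in the figures were computed is not specified here, so this is only one plausible formulation: for each confidence threshold it reports the number of false positives and the fraction of ground-truth faces detected, separately for each group (for example good versus bad quality). The data layout and function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Detection = Tuple[float, bool]  # (confidence score, matches a ground-truth face?)


def roc_points(detections: Sequence[Detection], n_faces: int,
               thresholds: Sequence[float]) -> List[Tuple[int, float]]:
    """For each threshold, return (number of false positives, fraction of faces detected)."""
    points = []
    for t in thresholds:
        kept = [d for d in detections if d[0] >= t]
        true_pos = sum(1 for _, is_face in kept if is_face)
        false_pos = len(kept) - true_pos
        points.append((false_pos, true_pos / n_faces))
    return points


def roc_by_group(detections: Dict[str, Sequence[Detection]], n_faces: Dict[str, int],
                 thresholds: Sequence[float]) -> Dict[str, List[Tuple[int, float]]]:
    """Compute the curve separately for every group, e.g. per quality label or species."""
    return {group: roc_points(detections[group], n_faces[group], thresholds)
            for group in detections}
```

Plotting the resulting points per group on shared axes makes a gap between groups, and hence a bias, directly visible.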

Fig. 10. ROC curve for detection across various species in the dataset - Monkeys, Chimpanzees and Gorillas.

Fig. 11. ROC curve for good and bad quality images using TinyFaces face detector.

Fig. 12. ROC curve for good and bad orientation images using TinyFaces face detector.

Recognition: The values for the Rank 1, 3 and 5 accuracies are summarized in Table 1 for different training sets. It can be seen that the predominant species in the training set is better recognized than the less dominant ones.

  • When the training set consists of a single species, the test results of that species on its own test set are usually the highest among all.

  • When the training set consists of multiple species, the test results on the test sets of species present in the training set are higher than on the others. Among the training species, the test results are better on the species with the larger training set.

  • It must be noted that the number of images in the Gorilla set is much lower than for monkeys and chimpanzees. Consequently, for higher rank cases, the accuracy for Gorilla faces is abnormally high, even reaching 100% at Rank-5.

  • The high variability in the performance of the VGG-Face model can be attributed to the very limited training data used for fine-tuning, which never exceeds around 350.

Table 1. Rank 1, 3 and 5 accuracy percentage values using techniques - (a) PCA, (b) LDA, (c) VGG-Face Fine-Tuned (VF-FT) - for different training and test sets - (a) M-Monkey, (b) C-Chimpanzee, (c) G-Gorilla. The training set includes half of the randomly chosen identities of monkeys, gorillas and chimpanzees and their unique combinations - tested over 2 folds.

4.4 Scope for Improvement

Although the detection pipeline achieves good results, the presence of biases (intrinsic and extrinsic), as discussed, leads to improper detection of faces. The detection does not work well if these biases of quality, occlusion, orientation, species difference, etc. exist in the training images. Similarly, the recognition performance is highly sensitive to illumination and face profile if these are not handled properly beforehand, as can be seen in Fig. 7. Additionally, the recognition performance is significantly affected by the imbalance of training data across species. Devising strategies for overcoming these challenges can lead to a substantial increase in recognition performance (Fig. 13).

Fig. 13. Examples of bias in primate images from the dataset. Row 1 consists of images which are bad in quality and row 2 contains images where the orientation of primates is not good, denoting extrinsic biases. Row 3 contains example images of different species in the dataset, which if not balanced may lead to an intrinsic bias.

5 Conclusion and Future Work

The paper presents a method to analyze biases in primate face detection and recognition. Experimental results on a primate dataset show that a deep learning based face detector trained for humans also yields satisfactory outputs on primates. However, both intrinsic and extrinsic biases that may exist in human face detection or recognition models also extend to primate faces. We observe that the face detection problems can be handled reasonably well by employing simple techniques, e.g., shallow CNN based eye detectors paired with spatial constraints, and the use of ensemble learning techniques. However, the algorithm is still not fully immune to the biases that exist in primate face datasets, and a better understanding of these biases is required.

Further, a deep learning based architecture, VGG-Face, significantly outperforms other methods given a sufficiently large gallery size. However, an ideal recognition system should perform equally well with a smaller gallery size, as well as handle the different intrinsic and extrinsic biases that may exist. Our future work will focus on developing such a system, one that is robust to the different types of biases and whose recognition rates for larger and smaller gallery sizes are comparable.