1 Introduction

Malaria is caused by protozoan parasites of the genus Plasmodium and is transmitted by the bite of the Anopheles mosquito. Four species of the Plasmodium parasite infect humans: P. falciparum, P. vivax, P. ovale and P. malariae. The parasite’s lifecycle within humans can be divided into three distinct stages: trophozoite, schizont, and gametocyte. During this lifecycle, human red blood cells (RBCs) are used as hosts. The shape and size of the parasite differ with the species and the lifecycle stage. However, in each lifecycle stage the parasite has at least one chromatin dot, which is the nucleus of the parasite. A measure of the severity of the infection, called parasitaemia, is the ratio of the number of parasite-infected RBCs to the total number of RBCs. It is an important determinant in selecting the appropriate treatment and drug dose.

Currently, clinical diagnosis relies on microscopic examination of prepared blood smears. However, this is extremely time consuming and depends on the skill and experience of the examiner, and hence has limited reliability. It is therefore important to develop an automated image analysis system that can identify and count infected and uninfected RBCs in images of blood smears. Further, in the clinical process two different types of blood smear (thick and thin) are produced. The thin smear is used to identify the type of parasite and the thick smear to estimate parasitaemia. However, the complete information is available in the thin smear.

In this paper we present a technique for estimating parasitaemia in images of stained thin blood smears. The technique is computationally efficient and adapts automatically to variations between images, such as magnification and object orientation. It also addresses the challenges posed by dust or leftover stain on the slide. The method uses 4-connected sets to identify the different regions in the image. Properties of these regions, such as area, centroid coordinates, major and minor axes, and Euler numbers, are used to take decisions. The key challenges in the counting process are: 1) partially visible cells at the boundary of the image and 2) overlapped cells. The image is first processed to identify and remove any external artifacts. The total number of cells present in the field of view is then estimated, followed by the number of infected cells. From these two counts the degree of parasitaemia is calculated.

The technique, however, assumes that the presence of the parasite in the image has already been established by other automated techniques [5–9]. Further, the image presented for analysis should be a true colour RGB image captured with adequate magnification and resolution. It should be noted that this method cannot be applied to images of thick smears.

2 Literature Review

Most of the available literature concentrates on segmenting the chromatin dots within the RBC [5–11] to establish the presence of the parasite. Of the available literature, only [3, 4] concentrate on measuring parasitaemia. Of these, [4] uses an image mosaicing system to study the partially visible cells at the edge of the field of view (FoV). This may not be realizable in a practical clinical situation, and hence has limited utility. The method described in [3] uses a synthesized template of an RBC, parameterized by a variable eccentricity and major axis. The shape of the RBC is assumed to be elliptical. This template matching process gives less than acceptable results for overlapped cells, for partially visible cells at the edge of the FoV, and for cells oriented at an angle, because all of these match the template poorly. Besides, the method used to arrive at the radius of the RBC is very slow.

Hence there is a need for a method that is not affected by the shape, size, orientation, or location of the RBC in the FoV.

3 Methodology

The analysis process consists of six distinct steps (Fig. 1). In the first step the image is checked for its suitability for this study. The image is expected to be a true colour RGB image. In addition, the smear should have been stained with Giemsa or an equivalent stain. This stain renders a specific colour to each portion of the parasite [2], irrespective of the lifecycle stage of the parasite: chromatin stains a deep red, cytoplasm stains a shade of blue (the exact colour varies from species to species), RBCs become a pale yellowish pink, and vacuoles do not take up any colour. To verify that the smear has been stained appropriately, a copy of the image is converted to the HSI colour space and the pixel population in each of the six hue ranges indicated in Table 1 is counted. The entire hue range of 0–360° is divided into six segments, each 60° wide and centered on a colour indicated in Table 1. The range of hues associated with each colour accommodates the colour variation between images. An image is accepted for analysis only if it exhibits a population in the red and blue ranges.

Fig. 1 Process flow

Table 1 Table of hue ranges

This first step is, however, required only if this system is used as a standalone system. It is not required if the process is integrated with another system for detecting the presence of malaria.
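The stain check described above can be sketched as follows. This is a minimal illustration, not the prototype's code: it assumes that the six segments are centered at 0°, 60°, ..., 300° and named red, yellow, green, cyan, blue and magenta (Table 1 is not reproduced here), it uses the HSV hue from Python's standard `colorsys` module as a stand-in for the HSI hue, and the function names and the `min_pixels` parameter are hypothetical.

```python
import colorsys

# Assumed segment names; each segment is 60 degrees wide and centered on
# 0, 60, ..., 300 degrees. The actual Table 1 entries may differ.
SEGMENTS = ["red", "yellow", "green", "cyan", "blue", "magenta"]


def hue_segment(r, g, b):
    """Return the hue segment of one 8-bit RGB pixel, or None if it is grey."""
    h, s, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    if s == 0.0:
        return None  # hue is undefined for achromatic pixels
    # Shift by half a segment so that each bin is centered on its colour.
    index = int(((h * 360.0 + 30.0) % 360.0) // 60.0)
    return SEGMENTS[index]


def hue_population(pixels):
    """Count the pixels that fall in each of the six hue segments."""
    counts = {name: 0 for name in SEGMENTS}
    for r, g, b in pixels:
        name = hue_segment(r, g, b)
        if name is not None:
            counts[name] += 1
    return counts


def accept_image(pixels, min_pixels=1):
    """Accept the image only if both red and blue hues are populated."""
    counts = hue_population(pixels)
    return counts["red"] >= min_pixels and counts["blue"] >= min_pixels


# Usage with two hand-made pixels: one reddish, one bluish.
print(accept_image([(200, 30, 40), (40, 60, 200)]))
```

A practical check would probably need a `min_pixels` value larger than 1 so that isolated noisy pixels do not pass the test; the text does not state what population is required.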

A candidate image identified by the previous step is then converted to a gray scale image. This image is used to count RBCs.

Figure 2 shows extracts from images of thin blood smears that have been stained to reveal malaria parasites. These are gray scale versions of the original true colour images. This set is used to demonstrate the technique. Dataset 1 is also used to explain the methodology.

Fig. 2 Extracts from malaria positive images of thin smears

The gray image is then converted to a binary image by thresholding. To do this, we use the fact that the image has a predominant background whose intensity is distinct from that of the foreground. The histogram is therefore bimodal (Fig. 3a), and Otsu’s method is used to arrive at the threshold value. This threshold was used to create the binary image shown in Fig. 3b.

Fig. 3 a Histogram of dataset 1 and b thresholded binary image
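The thresholding step can be illustrated with a plain Python version of Otsu's method, which picks the gray level that maximizes the between-class variance of the histogram. This is only a sketch: the function names are hypothetical, and the convention that bright pixels map to 1 (white background) and dark pixels to 0 (cells) is an assumption chosen to match the description of the binary image in the text.

```python
def otsu_threshold(histogram):
    """Return the gray level t that maximizes the between-class variance.

    histogram[i] is the number of pixels with gray level i. Pixels with
    level <= t form the dark class, the remaining pixels the bright class.
    """
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))
    best_t, best_var = 0, -1.0
    weight_low, sum_low = 0, 0
    for t, count in enumerate(histogram):
        weight_low += count
        sum_low += t * count
        weight_high = total - weight_low
        if weight_low == 0 or weight_high == 0:
            continue  # one class is empty, the variance is undefined
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        var_between = weight_low * weight_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray, threshold):
    """Map bright background pixels to 1 (white) and dark cells to 0 (black)."""
    return [[1 if pixel > threshold else 0 for pixel in row] for row in gray]


# Usage on a synthetic bimodal histogram (dark cells, bright background).
hist = [0] * 256
hist[50], hist[200] = 100, 300
t = otsu_threshold(hist)
print(t, binarize([[50, 200], [200, 50]], t))
```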

The binary image shows two types of imperfections: 1) artifacts in the background region (caused by dust, foreign bodies on the slide, etc.), and 2) holes within the RBCs. Both issues need to be addressed before proceeding, and a similar procedure is used for both. Issue 1 is addressed before issue 2. Essentially, the 4-connected set property of digital images is used to identify the various regions in the image, and the area of each region is used to take decisions. The analysis is done on a copy of the binary image. When pixels are identified for correction, the correction is applied to the original binary image.

The artifacts in the background region are black objects on a white background. To remove them, a digital negative of the copy of the binary image is converted into a 4-connected region labeled image. The population of each labeled region indicates the area occupied by that region, and this population data is used to identify the small artifacts. Any region with an area of less than 0.3 % of the image size was erased, and the corresponding pixels in the binary image were marked as background pixels. This also removed some very small partially visible cells at the boundary of the image. Since such cells are not counted, this is not an issue. The result is shown in Fig. 4a.
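A rough sketch of this artifact removal step is given below. It labels 4-connected regions with a breadth-first flood fill and erases every region smaller than a given fraction of the image. The function names are hypothetical, and the convention of 1 for white background and 0 for black objects is an assumption taken from the description above.

```python
from collections import deque


def label_4_connected(mask):
    """Label 4-connected regions of non-zero pixels; return (labels, areas)."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    areas = {}
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or labels[r][c]:
                continue
            next_label += 1
            labels[r][c] = next_label
            queue = deque([(r, c)])
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                # Only the four edge neighbours count as connected.
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
            areas[next_label] = area
    return labels, areas


def remove_small_objects(binary, fraction=0.003):
    """Erase black objects smaller than `fraction` of the image area.

    binary uses 1 for white background and 0 for black objects.
    """
    rows, cols = len(binary), len(binary[0])
    min_area = fraction * rows * cols
    negative = [[1 - p for p in row] for row in binary]  # objects become 1
    labels, areas = label_4_connected(negative)
    cleaned = [row[:] for row in binary]
    for r in range(rows):
        for c in range(cols):
            lab = labels[r][c]
            if lab and areas[lab] < min_area:
                cleaned[r][c] = 1  # mark as background
    return cleaned
```

For a 300 × 300 image the default fraction of 0.3 % corresponds to 270 pixels.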

Fig. 4 a Dataset 1 after removing artifacts from background, and b final binary image

Fig. 5 a Partially visible cells identified, and b partially overlapped cells identified

Unlike the artifacts in the background, the holes within the RBCs are white objects on a black background. Hence the corrected binary image itself, rather than its negative, was converted into a 4-connected region labeled image. As before, the population of each labeled region was calculated, and the regions with a population less than a threshold were marked as foreground pixels in the binary image. The threshold is the area of a circle whose radius is 60 % of the radius of an RBC. The result is shown in Fig. 4b. The method used to measure the radius of the RBCs is described later.
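As a worked example of this threshold, the short sketch below evaluates the area of a circle with 60 % of the RBC radius, using the diameter of 36 pixels reported for dataset 1 in Sect. 4. The function name and its parameters are hypothetical.

```python
import math


def hole_area_threshold(rbc_diameter, fraction=0.6):
    """Area of a circle whose radius is `fraction` of the RBC radius."""
    radius = rbc_diameter / 2.0
    return math.pi * (fraction * radius) ** 2


# Dataset 1: diameter 36 px, radius 18 px, reduced radius 10.8 px.
print(round(hole_area_threshold(36), 1))  # about 366.4 pixels
```

White regions smaller than this area would be treated as holes and filled.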

The corrected binary image is now used to count the RBCs present in it. The RBCs appear as distinct objects in the FoV. The visible objects can be categorized into two groups: 1) partially visible RBCs at the edges of the FoV, and 2) objects within the FoV. The second group consists of free standing RBCs and overlapped RBCs. These three categories can be easily differentiated once the image is converted into a 4-connected region labeled array. Features of the disjoint regions, such as the coordinates of the centroids and the lengths of the major and minor axes, are used to arrive at decisions. The data for the regions in dataset 1 are shown in Table 2.

Table 2 Properties of regions

The first step in the counting process is to determine the diameter of the RBCs. This varies from image to image, depending on the magnification used to capture the image. Some variation is also expected within an image, due to infection or the orientation of the RBC at the time of image capture. The ratio of the major axis to the minor axis is computed for every identified region. Regions for which this ratio is approximately 1 represent free standing RBCs. The minor axis of the region with the minimum ratio (the ratio closest to 1) is taken as the diameter of the RBC, in pixels. The measure is rounded to the nearest integer.
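The diameter estimate can be sketched as below. The region dictionaries, their keys and the example values are hypothetical and do not come from Table 2. The check that the chosen region lies fully inside the FoV follows the precaution described with the experimental results, where the centroid must be farther than one diameter from every edge.

```python
def estimate_rbc_diameter(regions, width, height):
    """Return the rounded minor axis of the most circular interior region.

    Each region is a dict with 'major', 'minor' (axis lengths in pixels)
    and 'centroid' as (x, y). Returns None if no suitable region exists.
    """
    # Try the regions in order of increasing major/minor ratio.
    for reg in sorted(regions, key=lambda r: r["major"] / r["minor"]):
        diameter = int(reg["minor"] + 0.5)  # round to the nearest integer
        x, y = reg["centroid"]
        # Reject regions that may be cut off by the image boundary.
        if min(x, y, width - x, height - y) > diameter:
            return diameter
    return None


# Illustrative (made-up) regions in a 300 x 300 image.
regions = [
    {"major": 37.1, "minor": 36.3, "centroid": (150, 150)},
    {"major": 70.0, "minor": 35.0, "centroid": (100, 200)},
    {"major": 30.0, "minor": 29.8, "centroid": (10, 150)},  # touches the edge
]
print(estimate_rbc_diameter(regions, 300, 300))  # 36
```

Returning `None` makes the failure case explicit when no free standing RBC lies inside the FoV.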

The next step is to count the partially visible cells at the edge of the FoV. These cells have their centroids at a distance of less than the diameter of an RBC from an edge of the image. The distance of the centroid from each of the four edges of the image is considered for this decision. Partially visible cells can also be overlapped. For such cells the ratio of the major to the minor axis is greater than 1. The number of RBCs in such a region is estimated from the ratio of its major axis to the diameter of an RBC, and this estimate is used to correct the count of partially visible cells. This resolution is, however, done only if the minor axis is more than 80 % of the diameter of an RBC.
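A minimal sketch of this edge-cell count is shown below. The region format, the function names and the example values are hypothetical. The sketch counts every edge region as one cell unless its minor axis exceeds 80 % of the diameter, in which case the rounded ratio of the major axis to the diameter is used.

```python
def edge_distance(centroid, width, height):
    """Smallest distance from the centroid (x, y) to the four image edges."""
    x, y = centroid
    return min(x, y, width - x, height - y)


def count_partially_visible(regions, diameter, width, height,
                            min_minor_fraction=0.8):
    """Count the cells in regions whose centroid lies near an image edge."""
    count = 0
    for reg in regions:
        if edge_distance(reg["centroid"], width, height) >= diameter:
            continue  # interior region, handled in a later step
        cells = 1
        # Resolve overlaps only when the region is wide enough.
        if reg["minor"] > min_minor_fraction * diameter:
            cells = max(1, int(reg["major"] / diameter + 0.5))
        count += cells
    return count


# Illustrative (made-up) regions in a 300 x 300 image, diameter 36 px.
regions = [
    {"major": 36.0, "minor": 12.0, "centroid": (5, 100)},     # sliver: 1 cell
    {"major": 70.0, "minor": 34.0, "centroid": (150, 290)},   # overlap: 2 cells
    {"major": 36.5, "minor": 36.0, "centroid": (150, 150)},   # interior
]
print(count_partially_visible(regions, 36, 300, 300))  # 3
```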

The next step is to count the free standing RBCs. As discussed before, such cells have a ratio of major to minor axis of approximately 1 and are located within the FoV.

The partially overlapped RBCs are identified as regions whose ratio of major to minor axis is greater than a threshold value. For such regions the ratios of the major and minor axes to the diameter of an RBC are used to arrive at the number of overlapped RBCs in the region.

The sum of the count of the free standing cells, partially visible cells and overlapped cells gives the count of RBCs present in the image.
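The classification of interior regions and the final sum can be sketched as follows. The ratio threshold of 1.3 and the 80 % rule are the values reported with the experimental results. How the two rounded axis ratios are combined is not spelled out in the text, so multiplying them is an assumption of this sketch, as are the function names, the region format and the example values.

```python
def round_half_up(value):
    """Round a positive number to the nearest integer."""
    return int(value + 0.5)


def count_interior_cells(regions, diameter, ratio_threshold=1.3,
                         min_minor_fraction=0.8):
    """Return (free_standing, overlapped) cell counts for interior regions."""
    free, overlapped = 0, 0
    for reg in regions:
        ratio = reg["major"] / reg["minor"]
        too_narrow = reg["minor"] < min_minor_fraction * diameter
        if ratio <= ratio_threshold or too_narrow:
            free += 1  # treated as a single free standing cell
        else:
            n_major = max(1, round_half_up(reg["major"] / diameter))
            n_minor = max(1, round_half_up(reg["minor"] / diameter))
            overlapped += n_major * n_minor  # assumed combination rule
    return free, overlapped


def total_cells(partially_visible, free, overlapped):
    """Total RBC count in the image."""
    return partially_visible + free + overlapped


# Illustrative (made-up) interior regions, diameter 36 px.
regions = [
    {"major": 36.5, "minor": 36.0},   # ratio about 1: free standing
    {"major": 72.0, "minor": 37.0},   # two cells side by side
    {"major": 50.0, "minor": 27.7},   # high ratio but narrow: free standing
]
free, overlapped = count_interior_cells(regions, 36)
print(free, overlapped, total_cells(3, free, overlapped))  # 2 2 7
```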

To count the infected RBCs, we need an image with the chromatin dots segmented. This can be obtained by any of the methods described in [5–11]. Figure 6 shows an image generated by one such method. A digital negative of this image is added to the image in Fig. 4b, and the 4-connected set data and region properties are regenerated. The Euler number of each region then indicates whether the region has another region embedded within it, i.e., whether it is an infected RBC.
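The Euler number test can be illustrated with the sketch below. For a single connected region the Euler number equals 1 minus the number of holes, so a value below 1 signals an embedded chromatin dot. The function names are hypothetical, the mask is assumed to contain exactly one 4-connected region (1 for region pixels, 0 elsewhere), and holes are traced with 8-connectivity, which is the usual complement of a 4-connected foreground.

```python
from collections import deque

NEIGHBOURS_8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                (0, 1), (1, -1), (1, 0), (1, 1)]


def count_holes(region_mask):
    """Count background components fully enclosed by the region."""
    rows, cols = len(region_mask) + 2, len(region_mask[0]) + 2
    # Pad with a ring of background so the outside forms one component.
    padded = ([[0] * cols]
              + [[0] + list(row) + [0] for row in region_mask]
              + [[0] * cols])
    seen = [[False] * cols for _ in range(rows)]
    components = 0
    for r in range(rows):
        for c in range(cols):
            if padded[r][c] or seen[r][c]:
                continue
            components += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in NEIGHBOURS_8:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not padded[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return components - 1  # exclude the outer background


def euler_number(region_mask):
    """Euler number of a mask that holds one connected region."""
    return 1 - count_holes(region_mask)


def is_infected(region_mask):
    """A region with an embedded dot has an Euler number below 1."""
    return euler_number(region_mask) < 1


# A 5 x 5 cell with a one-pixel dot punched out of its centre.
cell = [[1] * 5 for _ in range(5)]
cell[2][2] = 0
print(euler_number(cell), is_infected(cell))  # 0 True
```

Because the holes inside the RBCs were filled in the earlier correction step, any hole that remains at this stage should come from a chromatin dot.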

Fig. 6 Image showing the chromatin dots present in dataset 1

4 Experimental Results

The prototype was developed using MATLAB. The algorithm was tested using images sourced from the Centers for Disease Control and Prevention’s malaria image library [1].

To generate the binary mask, Otsu’s method was used to compute the threshold. For the image in Fig. 2(a), Otsu’s formula returned a threshold value of 227. The histogram is shown in Fig. 3(a), and the segmented image in Fig. 3(b). The segmented image shows some small external artifacts in the background region.

On converting a digital negative of the image to a 4-connected region labeled image we get 68 regions, of which 37 have a population less than the threshold value of 270 for this image (0.3 % of the image area of 300 × 300 pixels). These regions are removed from the binary mask; see Fig. 4(a).

The process described was able to remove artifacts from the background region when an external object on the slide resulted in a free standing object in the background region of the binary image. However, artifacts that result in objects attached to RBCs are not removed.

The cell counting process is dependent on the existence of at least one free standing RBC in the FoV. This is expected to exist in images of thin smears. However, in the extreme case when no such RBC exists, the process would fail.

After correcting for imperfections, dataset 1 had 30 4-connected regions. The properties of these regions are shown in Table 2.

The diameter of a free standing cell (identified by a ratio of approximately 1) is taken as the diameter of the RBC. Here, the diameter of a free standing cell is 36 pixels. The region used for measuring the diameter was region 17, which had the minimum ratio (1.02). While measuring the diameter, care should be taken that the target region is located completely within the FoV. Thus, after measuring the diameter, the distance of the centroid of the target region from the edges is measured, to ensure that it is greater than the diameter.

Partially visible objects, whose centroids are within a distance of unit radius of an RBC from the boundary of the image, are represented by regions 1, 6, 12, 25, 29 and 30. Table 3 is an extract from Table 2 and shows the properties of these regions. Of these, region 12 represents two overlapping cells. For the other regions, even though the ratio was high, the minor axis was less than 80 % of the diameter of an RBC, and hence they were not resolved.

Table 3 Calculation table for partially visible cells
Table 4 Free standing cells

Partially visible cells = 7

Twenty regions had a ratio less than the threshold value of 1.3 and were treated as free standing cells. Region 20 had a high ratio, but its minor axis was 77 % of the diameter (below the 80 % threshold), and hence it was also treated as a free standing cell.

Free standing cells = 21

Partially overlapped cells are identified by their ratio being greater than a threshold value. The application currently uses a threshold value of 1.3. For such regions the major axis and the minor axis are divided by the diameter and rounded to the nearest integer to arrive at the number of overlapped cells. This resolution is not done when the minor axis is less than 80 % of the diameter of the RBC. Regions 13, 19 and 24 qualified as overlapped regions and were resolved. Table 5 shows the properties of these regions.

Table 5 Partially overlapped cells
Table 6 Calculation of parasitaemia
$$ \begin{aligned} \text{Number of overlapped RBCs} &= 6 \\ \text{Total cells} &= 7 + 21 + 6 = 34 \end{aligned} $$
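The final calculation follows directly from the definition of parasitaemia as the ratio of infected RBCs to the total number of RBCs. In the sketch below the function name is hypothetical, and the usage line uses made-up counts rather than the values of Table 6, which is not reproduced here. Only the total of 34 cells comes from the counts above.

```python
def parasitaemia_percent(infected, total):
    """Parasitaemia as the percentage of RBCs that are infected."""
    if total <= 0:
        raise ValueError("total number of RBCs must be positive")
    if not 0 <= infected <= total:
        raise ValueError("infected count must lie between 0 and total")
    return 100.0 * infected / total


# Total for dataset 1: partially visible + free standing + overlapped.
dataset1_total = 7 + 21 + 6
print(dataset1_total)  # 34

# Made-up counts, purely to show the call.
print(parasitaemia_percent(5, 200))  # 2.5
```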

The process worked fairly accurately for partially overlapped regions involving two RBCs. When more than three RBCs were involved, however, the degree of success depended on the degree of overlap. Trained technicians are nevertheless expected to produce slides with a minimum of overlapped RBCs.

5 Conclusion

Using the algorithm described, we were able to successfully demonstrate that it is possible to build an automated system to measure parasitaemia in the images of thin blood smears.