Open Access

Breaking Captcha System with Minimal Exertion through Deep Learning: Real-time Risk Assessment on Indian Government Websites

Authors:
Rajat Subhra Bhowmick, IIEST, Shibpur, India (ORCID 0000-0003-4656-1762)
Rahul Indra, IIEST, Shibpur, India (ORCID 0009-0007-7173-9311)
Isha Ganguli, Bennett University, India (ORCID 0000-0002-1849-661X)
Jayanta Paul, IIEST, Shibpur, India (ORCID 0009-0005-0061-5165)
Jaya Sil, IIEST, Shibpur, India (ORCID 0000-0001-6335-4437)

Digital Threats: Research and Practice, Volume 4, Issue 2, Article No. 28, pp. 1–24. https://doi.org/10.1145/3584974

Published: 10 August 2023


Abstract

Captchas are used to prevent computer bots from launching spam attacks and automatically extracting data available on websites. Government websites mostly contain sensitive data related to the citizens and assets of a country, and vulnerabilities in their captcha systems pose a major security challenge. The proposed work focuses on the real-time captcha systems used by the government websites of India and identifies their risk levels. To analyze captcha security effectively, we approach the problem from an attacker’s perspective. For an attacker with limited feature engineering knowledge of text and image processing, building an effective solver from scratch to breach a captcha security system is a challenge. Neural network models are useful for automated feature extraction, and a simple model can be trained with a minimal number of manually annotated real captchas. Along with popular text captchas, government websites of India use text instructions–based captchas. We analyze an effective neural network pipeline for solving text captchas. Text instructions captchas are relatively new, and the work provides novel end-to-end neural network architectures to break different types of text instructions captchas. The proposed models achieve more than 80% accuracy and have a maximum inference time of 1.063 seconds on a desktop GPU. The study also presents an ecosystem and procedure to rate the overall risk of a captcha system used on a website. We observe that, given the importance of the information available on these government websites, the low effort required for an attacker to solve their captcha systems is alarming.

1 INTRODUCTION

Captchas are now the standard first level of security technology used on organizational websites [43, 44]. They are widely deployed to provide security against malicious computer programs and bots, which are created specifically to scrape data from institutional websites. Compromising the captcha system has a severe effect on organizational operations due to the disruption of services on websites. Most of the top commercial sites are well equipped to deal with threats and vulnerabilities. However, government websites, which mostly depend on a third party for maintenance, are not always able to handle the risk. The wealth of information that can be obtained by extracting data from websites is immense. Government websites are used primarily for public services and therefore hold critical information regarding the people and assets of the country. In today’s age of information technology, data are priceless, and data scraping by bots from government websites always poses a significant security challenge. Captcha systems are responsible for combating this threat, and periodic analysis is necessary to evaluate the overall situation. In this article, we focus on the captcha systems used on the central and state government websites of India and find that a few major websites are highly vulnerable. To examine the strength of a real-time security system, it is always advisable to analyze the system from the viewpoint of an attacker. If an attacker can easily evade the security system, then the risk to the system is higher. The severity of the risk is inversely proportional to the effort and resources required by an attacker to break the system. An attacker would prefer to launch the attack that yields the maximum profit. From an attacker’s view, two points are essential to consider before initiating an attack on a captcha system. The first is the effort required to break the captcha system. The second is the gain from compromising the captcha system, which includes information scraping and disruption of critical services of the web server.

Text captchas in image format have been the most widely used captcha technique until now [7]. Other captcha formats, such as object recognition from multiple images and audio recognition captchas, are used on a few commercial websites. A lot of work has been done in recent times on automatically recognizing text captchas [4, 31, 47] in image format using deep learning and traditional machine learning algorithms. Another type of captcha, used by a few important Indian government websites, combines text instructions with an image containing text. The text instructions captcha represents a domain of problems in which text and image must be understood together by a machine learning algorithm to reach a solution. The architectures proposed in this work for breaking text instructions captchas are also useful for similar use cases that deal with image and text simultaneously (e.g., e-commerce products described by images and text). The text instructions are represented in various formats; for example, the instruction text is “Evaluate the expression,” the accompanying image shows “\(2+4\),” and the captcha solution should be “6.” Text instructions captchas pose a new challenge, because both image recognition and understanding of the text instructions are required to solve the captcha. To the best of our knowledge, extracting the captcha answer in text format from such text instructions–based captchas has not been reported yet. The text instructions–based captcha system becomes more complex when the text instructions are embedded in the image. In this work, we propose neural network models for solving text instructions–based captchas, including those in which the instruction text is embedded in the image.

From the attacker’s perspective, extensive rule representation and feature extraction to solve the captcha system is an unattractive proposition, because it involves considerable effort, human perception, and expertise. The other factor, from the attacker’s viewpoint, is the inference time for solving the captcha. Deep neural network models are the best option and have attracted the attention of researchers because of their automatic feature extraction capability. This property of neural networks suits attackers with limited expertise in the field of image and text processing. An end-to-end deep neural network architecture requires minimal or no preprocessing, and features are extracted automatically using the backpropagation algorithm of LeCun (1989) [25]. A neural network architecture with a low inference time increases the effectiveness of an attack. The higher the frequency of the attack, the more effective it will be in damaging the service provided by a web server. The frequency of attack is determined by the number of valid requests made to the server, and a request is valid only after the captcha has been solved. Therefore, a lower inference time contributes to the success of an attacker. Deep architectures, however, typically require a huge dataset and involve high training and inference times. We therefore focus on keeping the architecture simple (not too deep) and the number of parameters as small as possible, to reduce the inference time an attacker needs to solve the captcha systems.
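The link between inference time and attack frequency can be made concrete with a small calculation. The sketch below is illustrative only: it plugs in the figures quoted in the abstract (a maximum inference time of 1.063 seconds and an accuracy of 80%) and assumes a single sequential solver with no network latency or server-side rate limiting, neither of which is stated in the text.

```python
def attempts_per_hour(inference_seconds: float) -> int:
    """Upper bound on captcha attempts per hour for one sequential solver.

    Assumption: inference time is the only cost (no network latency).
    """
    if inference_seconds <= 0:
        raise ValueError("inference time must be positive")
    return int(3600 // inference_seconds)


def expected_valid_requests(inference_seconds: float, accuracy: float) -> float:
    """Expected number of correctly solved captchas (valid requests) per hour."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    return attempts_per_hour(inference_seconds) * accuracy


# Figures taken from the abstract: 1.063 s per captcha, 80% accuracy.
slowest_case = attempts_per_hour(1.063)
valid = expected_valid_requests(1.063, 0.80)
```

Halving the inference time doubles the bound on attempts, which is why the architectures are kept shallow.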

There are three types of captcha found on the government websites of India. The first is the most-used basic image-based text captcha, and the other two are forms of text instructions–based captchas. A detailed analysis is presented in this article for each kind of captcha to show the effort required to break the captcha systems. We have created a real-time dataset from the government websites for all three types of captcha and put forward an hour-based time analysis of the effort needed to break the captcha systems effectively. The smaller the dataset required, the smaller the effort required to break the captcha. The gain, which directly signifies the value of the information available on a website, is an essential factor when attackers choose which websites to target. The sensitivity of the information exposed by a compromised captcha system and the disruption of services together represent the value of information of a site. Accordingly, to obtain a notion of the overall situation on Indian government websites, we analyze the sites and identify the risk. The captcha system of each website is rated based on its robustness and classified into different risk levels. It is important to note that the proposed architectures are tested on Indian government websites, but their applicability is not limited to them. The architectures can be used to breach the captcha security system of any website that uses text instructions–based captchas.

Contributions of the article are summarized as follows:

Central and state government websites with captcha systems are analyzed to assess their vulnerabilities. We consider a few critical websites containing valuable public information and create a real-time, manually annotated dataset to test the robustness of the captcha system on each such website.
We propose novel end-to-end neural network architectures to solve two types of text instructions–based captchas. The architectures are simple and not too deep (to keep inference time low), yet practical and capable of solving text instructions captchas effectively. We also discuss a useful model for solving text captchas.
The study also provides an ecosystem and procedure to rate the overall risk of a captcha system used on a website. The risk is measured by two factors: how easily, that is, with what minimum effort (complexity of the neural network architecture and manual annotation time), the captcha system can be compromised, and the level of gain (value of information obtained through scraping and disruption of critical services) from the attack.
Through extensive experimentation, we assign a vulnerability rating to each of the considered websites and report the minimum amount of training data and the corresponding neural network hyper-parameters required to break its captcha system. In the process, we put forward an extended hour-based time analysis of the effort needed to break the captcha systems effectively.

2 RELATED STUDY

In this section, we review the literature related to work on captcha systems. A large amount of research has been reported on automatically recognizing text captchas presented in image format, using traditional machine learning and deep learning algorithms [7]. In 2003, Mori and Malik worked on breaking image-based text captchas, namely the EZ-Gimpy and Gimpy captchas used by Yahoo [32]. They provided rules by which a series of quick tests is performed to hypothesize the locations of letters in the image. Next, strings of these hypothesized letters are extracted and, finally, the most likely words are chosen from the strings as the captcha text.

In 2005, Chellapilla et al. [5, 6] worked with human interaction proofs (HIPs) and compared the abilities of humans and computers in recognizing single characters. Their results show that computers are as good as or better than humans at single-character recognition. However, they assume that characters are segmented successfully and that the approximate locations of individual HIP characters are known. In 2006, Yan and El Ahmad [47] identified fatal flaws in the design of the image-based text captcha used by captchaservice.org. Their approach achieves a high success rate by simply counting the number of pixels of each segmented character. However, the approach later failed when applied to more advanced captcha systems.

In 2014, Bursztein et al. [3] used a reinforcement learning approach to segment a captcha using human feedback. They use four major components: a Cut-point Detector to determine potential ways to segment, a Slicer to obtain properly segmented characters, a Scorer for OCR, and an Arbiter for character prediction. They came up with a single pipeline that addresses the segmentation and recognition problems simultaneously using machine learning. Their method removes the need for any hand-crafted component, making the approach generic to new captcha schemes. The approach achieves accuracy ranging from 5.33% to 55.22% on the captcha systems used by Baidu, eBay, ReCaptcha, Wikipedia, and Yahoo. In 2015, Karthik et al. [23] used a Convolution Neural Network (CNN) for OCR on a real-time Microsoft captcha system. However, the captchas were segmented manually, and a success rate of 57.05% was achieved. Similarly to text captchas, Xu et al. [46] presented an analysis of the usability of motion-based text captchas. The authors focus on moving-text object recognition and, in the process, designed and implemented automated attacks on the captcha. They reported that their GPU-based implementation can decode the moving captchas faster than humans. Later, Gao et al. [12] worked with motion-based captchas to discover their weaknesses. They proposed that, as the camera projection of two-dimensional (2D) objects is constant (unlike 3D objects), it is possible to reconstruct the underlying text by superimposing and aggregating the parts of the object. Their algorithm is able to recognize moving-text captchas with an accuracy of up to 89.2%.

In 2016, Gao et al. used Gabor filters and a k-Nearest Neighbours engine to predict the characters in text captchas [11]. The algorithm was tested for robustness and applied to 10 captcha schemes, including reCAPTCHA, Yahoo, Baidu, Wikipedia, and others. The success rate they achieved is between 5.0% and 77.2%, with an average accuracy of only 34.68%. Additionally, the work reported that the inference time for solving the captchas is less than 15 seconds on a standard desktop computer. Also in 2016, Garg and Pollett [13] suggested a deep neural network model using a CNN and a Recurrent Neural Network (RNN) to predict the characters in text captchas. Their work presents another model that uses a CNN with multiple softmax outputs. The model they proposed was the first that required no preprocessing. However, they used a large synthetic dataset to train the proposed models.

In 2017, Gao and colleagues introduced an algorithm to break Microsoft’s two-layer captcha for the first time [10]. They suggested a simple and effective segmentation method for the two-layer captcha and used a CNN to predict each character. They obtained a success rate of 44.6% and reported an attack speed of 9.05 seconds on a standard desktop computer. In 2018, Ye et al. [48] presented a significant study that uses a GAN to train a base classifier. The GAN architecture is used for learning the text captcha distribution from many text captcha schemes, and a fine-tuned classifier is trained on top of the base classifier. They reported an inference time of 0.05 seconds on a standard desktop GPU while evaluating the success rate on 12 primary captcha schemes used by Sohu, eBay, Wikipedia, Microsoft, Google, and others. The accuracy range reported in the work is 0–83% for the base classifier and 3–92% for the fine-tuned classifier. However, they do not use an RNN decoder to predict the character labels; instead, they use a fully connected (FC) layer. The limitation of using an FC layer for character prediction is that it needs to be adapted for variable-length captcha systems. The base solver was trained using 200,000 synthetic captchas. Creating synthetic captchas requires extensive effort, as they incorporate various security features, so developing such a robust base classifier would pose an additional challenge for an attacker. After training the base classifier, the fine-tuned classifier needs to be trained for each captcha system using the transfer learning [33] approach. However, it is worthwhile to note that the convolution, pooling, and FC layers, the number of which they set experimentally, have to be retrained from the beginning in the fine-tuned classifier. For an effective model, the authors retrained up to four convolution layers (of a total of five) along with an FC layer for different captcha schemes, which makes the process cumbersome and the solver model architecture complex. Their study includes only image-based text captchas, and the base classifier requires extensive effort to train. Tang et al. [41] conducted a similarly extensive study, combining two segmentation modules with CNN-based character recognition. Like Ye et al. [48], the authors of Reference [41] discussed resistance mechanisms in text captchas. They obtained attack success rates between 10.1% and 90%, with an average inference time of 0.45 seconds.

From our study, most of the work on captchas until now addresses text captchas [7]. Text instructions captchas, however, are new and demand a more complex architecture. From the studies, it is clear that neural network models have faster inference than traditional techniques [10, 41], while requiring limited overhead and expertise. Faster inference is preferable from the viewpoint of an attacker. In this work, we propose simple but effective neural network pipelines to solve the text instructions–based captchas used on Indian government websites. We also discuss the effectiveness of a simple pipeline in breaking text captchas.

3 BACKGROUND

In this section, we discuss various types of captcha systems that are currently being used in the government websites of India. Details are provided for seven websites that are more volatile depending on the importance and type of information they handle. In this article, we consider these websites for analysis, because scraping of information from such websites will be disastrous. The basic building blocks of neural network architecture used for captcha solver models are also discussed.

3.1 Type of Captchas

The first type of captcha (Type I) is the image-based text captcha, the most commonly used captcha system worldwide as well as on most Indian government websites. Captchas of this type collected from the websites are shown in Figure 1. A Type II captcha includes an instruction given separately in text format, along with a character sequence in image format that is evaluated according to the instruction. The character sequence depicted in the image is referred to as the expression text. This type of captcha is hard to crack, as both the text and the image must be understood and evaluated to obtain the captcha answer. Figure 2 depicts examples of Type II captchas. A Type III captcha is similar to Type II, except that the text instructions are embedded in the image along with the character sequence expression text, as shown in Figure 3. Type III captchas are more complex to solve, as the character sequence depicted in the image includes the instruction text and the expression text together. The instruction text and expression text must first be separated before the captcha answer can be evaluated.

Fig. 1. Type I: Image format text captchas.

Fig. 2. Type II: Text instructions and image format text captchas.

Fig. 3. Type III: Text instructions incorporated along with the image containing text captchas.
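The three captcha types described in Section 3.1 can be summarised as annotation records. The sketch below is a hypothetical data layout, not the dataset format used in the article: the class `CaptchaSample`, its field names, the sample strings, and the restriction to two-operand arithmetic in `evaluate_expression` are all assumptions made for illustration. It shows which pieces of information an annotator has for each type and what answer a solver is expected to produce.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaptchaSample:
    captcha_type: str            # "I", "II" or "III"
    image_text: str              # characters rendered inside the image
    instruction: Optional[str]   # separate text instruction (Type II only)
    answer: str                  # string the website expects


def evaluate_expression(expression: str) -> str:
    """Evaluate a two-operand expression such as '6+8=' (assumed format)."""
    match = re.fullmatch(r"\s*(\d+)\s*([+\-*])\s*(\d+)\s*=?\s*", expression)
    if match is None:
        raise ValueError(f"unsupported expression: {expression!r}")
    left, op, right = int(match.group(1)), match.group(2), int(match.group(3))
    result = {"+": left + right, "-": left - right, "*": left * right}[op]
    return str(result)


# Type I: the answer is simply the rendered character sequence.
type_i = CaptchaSample("I", "x7Kp2", None, "x7Kp2")

# Type II: instruction arrives as text, expression arrives as an image.
type_ii = CaptchaSample("II", "6+8=", "Evaluate the expression",
                        evaluate_expression("6+8="))

# Type III: instruction and expression are both rendered in the image.
type_iii = CaptchaSample("III", "Evaluate the expression 2+4", None,
                         evaluate_expression("2+4"))
```

The rule-based `evaluate_expression` plays the role of the human annotator here; the proposed models have to learn the same mapping from labelled examples.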

3.2 Major Websites

In the work, we consider seven Indian government websites based on the importance of information, likely to be extracted from the websites, during an attack. The importance or value of information in context with a website is dependent on two factors. First is the significance of the information that can be extracted using web scraping. The second factor is the real-time impact due to the stoppage of service provided by the particular website. From the attacker’s perspective, the maximum gain is proportional to the high value of information scraped from the website. In Appendix A, we consider the websites having high information value and provide details of the captcha systems along with the corresponding information type. The other government websites use captcha systems, too, but they have low information value, as mentioned in Appendix B. It is important to note that none of the websites mentioned in Appendix B has a different captcha type other than the types mentioned in Section 3.1.

3.3 Deep Network Background

Over the past few years, deep learning techniques have been a major development step in the field of machine learning [15, 24]. Breakthrough improvements have been observed in many applications associated with image recognition [16, 37], image captioning [22, 42], language translation [45], and natural language processing [49]. Neural networks have proved productive in creating and solving captcha systems [13, 27, 41, 48]. Deep learning models also adapt well: when a captcha system is updated, the model can be retrained without significant change in the architecture. We use an encoder-decoder architecture inspired by sequence-to-sequence models in the field of language modeling and translation [28, 39]. Both RNNs and CNNs are used as components of the model architecture for captcha solvers.

Traditional RNNs are susceptible to the vanishing gradient problem. The two most widely used variants that address this problem are Long Short Term Memory (LSTM) [19] and the Gated Recurrent Unit (GRU) [8]. We use an LSTM as the decoder unit and a bi-directional LSTM [14] as the sequence encoding unit; these are the basic components of the architectures. For future reference and explanation, we represent the LSTM by the function “\(\Gamma\)” applied to a given sequence of vectors “\(\vec{\theta }\),” as described in Equation (1):

\begin{equation} \vec{Y}= \Gamma ^{LSTM}(\vec{\theta }). \tag{1} \end{equation}
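Equation (1) treats the LSTM as a black-box function from an input sequence to an output sequence. For readers unfamiliar with what happens inside, the following sketch implements the standard LSTM gating equations in plain Python. It is not the authors' implementation: the helper names (`init_params`, `lstm_step`, `run_lstm`), the random placeholder weights, and the tiny dimensions are assumptions chosen only to keep the example self-contained.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def init_params(input_size: int, hidden_size: int, seed: int = 0):
    """Random placeholder weights for the four LSTM gates (untrained)."""
    rng = random.Random(seed)

    def gate():
        weights = [[rng.uniform(-0.1, 0.1)
                    for _ in range(input_size + hidden_size)]
                   for _ in range(hidden_size)]
        bias = [0.0] * hidden_size
        return weights, bias

    return {name: gate() for name in ("input", "forget", "output", "candidate")}


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step for a single sample (vectors are lists of floats)."""
    z = x + h_prev  # concatenation [x_t ; h_{t-1}]

    def affine(name):
        weights, bias = params[name]
        return [sum(w * v for w, v in zip(row, z)) + b
                for row, b in zip(weights, bias)]

    i = [sigmoid(v) for v in affine("input")]        # input gate
    f = [sigmoid(v) for v in affine("forget")]       # forget gate
    o = [sigmoid(v) for v in affine("output")]       # output gate
    g = [math.tanh(v) for v in affine("candidate")]  # candidate cell state
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def run_lstm(sequence, hidden_size, params):
    """Apply the cell over a sequence, playing the role of Gamma in Eq. (1)."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    outputs = []
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
        outputs.append(h)
    return outputs
```

The additive update of the cell state `c` is what lets gradients flow across many time steps, which is the property the text refers to when it says LSTMs address the vanishing gradient problem.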

For the encoder-decoder model with text sequences, we also use the attention mechanism for decoder prediction [2]. It primarily helps the model retain information over long sequences in neural sequence prediction. In addition to the hidden state of the encoder, the decoder receives a weighted vector of all the encoder output states. We use the attention mechanism proposed in Reference [29], which is an improvement on the initial attention mechanism [2].
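The "weighted vector of all the encoder output states" can be illustrated with a few lines of code. The sketch below assumes a plain dot-product scoring function; the text does not state which scoring variant is used, so this choice, along with the function names, is an assumption made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(decoder_state, encoder_outputs):
    """Return (weights, context) for one decoder step.

    Scores are dot products between the decoder state and each encoder
    output (an assumed scoring function); the context is the weighted
    sum of the encoder outputs.
    """
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_outputs))
               for k in range(dim)]
    return weights, context


# The decoder state is aligned with the first encoder output,
# so that position receives most of the attention weight.
weights, context = attention_context([5.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The context vector is recomputed at every decoding step, so the decoder can focus on different input positions for different output tokens.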

A CNN is used as the image feature extractor for the captcha image. It is represented by the function “\(\Lambda\)” in Equation (2), where “\(\vec{\phi }\)” is the input set of 2D vectors:

\begin{equation} \vec{Y}= \Lambda ^{CNN}(\vec{\phi }). \tag{2} \end{equation}

We use the Exponential Linear Unit (ELU) as the activation function in the CNN [9] to alleviate the vanishing gradient problem, which improves the learning characteristics and speeds up the learning process. ELU performs slightly better than rectified linear units (ReLUs), leaky ReLUs, and parametrized ReLUs [34].
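For reference, the ELU activation mentioned above is defined as \(x\) for \(x > 0\) and \(\alpha (e^{x} - 1)\) otherwise. The snippet below is a minimal scalar sketch of that definition next to ReLU; the default \(\alpha = 1\) is the commonly used value and is an assumption here, since the text does not state the value used.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """Exponential Linear Unit: identity for x > 0, alpha*(exp(x)-1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def relu(x: float) -> float:
    """Rectified Linear Unit, shown for comparison."""
    return max(0.0, x)


# Unlike ReLU, ELU keeps a non-zero output (and gradient) for negative inputs.
negative_inputs = [-3.0, -1.0, -0.1]
elu_values = [elu(x) for x in negative_inputs]
relu_values = [relu(x) for x in negative_inputs]
```

Because the negative branch saturates smoothly towards \(-\alpha\) instead of being clipped to zero, units do not go completely silent for negative pre-activations.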

FC networks are used for label prediction and are represented by the function “\(\Delta\)” in Equation (3), using linear vector “\(\vec{\psi }\)” [21]:

\begin{equation} \vec{Z}= \Delta ^{FC}(\vec{\psi }). \tag{3} \end{equation}

4 CAPTCHA SOLVER MODEL ARCHITECTURE

In this section, we discuss and propose the architectures used for cracking the different types of captchas. First, we discuss a modified version of the deep learning pipeline of Reference [13] for solving Type I text captchas. Second, we propose deep learning pipelines for solving the text instructions–based Type II and Type III captchas. The details of the corresponding captcha schemes, along with the flow of the deep learning pipelines, are described below.

4.1 Type I Captcha Model

The Type I captcha is the most used form of captcha and contains an image displaying a sequence of characters. The character sequence in the image needs to be recognized to solve the captcha. The general solver architecture for the Type I captcha is shown in Figure 4 and is referred to as Architecture I. The input to the network is the raw image, and the output labels are the corresponding list of characters present in the image. The model has a basic encoder-decoder structure similar to the one proposed earlier in Reference [13]. The image is passed through the convolution unit, which consists of three basic layers: a convolution layer and a max-pooling layer, followed by batch normalization. Batch normalization helps the process in two ways. First, it allows higher learning rates, which significantly reduces the training time of the model [20]. Second, it is used as an alternative to dropout and thus functions as a regularizer. The “ELU” activation function is used in the convolution unit. A color-channel image input is passed through a set of convolution units that act as an encoded feature extractor for Type I captchas. On the decoder side, an LSTM is used to predict the output label. In the proposed modified version, the size of the LSTM decoder unit is kept variable so that the architecture can be used with different Type I captcha schemes. The length of the LSTM varies according to the captcha type and image size used in that scheme. The decoder predicts a single character at a time, as shown in Figure 4, and returns an \(\lt EOS\gt\) tag when the sequence ends. Equation (4) describes the image encoder and label decoder process for input image vector \(\vec{\phi }\) and output label vector \(\vec{Z}\). The two major aspects we focus on experimentally are the number of convolution units and the variable vs. static size of the LSTM decoder.

Fig. 4. Type I captcha solver: Architecture I.
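The character-by-character decoding with an \(\lt EOS\gt\) stop tag can be sketched as a greedy loop. This is an illustration under assumptions, not the article's code: the start token `<SOS>`, the `step` interface, and the scripted stand-in decoder `make_scripted_step` are hypothetical, and a trained LSTM decoder would replace the stand-in.

```python
EOS = "<EOS>"


def greedy_decode(step, state, max_len=10, start="<SOS>"):
    """Emit one character per step until <EOS> or max_len is reached.

    `step(prev_token, state)` must return (token -> probability dict, new_state).
    """
    token, output = start, []
    for _ in range(max_len):
        probs, state = step(token, state)
        token = max(probs, key=probs.get)  # greedy choice: most probable token
        if token == EOS:
            break
        output.append(token)
    return "".join(output)


def make_scripted_step(text):
    """Stand-in for a trained decoder: always favours the next character of `text`."""
    vocab = sorted(set(text)) + [EOS]

    def step(prev_token, position):
        target = text[position] if position < len(text) else EOS
        rest = 0.1 / (len(vocab) - 1)
        probs = {tok: (0.9 if tok == target else rest) for tok in vocab}
        return probs, position + 1

    return step


decoded = greedy_decode(make_scripted_step("x7Kp2"), 0)
```

The `max_len` cap guards against a decoder that never emits the stop tag, which matters for captcha schemes with variable-length answers.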

Generally, an FC layer is used to reduce the size of the encoded vector obtained from the convolution unit, which is then used to initialize the LSTM decoder unit. The FC layer helps in keeping the size of the LSTM decoder static for every Type I captcha scheme. However, a static-size LSTM decoder unit requires a larger amount of manually labeled data than a variable (dynamic) length decoder. Experimental results are provided in Section 6 to demonstrate the effect of static and dynamic decoder length in relation to the number of captcha data points. An LSTM decoder sized directly from the flattened vector tends to perform better. We use three convolution layers, which effectively improves performance over the two-layer convolution network used by Garg and Pollett [13]:

\begin{equation} \vec{Z}_{Label}= \Delta ^{FC}\left(\Gamma ^{LSTM}_{Decoder}\left(\Lambda ^{CNN}_{Captcha Image}(\vec{\phi })\right)\right). \tag{4} \end{equation}
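Why the decoder size becomes scheme-dependent when the FC reduction layer is dropped can be seen from a simple shape calculation. The sketch below assumes convolutions that preserve spatial size ("same" padding), 2×2 max-pooling in each convolution unit, and 64 output channels; none of these hyper-parameters are given in this part of the text, so they are placeholders.

```python
def conv_unit_output(height: int, width: int, pool: int = 2):
    """Spatial size after one convolution unit.

    Assumes a size-preserving convolution followed by pool x pool max-pooling.
    """
    return height // pool, width // pool


def flattened_length(height: int, width: int, num_units: int = 3,
                     channels: int = 64, pool: int = 2) -> int:
    """Length of the flattened feature vector after `num_units` convolution units."""
    for _ in range(num_units):
        height, width = conv_unit_output(height, width, pool)
    return height * width * channels


# Two hypothetical captcha schemes with different image sizes.
small_scheme = flattened_length(48, 160)
large_scheme = flattened_length(64, 200)
```

With the dynamic variant the decoder is sized from values like these, whereas a static decoder would first project every scheme to one fixed length through the FC layer.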

4.2 Type II Captcha Model

A Type II captcha consists of text instructions describing the task to perform on an expression in order to solve the captcha. For example, the task in the text instructions is “Evaluate the expression,” while the expression present in the image is “\(6+8=.\)” To solve the captcha, the network has to predict the answer, “14.” We propose two neural network pipelines for solving this type of captcha. The first architecture has two encoders and one decoder, as shown in Figure 5, and is referred to as Architecture IIA. In Architecture IIA, one encoder is implemented using a Convolution Unit and is used for feature extraction from the image, while the other is used for encoding the task. The convolution encoder used here for image encoding is similar to the encoder used in Architecture I. The task encoding is performed by a stacked bi-directional LSTM, where the encoded vector is formed by concatenating the last hidden states of the forward and backward LSTMs, as shown in the Task Encoder Unit of Figure 5. The feature vector, which contains the final encoded information, is obtained by concatenating the encoded vectors of the Task Encoder Unit and the Convolution Unit. The feature vector is given by Equation (5) for input image vector \(\vec{\phi }\) and text word sequence embedding vector \(\vec{\theta }\), where \(\oplus\) denotes concatenation. The LSTM decoder unit is responsible for the reasoning and evaluation over the encoded information that is required to produce the answer label for the Type II captcha. The captcha answer label sequence vector \(\vec{Z}\) is obtained from the feature vector \(\vec{T}\), as shown in Equation (6). The backpropagation algorithm [18] is applied to the error in predicting the captcha answer label:

\begin{equation} \vec{T}= \Lambda ^{CNN}_{Captcha Image}(\vec{\phi }) \oplus \Gamma ^{LSTM}_{Encoder}(\vec{\theta }), \tag{5} \end{equation}

\begin{equation} \vec{Z}_{Label}= \Delta ^{FC}\left(\Gamma ^{LSTM}_{Decoder}(\vec{T})\right). \tag{6} \end{equation}

Fig. 5. Type II captcha solver (2-Encoder 1-Decoder Model): Architecture IIA.
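The two concatenations in Architecture IIA (forward and backward hidden states inside the Task Encoder Unit, then image features with the task encoding as in Equation (5)) amount to joining vectors end to end. The sketch below uses plain lists with made-up numbers; the function names and dimensions are assumptions for illustration, and real encoders would supply the inputs.

```python
from typing import List

Vector = List[float]


def encode_task(forward_states: List[Vector], backward_states: List[Vector]) -> Vector:
    """Concatenate the last hidden state of the forward and backward LSTMs.

    Both lists are given in processing order, so the final element of each
    is the state after that direction has read the whole instruction.
    """
    return forward_states[-1] + backward_states[-1]


def feature_vector(image_features: Vector, task_vector: Vector) -> Vector:
    """Eq. (5): image encoding concatenated with the task encoding."""
    return image_features + task_vector


# Hypothetical values: 3 image features, hidden size 2 per LSTM direction.
task = encode_task([[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]])
combined = feature_vector([1.0, 2.0, 3.0], task)
```

The resulting vector is what initialises the label decoder, so its length is the sum of the image feature length and twice the task encoder's hidden size.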

Architecture IIA performs reasonably well when the number of different task types in the text instructions is small. The other architecture proposed in this article for solving Type II captchas is shown in Figure 6 and is referred to as Architecture IIB. Two types of label data are available for a Type II captcha. One is the captcha answer label (“14”), evaluated based on the task, which is also used to train Architecture IIA. The other is the character sequence expression label depicted in the image (“\(6+8=\)”), which is not used in Architecture IIA. The character sequence expression is predicted as an output from the raw image using a CNN and an LSTM decoder unit. Architecture IIB has two encoder networks that perform image encoding with the Convolution Unit and task encoding with the Task Encoder Unit, as shown in Figure 6. It has an additional LSTM decoder unit, called the Image to Expression Decoder Unit, that predicts the character sequence expression labels from the image. Equation (7) gives the predicted character sequence expression labels for the 2D image vector \(\vec{\phi }\). The character sequence expression labels are then encoded using a bi-directional LSTM encoder unit, as shown in the Expression Encoder Unit of Figure 6. The final feature vector is obtained by concatenating the encoded vectors of the Task Encoder Unit and the Expression Encoder Unit. Finally, the feature vector is passed through the Label Decoder Unit to obtain the Type II captcha answer label. Equation (8) represents the final feature vector \(\vec{T}\), and Equation (9) gives the captcha answer labels \(\vec{Z}_{Label}\) for the network model:

\begin{equation} \vec{Y} = \Delta ^{FC}\left(\Gamma ^{LSTM}_{Decoder}\left(\Lambda ^{CNN}_{Captcha Image}(\vec{\phi })\right)\right), \tag{7} \end{equation}

\begin{equation} \vec{T}= \Gamma ^{LSTM}_{Encoder}(\vec{\theta }) \oplus \Gamma ^{LSTM}_{Encoder}(\vec{Y}), \tag{8} \end{equation}

\begin{equation} \vec{Z}_{Label}= \Delta ^{FC}\left(\Gamma ^{LSTM}_{Decoder}(\vec{T})\right). \tag{9} \end{equation}

Fig. 6. Type II captcha solver (3-Encoder 2-Decoder Model): Architecture IIB.

In Architecture IIB, the Convolution Unit and the Image to Expression Decoder Unit together work as an independent optical character recognition unit for a specific captcha scheme. Therefore, both units can be trained simultaneously using the character sequence expression labels from the image. The predicted character sequence expression labels are then encoded by the Expression Encoder Unit, whose output is used to predict the captcha answer. In Architecture IIB, learning in the complete network takes place with two types of information (captcha answer labels and expression sequence labels), whereas only one type of information (captcha answer labels) is used in Architecture IIA. Hence, Architecture IIB performs more effectively with a small number of captchas. The effectiveness of both architectures is shown and discussed in Section 6. In Architecture IIB, the first backpropagation (represented as “Backpropagation 1” in Figure 6) is performed on the sequence loss between the computed character sequence expression label and the original character sequence expression label. The second backpropagation (“Backpropagation 2”) is applied to the loss between the original captcha answer label and the computed captcha answer label.
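The two training signals of Architecture IIB can be illustrated by computing both sequence losses for one sample. The sketch assumes a per-step cross-entropy (mean negative log-likelihood) as the sequence loss, which the text does not specify, and uses made-up decoder probabilities; it computes the losses only and does not perform the gradient updates.

```python
import math


def sequence_cross_entropy(step_probs, target_tokens):
    """Mean negative log-likelihood of the target tokens.

    `step_probs[t]` maps each candidate token to its predicted probability
    at decoding step t.
    """
    total = 0.0
    for probs, target in zip(step_probs, target_tokens):
        total -= math.log(probs[target])
    return total / len(target_tokens)


# Hypothetical decoder outputs for the running example "6+8=" -> "14".
expression_probs = [{"6": 0.8, "8": 0.2}, {"+": 0.9, "-": 0.1},
                    {"8": 0.7, "6": 0.3}, {"=": 1.0}]
answer_probs = [{"1": 0.6, "7": 0.4}, {"4": 0.5, "9": 0.5}]

# Loss driving "Backpropagation 1": image -> expression characters.
expression_loss = sequence_cross_entropy(expression_probs, list("6+8="))
# Loss driving "Backpropagation 2": (task, expression) -> answer characters.
answer_loss = sequence_cross_entropy(answer_probs, list("14"))
```

Because the expression loss supervises the OCR part directly, the network receives a learning signal for the image encoder even when the answer decoder is still poor, which is consistent with the claim that Architecture IIB needs fewer annotated captchas.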

4.3 Type III Captcha Model

The Type III captcha is the most complicated captcha among the three. Like the Type I captcha, it has only one input, that is, the raw image, and similarly to the Type II captcha, it includes a task instruction to perform on the expression. Both the task instruction and the character sequence expression are embedded in the image. So we split the task into three sub-tasks as follows: (i) image-to-character label prediction, (ii) character sequence–to–possible independent word prediction, and (iii) the final independent word–to–captcha answer label prediction. The proposed network architecture pipeline is shown in Figure 7 and referred to as Architecture III. The first sub-task is to decode the characters from the image, similarly to Architecture IIB. Equation (10) evaluates the predicted character sequence label \(\vec{X}\) using the image vector \(\vec{\phi }\). The second sub-task is to use the attention-based encoder-decoder unit to predict independent words from the predicted character sequence of the input image, using the Character Encoder Unit and the Character to Independent Word Decoder Unit in Figure 7. The attention-based encoder-decoder unit is important for separating the words representing the task instruction from the characters representing the expression. The attention module helps to predict each word based on the character occurrence in the sequence. Equation (11) evaluates the independent word sequence \(\vec{Y}\) given the character sequence \(\vec{X}\) as input. The independent word sequence \(\vec{Y}\) is then passed through the last encoder-decoder module of Figure 7 to produce the captcha answer label, as given in Equation (12):

(10) \(\begin{equation} \vec{X}= \Delta ^{FC}\left(\Gamma ^{LSTM}_{Decoder}\left(\Lambda ^{CNN}_{Captcha Image}(\vec{\phi })\right)\right), \end{equation}\)

(11) \(\begin{equation} \vec{Y}= \Delta ^{FC}\left(\Gamma ^{LSTM}_{Decoder}\left(\Gamma ^{LSTM}_{Encoder}(\vec{X})\right)\right), \end{equation}\)

(12) \(\begin{equation} \vec{Z}_{Label}= \Delta ^{FC}\left(\Gamma ^{LSTM}_{Decoder}\left(\Gamma ^{LSTM}_{Encoder}(\vec{Y})\right)\right). \end{equation}\)

Fig. 7. Type III captcha solver: Architecture III.

The use of three encoder-decoder units allows the model to learn at three levels using backpropagation. The first backpropagation (represented as “Backpropagation 1” in Figure 7) is based on the loss between the predicted character sequence label and the original character sequence label present in the image. The second backpropagation (“Backpropagation 2”) is performed on the difference between the independent word sequence predicted from the characters and the real independent word sequence. The loss between the computed captcha answer label and the original captcha answer label is used for the third backpropagation (“Backpropagation 3”). Learning at all three levels is possible using a single manually annotated captcha. Inference requires only the raw 2D captcha image.
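To make the "three levels from one annotation" idea concrete, the sketch below shows how a single annotated Type III sample could carry one target per encoder-decoder unit. The field names, the file name, and the example values are hypothetical; the actual annotation format and tokenization used for the IRCTC dataset are not given in the text.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TypeIIISample:
    """One manually annotated Type III captcha (hypothetical layout)."""
    image_path: str            # raw captcha image, the only input at inference
    char_sequence: str         # every character visible in the image
    word_sequence: List[str]   # characters grouped into independent words
    answer: str                # final captcha answer label

def stage_targets(sample: TypeIIISample) -> Dict[str, List[str]]:
    """Return the training target used at each of the three levels."""
    return {
        "backpropagation_1": list(sample.char_sequence),  # image -> characters
        "backpropagation_2": list(sample.word_sequence),  # characters -> words
        "backpropagation_3": list(sample.answer),         # words -> answer
    }

# Illustrative values only; not taken from the real dataset.
sample = TypeIIISample(
    image_path="captcha_0001.png",
    char_sequence="add6+8=",
    word_sequence=["add", "6", "+", "8", "="],
    answer="14",
)
targets = stage_targets(sample)
```

Under this layout, each annotation supplies three supervised targets, which is one way to read the statement that all three levels can learn from a single annotated captcha.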

5 EXPERIMENTAL SETUP

In this section, we describe the parameters used during experimentation for the network architectures as proposed in this article. The details about the network and captcha system are discussed along with the dataset and execution environment.

5.1 Network Parameters

The aim of experimentation is to keep the number of network parameters or the network depth as low as possible. Fewer network parameters reduce inference and training time on low-end resources. Ye et al. [48] mentioned that the number of manually labeled captchas required is also significantly reduced with minimum network parameters. For each of the proposed architectures, the network parameters are explained below in detail.

The Architecture I captcha model solver consists of one convolution unit and one decoder unit. There are many well-known models for image recognition, ranging from basic models like LeNet [26] to advanced architectures like ResNet [17], VGG [37], and Inception [40]. The advanced models are difficult to train because their large number of trainable parameters requires a sufficient number of manually labeled captchas. An extension of the LeNet architecture is considered in this article for feature extraction from the captcha image. Owing to its simplicity, the network requires the least training data and performs the fastest prediction. Each convolution unit consists of three convolution layers, with batch normalization and an ELU activation function after each layer. A window size of (3 * 3) with strides of (1, 1), (2, 2), and (3, 3) is used in the three convolution layers, respectively. To obtain faster inference, the filter size has been adjusted according to the performance of the model for each individual website. The number of layers is kept constant to have a generalized architecture that works efficiently for all the websites using Type I captcha. The decoder consists of an LSTM unit that recursively produces the captcha text label. After the convolution unit, a dynamic size LSTM decoder is used for every Type I captcha instead of an FC layer (which is normally used to fix the LSTM size). The FC layer leads to a high convergence time and a less accurate model. Figure 8 provides accuracy vs. epoch graphs depicting this effect. It is important to note that the dynamic size decoder LSTM does not change the architecture pipeline, as the size is mapped automatically according to the size of the captcha image. Table 1 provides details about the maximum length of the predicted sequence label for the Indian government websites that use the Type I captcha system.
Table 1 also presents one sample captcha for each website and the corresponding label size, which represents the number of classes or unique characters used in the captcha scheme.
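As a rough illustration of how the feature-map size follows from the captcha image size, the sketch below applies the standard convolution output-size formulas to the three strides stated above. The padding mode and the example image size are assumptions (neither is given in the text), and the article does not say how the decoder length is derived from the feature map, so this shows only the arithmetic.

```python
import math

def conv_output_size(size, kernel, stride, padding="same"):
    """Spatial output size of one convolution along one axis."""
    if padding == "same":
        return math.ceil(size / stride)
    if padding == "valid":
        return (size - kernel) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")

def feature_map_shape(height, width, strides=(1, 2, 3), kernel=3, padding="same"):
    """Shape after three 3x3 convolution layers with the given strides."""
    for stride in strides:
        height = conv_output_size(height, kernel, stride, padding)
        width = conv_output_size(width, kernel, stride, padding)
    return height, width

# Hypothetical 60 x 180 captcha image.
shape_same = feature_map_shape(60, 180, padding="same")
shape_valid = feature_map_shape(60, 180, padding="valid")
```

A wider captcha image gives a wider feature map, which is consistent with the statement that the decoder size is mapped automatically from the image size rather than fixed by an FC layer.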

Fig. 8. Comparison of static vs. dynamic size LSTM on the Bengal Bhumi (A.3) dataset based on the website (a) Training Accuracy vs. Epoch, (b) Training Loss vs. Epoch, (c) Development Accuracy vs. Epoch, and (d) Test Accuracy vs. Epoch.

Table 1.

Table 1. Websites for Type I Captcha with Label Detail

In this article, two architectures are discussed for the Type II captcha solver model. The model in Figure 5 has a bidirectional LSTM of 256 units for task instruction word encoding. GloVe [35] 100-dimensional word embedding vectors are used to represent the corresponding words in the instruction. The three convolution layers have windows of size (3 * 3), (5 * 5), and (3 * 3), with 64, 32, and 16 units, respectively. The strides used in the convolution layers are (1, 1), (2, 2), and (2, 2), with max-pooling of window size (3, 3). This type of captcha is used on the Indian Post website (A.6). The flattened vector obtained from the convolution unit is combined with the output of the bi-directional LSTM. The concatenated vector is used to predict a dynamic size sequence label that gives the final captcha answer. The second model for the Type II captcha, described in Figure 6, has the same convolution unit configuration in all three layers. The expression character sequence is obtained from the decoder. The character sequence is then encoded using a bi-directional LSTM of 1,024 units. The size of the bi-directional task encoder unit is kept at 64 on the basis of the experimental results. The final architecture, shown in Figure 7, is used for solving the Type III captcha deployed on the IRCTC website (A.7). Three convolution layers with a window of size (3 * 3) are used, with strides of (1, 1), (2, 2), and (1, 1) and 64, 32, and 8 units, respectively, in the three successive layers. The sizes of the bi-directional LSTM character encoder unit and the character to independent word decoder LSTM unit are set to 256 units. The third encoder-decoder unit also has a size of 256 units.
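For readers who want the values above in one place, the following sketch collects them into a plain configuration dictionary. The key names are hypothetical, and reusing the Architecture IIA convolution settings for Architecture IIB is only one reading of "the same convolution unit configuration"; nothing here is taken from the authors' code.

```python
# Convolution settings stated for the Type II model in Figure 5.
CONV_TYPE_II = [
    {"window": (3, 3), "units": 64, "stride": (1, 1)},
    {"window": (5, 5), "units": 32, "stride": (2, 2)},
    {"window": (3, 3), "units": 16, "stride": (2, 2)},
]

ARCHITECTURE_PARAMS = {
    "IIA": {
        "word_embedding": {"name": "GloVe", "dim": 100},
        "task_encoder_bilstm_units": 256,
        "conv_layers": CONV_TYPE_II,
        "max_pool_window": (3, 3),
    },
    "IIB": {
        # Assumption: same convolution settings as Architecture IIA.
        "conv_layers": CONV_TYPE_II,
        "expression_encoder_bilstm_units": 1024,
        "task_encoder_bilstm_units": 64,
    },
    "III": {
        "conv_layers": [
            {"window": (3, 3), "units": 64, "stride": (1, 1)},
            {"window": (3, 3), "units": 32, "stride": (2, 2)},
            {"window": (3, 3), "units": 8, "stride": (1, 1)},
        ],
        "char_encoder_bilstm_units": 256,
        "char_to_word_decoder_lstm_units": 256,
        "answer_encoder_decoder_units": 256,
    },
}

def conv_units(architecture):
    """List the number of units per convolution layer for an architecture."""
    return [layer["units"] for layer in ARCHITECTURE_PARAMS[architecture]["conv_layers"]]
```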

5.2 Dataset

The manually labeled data are created by capturing original captchas from each of the major Indian government websites considered in this work. The extraction of real-time captchas began in August 2021. Annotating 1,000 data-points by scraping a website requires nearly one and a half working hours. As the time required varies from annotator to annotator for the same amount of work, a working hour is roughly calculated as the average time spent by five human annotators to annotate 1,000 data-points for a specific website. The number of data-points annotated for each website, along with the working hours spent on the annotation, is shown in Table 2. The Type I captcha requires less time for manual annotation than the Type II and Type III captchas, as only the captcha text characters present in the image need to be annotated. The time requirement for Type I captchas also varies according to the number of characters and the complexity of the text present in the images, as described in Table 1. In annotating the Type II and Type III captchas, the annotator needs to record the task instruction, the captcha solution, and the text character sequence depicted in the image. We also prepare a synthetic dataset for testing the Type III captcha architecture, as there is a class imbalance in the type of task information. The synthetic dataset is created by superimposing the task information of a Type II captcha along with the expression in an image to convert it into a Type III captcha. The synthetic dataset is used for architecture validation only.

Table 2.

Sl. No.	Corresponding Website	Captcha Type	Total Number of Captchas	Working Hours (Approx.)
1.	Voter ID	Type I	6,000	5
2.	AADHAAR	Type I	3,000	2
3.	Bengal Bhumi	Type I	2,500	2.5
4.	Pariwahan Seva	Type I	2,500	2.5
5.	Passport India	Type I	7,000	6.5
6.	Indian Post	Type II	6,000	7
7.	IRCTC	Type III	7,000	7.5

Table 2. Real-time Captcha Dataset Description and Annotating Time for the Websites
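The annotation effort in Table 2 can be compared across websites by dividing the number of captchas by the reported working hours. The short sketch below does only that arithmetic; the figures are copied from Table 2, while the variable and function names are illustrative.

```python
# (website, captcha type, total captchas, approximate working hours) from Table 2.
TABLE_2 = [
    ("Voter ID", "Type I", 6000, 5.0),
    ("AADHAAR", "Type I", 3000, 2.0),
    ("Bengal Bhumi", "Type I", 2500, 2.5),
    ("Pariwahan Seva", "Type I", 2500, 2.5),
    ("Passport India", "Type I", 7000, 6.5),
    ("Indian Post", "Type II", 6000, 7.0),
    ("IRCTC", "Type III", 7000, 7.5),
]

def captchas_per_working_hour(total, hours):
    """Average number of captchas annotated per reported working hour."""
    return total / hours

rates = {
    site: round(captchas_per_working_hour(total, hours), 1)
    for site, _ctype, total, hours in TABLE_2
}

for site, rate in rates.items():
    print(f"{site}: {rate} captchas per working hour")
```

On these figures, the Type II and Type III datasets have the lowest rates, which agrees with the remark that they need more annotation per captcha than Type I.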

5.3 Executing Environment

Training and validation of the architectures are executed using the TensorFlow [1] version 1.8 library in Python 3.6 [36]. The hardware system used for training the network consists of a single Nvidia Tesla V100 GPU. The inference and testing process is carried out on a smaller desktop Nvidia 1050Ti GPU.

6 RESULTS

We analyze the performance on the seven major Indian government websites using the specific captcha solver architecture for the type of captcha system used by each website. We first compare the accuracy of predicting the character sequence present in the captcha image on the training dataset in Architecture I using static size and dynamic size decoder LSTMs. We observe that, on the training dataset, the accuracy of the model with the dynamic size LSTM is greater than that with the static size LSTM in every epoch, as shown in Figure 8(a). The dynamic size LSTM decoder model also converges more quickly than the static size LSTM decoder on the training dataset, as shown in Figure 8(b). Figures 8(c) and 8(d) represent the prediction accuracy with respect to the number of epochs on the development and test sets, respectively. It is clear from Figure 8 that the dynamic size LSTM yields better and more stable performance in fewer epochs than the static size LSTM on the training, development, and test datasets. We demonstrate the comparison using the Bengal Bhumi website (A.3) captcha dataset; the other website captcha datasets, mentioned in Section 3.2, were also analyzed and show the same results.

For each website, the model is trained multiple times by varying the size of the training dataset. In the analysis, the focus is on two aspects: one is to obtain the maximum accuracy for all the datasets; the other is to approximate the size of the training dataset required to obtain a minimum accuracy of 80% on a test set. Figure 9 demonstrates the accuracy on the development and test sets with respect to the size of the training dataset (number of data samples) for the Indian government websites using the Type I captcha scheme. The accuracy ranges from 83.4% to 98.5%, while the working hours for manually annotating a single Type I dataset vary from 2 hours to 6.5 hours, as mentioned in Table 2. The maximum test accuracy for each of the website captcha datasets, as shown in Figure 9, is listed in Table 5. The minimum number of training data samples required to obtain 80% accuracy on each government website captcha dataset is also listed in Table 5.
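The second aspect, finding the smallest training set that reaches 80% test accuracy, amounts to a scan over the learning curve. The sketch below assumes the curve is available as (training size, test accuracy) pairs; the numbers in the example are placeholders chosen for illustration and are not results from Figure 9 or Table 5.

```python
def min_samples_for_accuracy(curve, threshold=0.80):
    """Smallest training-set size whose test accuracy meets the threshold.

    curve: iterable of (num_training_samples, test_accuracy) pairs.
    Returns None if no point on the curve reaches the threshold.
    """
    for num_samples, accuracy in sorted(curve):
        if accuracy >= threshold:
            return num_samples
    return None

# Placeholder learning curve (not measured data).
example_curve = [(2000, 0.90), (500, 0.42), (1500, 0.83), (1000, 0.71)]

needed = min_samples_for_accuracy(example_curve)
unreachable = min_samples_for_accuracy(example_curve, threshold=0.99)
```

Multiplying the returned size by the per-captcha annotation time would then give an estimate of the working hours needed to reach the threshold.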

Fig. 9. Accuracy for development and test set with respect to the number of training data samples considering the websites (a) Voter ID (A.1), (b) AADHAAR (A.2), (c) Bengal Bhumi (A.3), (d) Pariwahan Seva (A.4), and (e) Passport India (A.5).

The performance analysis of the Type II captcha scheme used by the Indian Post website is presented in Figure 10. The dataset is tested with both Architecture IIA and Architecture IIB, as shown in Figures 10(a) and 10(b), respectively. The maximum accuracy obtained for both model architectures is 93.52%. However, it is important to note that Architecture IIB performs better than Architecture IIA when the number of training data samples is small. This gain is expected, as the network in Architecture IIB learns from two fronts (the output prediction label and the expression text label). With an increase in the number of training data samples, the performance of the two architectures becomes relatively similar. From the viewpoint of an attacker, Architecture IIB is more significant, as it remains effective with very few training data samples.

Fig. 10. Accuracy for development and test set with respect to the number of training data samples on the (a) Indian Post (A.6), Architecture IIA, (b) Indian Post (A.6), Architecture IIB, (c) IRCTC (A.7), and (d) IRCTC (A.7) synthetic datasets.

The performance analysis of the Type III captcha system using Architecture III is shown in Figure 10. Figure 10(c) depicts the test accuracy with respect to the number of training data samples for the Type III IRCTC (A.7) captcha dataset. A high test accuracy of 96.07% has been achieved with the real-time dataset. However, there is a class imbalance problem in the IRCTC (A.7) captcha dataset. To test the effectiveness of Architecture III, we analyzed the model with the synthetic dataset of Type III captchas. The test result on the synthetic dataset is shown in Figure 10(d), where a maximum accuracy of more than 93% is obtained using Architecture III. The synthetic dataset is used only for validation of the proposed architecture and is not used for the report analysis of government websites, for obvious reasons.

The number of epochs required for convergence in each training phase depends on the captcha dataset and the corresponding architecture. We keep a few important hyper-parameters, listed in Table 3, constant during training for all the models and datasets. The inference time required to solve a captcha using the corresponding architecture on a low-power desktop GPU is given in Table 4. It is clear from Table 4 that all the architectures have a low inference time and therefore meet the requirements of an attacker.

Table 3.

Sl. No.	Hyper-parameter	Value
1.	Batch Size	128
2.	Learning Rate	0.001
3.	Dropout	0.75
4.	Optimizer	Adam

Table 3. Constant Hyper-parameter for All Training Runs

Table 4.

Sl. No.	Captcha Type	Architecture Type Used	Inference Time (s)
1.	Type I	Architecture I	0.570
2.	Type II	Architecture IIA	0.817
3.	Type II	Architecture IIB	0.945
4.	Type III	Architecture III	1.063

Table 4. Inference Time for Each Type of Captcha and Their Corresponding Model Architecture
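To put the inference times of Table 4 in an attacker's terms, they can be converted into an upper bound on captchas solved per hour. The sketch below assumes strictly sequential solving on one GPU and ignores network latency, rate limiting, and failed attempts, so the real rate would be lower.

```python
# Inference time in seconds per captcha, copied from Table 4.
INFERENCE_TIME_S = {
    "Architecture I": 0.570,
    "Architecture IIA": 0.817,
    "Architecture IIB": 0.945,
    "Architecture III": 1.063,
}

def solves_per_hour(seconds_per_captcha):
    """Upper bound on sequential solves in one hour (model time only)."""
    return 3600.0 / seconds_per_captcha

throughput = {name: int(solves_per_hour(t)) for name, t in INFERENCE_TIME_S.items()}

for name, count in throughput.items():
    print(f"{name}: at most {count} captchas per hour")
```

Even the slowest pipeline, Architecture III, stays above three thousand solves per hour under these assumptions.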

7 DISCUSSIONS

In this section, we discuss various parameters upon which the vulnerability of the websites has been tested. We focus on the human effort for annotation and the architecture complexity to determine the overall risk of the current captcha system. In the latter part of this section, we discuss the inference we draw from the work with a few suggestions for improvement. Justification of attempting the work has been explained considering related state-of-the-art studies.

7.1 Analysis

We calculate the robustness of the captcha system to rate the vulnerabilities of each particular website. The robustness of a captcha system is directly proportional to the human effort required to solve the system. In the context of deep architectures, the robustness of the captcha system mainly depends on two important factors: the complexity of the architecture and the number of manually annotated captchas required to train an effective captcha solver. The complexity of an architecture is measured by the number of encoder-decoder modules used in the respective architecture, as shown in Table 2. The human effort required to annotate each captcha type differs and depends on factors like the presence of text instructions, the maximum number of characters in the label, the label type, and the type of operation specified by the instruction. The working hours for annotating each captcha dataset are specified in Table 2. The working hours are considered in evaluating the complexity of the captcha system. The other important aspect is the accuracy; a minimum test accuracy of 80% has been considered as the threshold for an effective captcha solver model. The minimum number of annotated captchas required to obtain 80% accuracy for each model is the evaluation criterion considered in this article.

The Robust score (\(\gamma\)) for a captcha system is evaluated by combining the Architecture complexity value (\(\alpha\)) with the manual annotation working hours required to achieve 80% accuracy for the model (\(\beta\)), as defined in Equation (13). Table 5 shows the Robust score for each dataset and the corresponding model architecture. A higher Robust score represents a more complex and hard-to-break captcha system,

(13) \(\begin{equation} \begin{split} Robust\hspace{2.84526pt}Score(\gamma) = \log (\alpha) * \log (\beta + 1), \end{split} \end{equation}\)

where \(\alpha\) is the number of Encoder-Decoder units in the solver architecture and \(\beta\) is the approximate number of working hours required to manually annotate the data-points needed to reach 80% accuracy.

Table 5. Robust Score for the Website Dataset and their Corresponding Model Architecture
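Equation (13) can be evaluated directly, as in the sketch below. The base of the logarithm is not stated in the text, so the natural logarithm is assumed here, and the input values in the example are arbitrary illustrations rather than entries from Table 5.

```python
import math

def robust_score(alpha, beta):
    """Robust score of Equation (13): log(alpha) * log(beta + 1).

    alpha: number of encoder-decoder units in the solver architecture.
    beta:  approximate working hours needed to reach 80% accuracy.
    Natural logarithm is an assumption; the text does not give the base.
    """
    if alpha < 1 or beta < 0:
        raise ValueError("alpha must be >= 1 and beta must be >= 0")
    return math.log(alpha) * math.log(beta + 1)

# Arbitrary illustrative inputs.
score_three_units = robust_score(3, 7.5)
score_two_units = robust_score(2, 7.5)
score_one_unit = robust_score(1, 7.5)
```

Two properties follow from the formula itself: the score grows with both \(\alpha\) and \(\beta\), and it is zero whenever \(\alpha = 1\), whatever the annotation effort.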

The risk to a website depends on the importance of that website and how vulnerable it is to attacks. The risk to a website is measured by subdividing it into two major categories: (i) the intrinsic properties corresponding to the vulnerability of the captcha system and (ii) the extrinsic properties relevant to launching an attack.

The first property (i.e., the intrinsic properties corresponding to the vulnerability of the captcha system) is inversely proportional to the Robust score, which measures the strength of the captcha system. The extrinsic properties, i.e., the non-captcha factors that attract or discourage an attacker from launching an attack on a website, are determined through the calculation of the Website value score. The Website value score of a government website depends on three properties: whether the website has important scraping content, whether it is susceptible to a DoS attack, and whether it has complex fields in the web page. Complex fields on the web page refer to text boxes whose values are hard to predict by random guessing. Equation (14) represents the formulation of the Website value score, and the values for the corresponding websites are shown in Table 6,

(14) \(\begin{equation} {\it Website} \hspace{2.84526pt}{\it value}\hspace{2.84526pt}{\it score} (\delta)= \phi + \chi - (0.5 * \psi), \end{equation}\)

where

\(\begin{align*} & \phi = {\left\lbrace \begin{array}{ll} 1, & \text{if the website has important scraping content}\\ 0, & \text{otherwise} \end{array}\right.} \\ & \chi = {\left\lbrace \begin{array}{ll} 1, & \text{if the website is susceptible to a DoS attack}\\ 0, & \text{otherwise} \end{array}\right.} \\ & \psi = {\left\lbrace \begin{array}{ll} 1, & \text{if the website has a complex field in the web page along with the captcha system}\\ 0, & \text{otherwise} \end{array}\right.}. \end{align*}\)

Table 6.

Table 6. Risk Assessment for the Organizational Websites

The final Risk score for each of the considered websites is shown in Table 6 and is calculated using Equation (15),

(15) \(\begin{equation} {\it Risk}\hspace{2.84526pt}{\it score} (\varphi)= \log (\gamma * (1 / \delta)). \end{equation}\)
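Equations (14) and (15) can be written as two small functions, sketched below. The logarithm base is again assumed to be natural, the function and argument names are illustrative, and the example inputs are arbitrary rather than values from Table 6. The sketch does not interpret the direction of the score; it only evaluates the formulas as written.

```python
import math

def website_value_score(has_scraping_content, dos_susceptible, has_complex_field):
    """Website value score of Equation (14): phi + chi - 0.5 * psi."""
    phi = 1 if has_scraping_content else 0
    chi = 1 if dos_susceptible else 0
    psi = 1 if has_complex_field else 0
    return phi + chi - 0.5 * psi

def risk_score(gamma, delta):
    """Risk score of Equation (15): log(gamma * (1 / delta)).

    gamma: Robust score of the captcha system.
    delta: Website value score.
    The logarithm is undefined unless both values are positive.
    """
    if gamma <= 0 or delta <= 0:
        raise ValueError("gamma and delta must both be positive")
    return math.log(gamma * (1 / delta))

# Arbitrary illustrative inputs.
delta_example = website_value_score(True, True, True)
risk_example = risk_score(2.0, 2.0)
```

Equation (14) can yield zero or negative values (for example, \(\phi = \chi = 0\) and \(\psi = 1\) gives \(-0.5\)), in which case Equation (15) has no real value; the sketch raises an error for such inputs instead of guessing how they should be handled.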

7.2 Inferences and Suggestions

Based on the analysis, a few inferences have been derived from Tables 5 and 6. The solvers for Type II and Type III captchas have more complex architectures, and these captcha systems tend to have a high Robust score. However, the systems using Type I captcha show high variance in the Robust score. It has been observed that models for captcha systems using variable-length labels are hard to converge, and these systems have a high Robust score, as shown in Table 5. On the contrary, models for fixed-character Type I captchas are easy to train, and these systems have a low Robust score.

From the analysis, it is clear that a well-designed variable-character Type I captcha system works on par with the Type II captcha, but the Type III captcha is the most complex and most difficult to break. There is still some room for improvement in designing Type I captchas by making changes in the images. The Type II and Type III captchas both have considerable room for improvement. In their design, the possible improvement is not limited to the images; there is also scope for improvement in the text instructions and expressions. One such option is to use a more complex mathematical expression. In the original captcha dataset of the website, the only arithmetic operations are addition and subtraction. An extended operation like multiplication, division, or any other mathematical expression that is easy for humans to evaluate is encouraged.

The combination of natural language and image in captcha could be the future of the captcha system, and a lot of improvement is possible. But the focus of our work is to explore and escalate the vulnerabilities involved in the government websites where the current captcha systems are deployed. The proposed architectures are able to solve different types of captcha systems with high efficiency using the original captcha dataset of the websites.

7.3 Justification to State of the Art

In this work, we use neural network models to solve text and text instructions–based captchas. Neural network models are highly effective in solving text captchas, as reported in previous studies [41, 48]. They are easy to implement because of their black box nature and could be used by an attacker with minimum effort and domain knowledge. The other significant advantage of neural network models is the inference speed of neural network pipelines. The work reported by the team of Gao [10, 11], Ye et al. [48], and Tang et al. [41] provides enough practical evidence that a neural network model has a much faster inference speed than the traditional methods. Therefore, the neural network pipeline is equally effective in solving text and text instructions captchas, which we analyzed and presented in this article. From the viewpoint of an attacker, the inference speed always plays a pivotal role when a web server is highly vulnerable to frequent attacks. In the study, we observe that the IRCTC website (A.7), which provides a critical service during Tatkal time, could be highly affected by fast attacks. Along with critical service disruption, data extraction is faster with a high-speed inference model. An attacker plans to scrape data from a website by filling text fields in the web page with well-defined random values and solving the required captcha. Therefore, to scrape data from a large government website, neural network models with low inference time are favored. In this article, we develop and propose neural network models chosen for their effectiveness and inference speed, both of which are key factors in solving captchas.

The model used for Type I captcha is similar to the work proposed by Garg and Pollett [13] and Ye et al. [48]. Both studies have a similar basic CNN structure known as LeNet [26]. Garg and Pollett considered a completely synthetic grey captcha image dataset to train the model. The synthetic dataset contains 1 million simple images of fixed length 5, 2 million complex images of fixed length 5, and 13 million images of variable length. However, it is not feasible to create such a large dataset from a real-time captcha system for each website. Though the architecture is simple, they overestimate the required size of the dataset. Our analysis shows that the architecture could also be effective using only a few thousand data points. Our model is a modified version of the one proposed by Garg and Pollett, consisting of three convolution layers instead of two. Their CNN model did not use batch normalization, a technique that reduces the need for a huge number of training samples and the training time. In this article, we demonstrate that the fine-tuned architecture could be adequately trained from scratch to break a captcha system with a limited amount of training data. From the analysis, we show that the model is equally effective with color captcha images. Another work, proposed by Ye et al. [48] (explained in Section 2), uses a GAN architecture for Type I captcha and has a major bottleneck from the viewpoint of an attacker. The attacker needs to put an extensive amount of effort into designing the base classifier by considering various captcha schemes. Moreover, an attacker has to effectively implement the complex fine-tuning process of captcha layers for each new captcha system. The number of captcha layers that need to be retrained adds another hyper-parameter to the task. The model also uses a static FC layer for prediction of captcha labels in place of a dynamic decoder, which makes the architecture less adaptable to a new captcha scheme with variable length.
In this article, the model considered for the analysis of Type I text captcha is simple, effective, and easy to implement. The model architecture can be implemented for any captcha system from scratch and independently from any other captcha scheme, unlike the approach used by Ye et al. [48], which is dependent on various other captcha schemes used for training of the base solver.

The text instructions–type captchas are relatively new and less explored. To the best of our knowledge, there is no solution architecture proposed to date for text instructions captcha. A novel architecture pipeline with a faster inference process has been proposed in this article for text instructions (Type II and Type III) captchas. The architecture is easy to understand and effectively uses multiple encoder-decoder models to predict the correct captcha answer. It is important to note that for Type II captcha, a statistical or rule-based system could be combined with the neural network model for correct label prediction, where the rule-based system is applied for interpreting the instructions. However, such a system fails when the number of instructions is large enough. The proposed models will thrive in such a situation, as the architecture incorporates the language model or language understanding approach [30, 38], which eliminates the rule-based reasoning process. The use of a neural network pipeline architecture also significantly increases the inference speed. In Type III text instructions captcha, the text instructions need to be extracted from the image along with the text expression. The LSTM attention module helps to predict the word, which, in turn, is used to predict the text instructions. An attention module also plays a vital role in differentiating between the expression and the text instructions present in the single image. Having a single pipeline is the major challenge in solving such captchas, and the model proposed in the work effectively satisfies the need, demonstrating the novelty of the work.
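For contrast with the proposed pipeline, the rule-based alternative mentioned above can be sketched as a lookup from instruction text to a handler. The instruction phrases are hypothetical (the actual Indian Post instructions are not listed in the text), the expression string is assumed to come from a separate recognizer, and only addition and subtraction are handled, as in the original dataset.

```python
import re

def evaluate_expression(expr):
    """Evaluate a two-operand expression such as '6+8=' (addition or subtraction)."""
    match = re.fullmatch(r"\s*(\d+)\s*([+-])\s*(\d+)\s*=?\s*", expr)
    if match is None:
        raise ValueError(f"unsupported expression: {expr!r}")
    left, op, right = int(match[1]), match[2], int(match[3])
    return str(left + right if op == "+" else left - right)

def first_number(expr):
    """Return the first number that appears in the expression."""
    match = re.search(r"\d+", expr)
    if match is None:
        raise ValueError(f"no number in expression: {expr!r}")
    return match.group(0)

# Hypothetical instruction phrases; every new phrasing needs a new rule.
RULES = {
    "evaluate the expression": evaluate_expression,
    "enter the first number": first_number,
}

def solve(instruction, expr):
    """Look up the rule for an instruction and apply it to the expression."""
    handler = RULES.get(instruction.strip().lower())
    if handler is None:
        raise KeyError(f"no rule for instruction: {instruction!r}")
    return handler(expr)
```

The limitation discussed above shows up as soon as an unseen instruction arrives: `solve("Unknown task", "6+8=")` raises a `KeyError`, whereas a learned task encoder has at least a chance of generalizing to rephrased instructions.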

8 CONCLUSION

For the first time, we systematically analyze the captcha vulnerabilities associated with the top government websites in India. The robustness of a captcha system is tested based on the architecture complexity of the solver models and the working hours required to manually annotate the dataset. We use original captcha data-points from each of the websites to test the associated model. The proposed solver models achieve a success rate of more than 80% on all the websites we consider for analysis. The results clearly indicate that a few of the captcha systems are highly vulnerable and have critical status. It is also clear from the experiment that none of the website captcha systems is unbreakable, and thus each requires careful attention.

The major contribution of our work is to propose a novel neural network pipeline to solve text instructions–based captcha. The instruction-based captcha has mainly two variations: one has the instruction text embedded with the expression in an image, and the other features the instruction text in text format with the expression specified in the image. We propose three pipelines to solve both text instructions–based captcha schemes. We use more than one encoder-decoder deep learning module to combine the text and image information available in the captcha. A combination of CNN, LSTM, and LSTM with attention network is used together in the proposed architecture. A list of proven hyper-parameters for each of the models is provided to re-create the work. It is important to note that the proposed architectures for solving text instructions–based captcha have the potential to be useful in applications that closely combine text and image.

We also come up with an ecosystem and procedure to rate the overall risk of a captcha system used on a website. Extensive results are provided for every selected website, which determine the human effort required to solve the captcha system. The captcha systems are rated based on their robustness. Each of the websites is thoroughly checked to determine the importance of the captcha system, and an Information value score is provided based on that. The Information value score and the robustness determine the overall risk involved in the organizational websites. The alarming finding deduced from the study is how easily, with very few working hours, a captcha solver can be built from scratch. We hope the proposed work can inspire government organizations and the research community to revisit the designs of the captcha systems used on their websites.

APPENDICES

A DETAILS OF INDIAN GOVERNMENT WEBSITE CONSIDERED IN THE STUDY

A.1 Voter ID

The URL https://electoralsearch.in/ is the website for National Voter Services of India. The website provides information to the Indian citizens about their voter ID card, which is an identity document issued by the Election Commission of India. The voter ID information primarily serves as an identity proof for Indian citizens while casting their ballots in the country’s municipal, state, and national elections. Scraping this portal efficiently by filling up well-defined random values could effectively retrieve personal details of Indian voters. At present, the website uses a Type I captcha system to restrict automatic web scraping from the website.

A.2 AADHAAR

The URL https://uidai.gov.in/ is the website for the Unique Identification Authority of India (UIDAI). UIDAI provides an Aadhaar ID that contains a 12-digit unique identification number, called the Aadhaar number, for every Indian citizen. The Aadhaar ID contains basic information like Mobile Number, Address, Date of Birth, and so on. The system is created in such a way that users can easily download their own copy of e-Aadhaar, as well as get the update status of their application for an Aadhaar card. However, extraction of basic information for a large number of Indian people from the website is a potential risk. The Type I captcha system is currently being used on the UIDAI website.

A.3 Bengal Bhumi

The URL http://banglarbhumi.gov.in/LRWEB/ is the website for the Department of Land & Land Reforms (LLRD) under the state government of West Bengal, India. The LLRD plays a crucial role in the administration and management of land records information. The website provides an interface to fetch Khatian (land) and plot information of every single village and city in relation to mouzas (citizens). The land asset information of every citizen can be scraped automatically from this site by breaking the Type I captcha currently used on the website.

A.4 Pariwahan Seva

The URL https://vahan.parivahan.gov.in/ is a website under the Ministry of Road Transport and Highways, which is part of the government of India. The website issues registration certificates and driver’s licenses. It is also useful for retrieving owner and vehicle information. The website uses a Type I captcha system, and by breaking that system, a web scraper could extract the owner information associated with a vehicle.

A.5 Passport India

The URL https://portal2.passportindia.gov.in/ is a website under the Department of Passport and Visa Division, Ministry of External Affairs, government of India. The website is mainly used for checking passport application and passport status-related information. Any information regarding a passport is very sensitive, yet only a basic Type I captcha stands in the way of scraping.

A.6 Indian Post

The URL www.indiapost.gov.in is the official website for the Indian Postal Service. It is used for tracking consignments, which include speed posts and ordinary posts. The website currently uses a Type II captcha system. All details about consignments moving from one location to another can easily be scraped from the website once the captcha system is bypassed. The information on a single day's consignment movements, which includes useful tracking details, can be captured in a few hours.

A.7 IRCTC

The URL http://irctc.co.in is the website for booking tickets on the Indian railway. It is one of the biggest websites in India in terms of traffic and usability. Tatkal Booking (emergency booking) for the next day is possible during a specific time slot each day, and the number of requests made to the site in this time slot is huge. A simple script can be written that fills in the reservation form automatically and sends a service request to the server. If the number of false service requests is high, the server will not be able to serve the valid requests, and service will be disrupted. A Type III captcha system is implemented on the website to prevent computer bots from making automatic false server requests.

B APPENDIX OTHER INDIAN GOVERNMENT WEBSITE USING CAPTCHA SYSTEM

GOVERNMENT WEBSITE	CORRESPONDING LINK
Supreme Court of India	https://www.sci.gov.in/user
Government e-Marketplace	https://www.gem.gov.in
Medical Care a Digital India Initiative	https://ors.gov.in/followup/?hosid=4
National Human Rights Commission	http://www.hrcnet.nic.in/HRCNet/public/
National Commission for Women	http://ncwapps.nic.in/onlinecomplaintsv2/
Office of the Principal Scientific Adviser to the government of India	http://psa.gov.in/search/
NVSP Service Portal	https://www.nvsp.in/Forms/Forms/form6A
National Authority CWC	https://ngodarpan.nacwc.gov.in/
Direct Benefit Transfer government of India	https://dbtbharat.gov.in/auth/login
Directorate of Printing Department of Publication	http://egazette.nic.in/(S(zrcphm2ze3qycy4tswhun1ac))
Insurance Regulatory and Development Authority	https://www.irdai.gov.in/Defaulthome.aspx?page=H1

Supplemental Material

dtrap-2022-0017-file002.zip (3 MB): supplementary material

REFERENCES

[1] Abadi Martín, Agarwal Ashish, Barham Paul, Brevdo Eugene, Chen Zhifeng, Citro Craig, Corrado Greg S., Davis Andy, Dean Jeffrey, Devin Matthieu, Ghemawat Sanjay, Goodfellow Ian, Harp Andrew, Irving Geoffrey, Isard Michael, Jia Yangqing, Jozefowicz Rafal, Kaiser Lukasz, Kudlur Manjunath, Levenberg Josh, Mané Dan, Monga Rajat, Moore Sherry, Murray Derek, Olah Chris, Schuster Mike, Shlens Jonathon, Steiner Benoit, Sutskever Ilya, Talwar Kunal, Tucker Paul, Vanhoucke Vincent, Vasudevan Vijay, Viégas Fernanda, Vinyals Oriol, Warden Pete, Wattenberg Martin, Wicke Martin, Yu Yuan, and Zheng Xiaoqiang. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Retrieved from http://tensorflow.org/.Google Scholar
Reference
[2] Bahdanau Dzmitry, Cho Kyunghyun, and Bengio Yoshua. 2014. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473. Retrieved from https://arxiv.org/abs/1409.0473.Google Scholar
Reference 1Reference 2
[3] Bursztein Elie, Aigrain Jonathan, Moscicki Angelika, and Mitchell John C.. 2014. The end is nigh: Generic solving of text-based captchas. In Proceedings of the 8th USENIX Workshop on Offensive Technologies (WOOT’14).Google Scholar
Reference 1Reference 2
[4] Chandavale Anjali Avinash, Sapkal Ashok M., and Jalnekar Rajesh M.. 2009. Algorithm to break visual CAPTCHA. In Proceedings of the 2nd International Conference on Emerging Trends in Engineering & Technology. IEEE, 258–262.Google ScholarDigital Library
Reference
[5] Chellapilla Kumar, Larson Kevin, Simard Patrice Y., and Czerwinski Mary. 2005. Computers beat Humans at Single Character Recognition in Reading based Human Interaction Proofs (HIPs). In Proceedings of the Annual Conference of the Council of European Aerospace Societies (CEAS’05).Google Scholar
Reference
[6] Chellapilla Kumar and Simard Patrice Y.. 2005. Using machine learning to break visual human interaction proofs (HIPs). In Advances in Neural Information Processing Systems. 265–272.Google Scholar
Reference
[7] Chen Jun, Luo Xiangyang, Guo Yanqing, Zhang Yi, and Gong Daofu. 2017. A survey on breaking technique of text-based CAPTCHA. Secur. Commun. Netw. (2017) 1–15. DOI:Google ScholarDigital Library
Reference 1Reference 2Reference 3
[8] Chung Junyoung, Gulcehre Caglar, Cho KyungHyun, and Bengio Yoshua. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning.Google Scholar
Reference
[9] Clevert Djork-Arné, Unterthiner Thomas, and Hochreiter Sepp. 2015. Fast and accurate deep network learning by exponential linear units (elus). arXiv:1511.07289. Retrieved from https://arxiv.org/abs/1511.07289.Google Scholar
Reference
[10] Gao Haichang, Tang Mengyun, Liu Yi, Zhang Ping, and Liu Xiyang. 2017. Research on the security of microsoft’s two-layer captcha. IEEE Trans. Inf. Forens. Secur. 12, 7 (2017), 1671–1685.Google ScholarDigital Library
Reference 1Reference 2Reference 3
[11] Gao Haichang, Yan Jeff, Cao Fang, Zhang Zhengya, Lei Lei, Tang Mengyun, Zhang Ping, Zhou Xin, Wang Xuqin, and Li Jiawei. 2016. A simple generic attack on text captchas. In Network and Distributed System Security Symposium. DOI:Google ScholarCross Ref
Reference 1Reference 2
[12] Gao Song, Mohamed Manar, Saxena Nitesh, and Zhang Chengcui. 2017. Emerging-image motion CAPTCHAs: Vulnerabilities of existing designs, and countermeasures. IEEE Trans. Depend. Sec. Comput. (2017).Google Scholar
Reference
[13] Garg Geetika and Pollett Chris. 2016. Neural network captcha crackers. In Proceedings of the Future Technologies Conference (FTC’16). IEEE, 853–861.Google ScholarCross Ref
Navigate to
Reference 1
Reference 2
Reference 3
Reference 4
Reference 5
Reference 6
[14] Graves Alex, Fernández Santiago, and Schmidhuber Jürgen. 2005. Bidirectional LSTM networks for improved phoneme classification and recognition. In International Conference on Artificial Neural Networks. Springer, 799–804.Google ScholarDigital Library
Reference
[15] Guo Yanming, Liu Yu, Oerlemans Ard, Lao Songyang, Wu Song, and Lew Michael S.. 2016. Deep learning for visual understanding: A review. Neurocomputing 187 (2016), 27–48.Google ScholarDigital Library
Reference
[16] He Kaiming, Zhang Xiangyu, Ren Shaoqing, and Sun Jian. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.Google ScholarCross Ref
Reference
[17] He Kaiming, Zhang Xiangyu, Ren Shaoqing, and Sun Jian. 2016. Deep residual learning for image recognition. In Proceeding of the Computer Vision and Pattern Recognition (CVPR’16). Las Vegas, NV, 770–778. DOI:Google ScholarCross Ref
Reference
[18] Hecht-Nielsen Robert. 1992. Theory of the backpropagation neural network. In Neural Networks for Perception. Elsevier, 65–93.Google ScholarCross Ref
Reference
[19] Hochreiter Sepp and Schmidhuber Jürgen. 1997. Long short-term memory. Neural Comput. 9, 8 (1997), 1735–1780.Google ScholarDigital Library
Reference
[20] Ioffe Sergey and Szegedy Christian. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167. Retrieved from https://arxiv.org/abs/1502.03167.Google Scholar
Reference
[21] Jain Anil K., Mao Jianchang, and Mohiuddin K Moidin. 1996. Artificial neural networks: A tutorial. Computer 29, 3 (1996), 31–44.Google ScholarDigital Library
Reference
[22] Karpathy Andrej and Fei-Fei Li. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3128–3137.Google ScholarCross Ref
Reference
[23] Karthik CHBL-P. and Recasens Rajendran Adria. 2015. Breaking Microsoft’s CAPTCHA. Technical Report.Google Scholar
Reference
[24] LeCun Yann, Bengio Yoshua, and Hinton Geoffrey. 2015. Deep learning. Nature 521, 7553 (2015), 436.Google ScholarCross Ref
Reference
[25] LeCun Yann, Boser Bernhard, Denker John S., Henderson Donnie, Howard Richard E., Hubbard Wayne, and Jackel Lawrence D.. 1989. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1, 4 (1989), 541–551.Google ScholarDigital Library
Reference
[26] Lecun Yann, Bottou Leon, Bengio Y., and Haffner Patrick. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86 (121998), 2278–2324. DOI:Google ScholarCross Ref
Reference 1Reference 2
[27] Lin Dazhen, Lin Fan, Lv Yanping, Cai Feipeng, and Cao Donglin. 2018. Chinese character CAPTCHA recognition and performance estimation via deep neural network. Neurocomputing 288 (2018), 11–19.Google ScholarDigital Library
Reference
[28] Luong Minh-Thang and Manning Christopher D.. 2015. Stanford neural machine translation systems for spoken language domains. In Proceedings of the International Workshop on Spoken Language Translation. 76–79.Google Scholar
Reference
[29] Luong Minh-Thang, Pham Hieu, and Manning Christopher D.. 2015. Effective approaches to attention-based neural machine translation. arXiv:1508.04025. Retrieved from https://arxiv.org/abs/1508.04025.Google Scholar
Reference
[30] Mikolov Tomáš, Karafiát Martin, Burget Lukáš, Černockỳ Jan, and Khudanpur Sanjeev. 2010. Recurrent neural network based language model. In Proceedings of the 11th Annual Conference of the International Speech Communication Association.Google ScholarCross Ref
Reference
[31] Mohamed Manar, Sachdeva Niharika, Georgescu Michael, Gao Song, Saxena Nitesh, Zhang Chengcui, Kumaraguru Ponnurangam, Oorschot Paul C. van, and Chen Wei-Bang. 2014. A three-way investigation of a game-CAPTCHA: Automated attacks, relay attacks and usability. In Proceedings of the 9th ACM Symposium on Information, Computer and Communications Security. ACM, 195–206.Google ScholarDigital Library
Reference
[32] Mori Greg and Malik Jitendra. 2003. Recognizing objects in adversarial clutter: Breaking a visual CAPTCHA. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1. IEEE, I–I.Google ScholarCross Ref
Reference
[33] Pan Sinno Jialin and Yang Qiang. 2010. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 10 (2010), 1345–1359.Google ScholarDigital Library
Reference
[34] Pedamonti Dabal. 2018. Comparison of non-linear activation functions for deep neural networks on MNIST classification task. arXiv:1804.02763. Retrieved from http://arxiv.org/abs/1804.02763.Google Scholar
Reference
[35] Pennington Jeffrey, Socher Richard, and Manning Christopher. 2014. Glove: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’14). 1532–1543.Google ScholarCross Ref
Reference
[36] Rossum Guido. 1995. Python Reference Manual. Technical Report. Amsterdam, The Netherlands.Google ScholarDigital Library
Reference
[37] Simonyan Karen and Zisserman Andrew. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. Retrieved from https://arxiv.org/abs/1409.1559.Google Scholar
Reference 1Reference 2
[38] Sundermeyer Martin, Schlüter Ralf, and Ney Hermann. 2012. LSTM neural networks for language modeling. In Proceedings of the 13th Annual Conference of the International Speech Communication Association.Google ScholarCross Ref
Reference
[39] Sutskever Ilya, Vinyals Oriol, and Le Quoc V.. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.Google ScholarDigital Library
Reference
[40] Szegedy Christian, Vanhoucke Vincent, Ioffe Sergey, Shlens Jonathon, and Wojna Zbigniew. 2016. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16), 2818–2826.Google Scholar
Reference
[41] Tang Mengyun, Gao Haichang, Zhang Yang, Liu Yi, Zhang Ping, and Wang Ping. 2018. Research on deep learning techniques in breaking text-based captchas and designing image-based captcha. IEEE Trans. Inf. Forens. Secur. 13, 10 (2018), 2522–2537.Google ScholarCross Ref
Navigate to
Reference 1
Reference 2
Reference 3
Reference 4
Reference 5
Reference 6
[42] Vinyals Oriol, Toshev Alexander, Bengio Samy, and Erhan Dumitru. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15). IEEE, 3156–3164.Google ScholarCross Ref
Reference
[43] Ahn Luis Von, Blum Manuel, Hopper Nicholas J., and Langford John. 2003. CAPTCHA: Using hard AI problems for security. In International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 294–311.Google ScholarCross Ref
Reference
[44] Ahn Luis Von, Blum Manuel, and Langford John. 2004. Telling humans and computers apart automatically. Commun. ACM 47, 2 (2004), 56–60.Google ScholarDigital Library
Reference
[45] Wu Yonghui, Schuster Mike, Chen Zhifeng, Le Quoc V., Norouzi Mohammad, Macherey Wolfgang, Krikun Maxim, Cao Yuan, Gao Qin, Macherey Klaus, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144. Retrieved from https://arxiv.org/abs/1609.08144.Google Scholar
Reference
[46] Xu Yi, Reynaga Gerardo, Chiasson Sonia, Frahm Jan-Michael, Monrose Fabian, and Oorschot Paul C. Van. 2013. Security analysis and related usability of motion-based captchas: Decoding codewords in motion. IEEE Trans. Depend. Sec. Comput. 11, 5 (2013), 480–493.Google ScholarCross Ref
Reference
[47] Yan Jeff and Ahmad Ahmad Salah El. 2007. Breaking visual captchas with naive pattern recognition algorithms. In Proceedings of the 23rd Annual Computer Security Applications Conference (ACSAC’07). IEEE, 279–291.Google ScholarCross Ref
Reference 1Reference 2
[48] Ye Guixin, Tang Zhanyong, Fang Dingyi, Zhu Zhanxing, Feng Yansong, Xu Pengfei, Chen Xiaojiang, and Wang Zheng. 2018. Yet another text captcha solver: A generative adversarial network based approach. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. ACM, 332–348.Google ScholarDigital Library
Navigate to
Reference 1
Reference 2
Reference 3
Reference 4
Reference 5
Reference 6
Reference 7
Reference 8
Reference 9
Reference 10
[49] Yin Wenpeng, Kann Katharina, Yu Mo, and Schütze Hinrich. 2017. Comparative study of cnn and rnn for natural language processing. arXiv:1702.01923. Retrieved from https://arxiv.org/abs/1702.01923.Google Scholar
Reference

Index Terms

Breaking Captcha System with Minimal Exertion through Deep Learning: Real-time Risk Assessment on Indian Government Websites
1. Security and privacy
  1. Software and application security
    1. Web application security

Published in
Digital Threats: Research and Practice Volume 4, Issue 2
June 2023
344 pages
EISSN:2576-5337
DOI:10.1145/3615671
Editors:
Arun Lakhotia
University of Louisiana at Lafayette and Cythereal, USA
,
Leigh Metcalf
CERT, USA
Copyright © 2023 Copyright held by the owner/author(s).
This work is licensed under a Creative Commons Attribution International 4.0 License.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 10 August 2023
- Online AM: 23 February 2023
- Accepted: 6 January 2023
- Revised: 4 November 2022
- Received: 2 May 2022
Author Tags
Captcha
security vulnerabilities
Risk Assessment
deep learning
RNN
CNN
Qualifiers
- research-article