Applying Visual Cryptography to Enhance Text Captchas

Many applications and websites now use text-based captchas to partially protect their authentication mechanisms. In recent years, however, various methods have been developed to recognize text-based captchas automatically, especially deep learning-based methods such as convolutional neural networks (CNNs). The design of text captchas therefore needs to be strengthened. In this paper, visual cryptography (VC) is applied to design and enhance text-based captchas, exploiting two of its features: the randomness of each encoding process and the recognizability of the result by the naked human eye. Experimental results with two typical deep learning-based attack models indicate the effectiveness of the designed method: with our VC-enhanced text-based captcha (VCETC), the recognition rate of both attack models decreases (for example, from 71.52% to 53.83% for the second model).


Introduction
Nowadays, lots of applications and websites, including Baidu, Sina, Jingdong and many others, utilize text-based captchas to partially protect the authentication mechanism from certain types of attacks [1].
Text-based captchas belong to the family of visual captchas. This type of captcha asks the user to identify characters in an image that is deliberately rendered with distortion and/or noise. Text-based captchas are widely used for three reasons: they are easily understood by most people worldwide, since we are trained from childhood to recognize characters; they have a large brute-force search space; and their generation is easily automated without manual effort [2].
However, 2D text-based captchas are easily broken by machine learning attacks. More importantly, deep learning-based breaking methods [4][5][6], such as convolutional neural networks (CNNs) [6][7][8], pose great challenges for text-based captchas. One typical breaking method [6] consists of a series of blocks: convolution layers, pooling layers and a fully connected layer that is connected to the output layer through an activation function.
Another typical deep learning breaking method, proposed in Reference [5], consists of four steps: captcha synthesis, pre-processing, training and fine-tuning. It uses a classic CNN model, LeNet-5 [9], and performs well on 33 captcha schemes, including 11 schemes used by 32 of the top-50 popular websites, such as Wikipedia, eBay, Microsoft and Google. More importantly, the method of Reference [5] achieves high performance with only a small set of training captchas.
There are few studies on enhanced text-based captchas; the representative ones are as follows. To counter OCR and machine learning attacks, previous text-based captchas mainly use transformations and distortions. Transformations, chiefly clockwise/counterclockwise rotation, translation and scaling, are easy for both humans and computers to solve, so some level of distortion is typically combined with them. The distortions can be elastic deformations of the overall text (global warping) or deformations at the level of individual characters (local warping).
The initial design idea of the popular "reCAPTCHA" was to use distorted strings [10]. The MSN Passport captcha uses more characters, which are heavily warped. Beheshti et al. [11] developed a model that relies on the ability of the Human Visual System (HVS) to superimpose and integrate complex information presented across many frames. It exploits persistence of vision, which lets humans see the text continuously and thus helps to distinguish human users from automated computer bots.
In addition to 2D text-based captchas, 3D text-based captchas have been developed as well. Suzi et al. [12] introduced DotCHA, which is based on human interaction: the user rotates a 3D text model to identify the text, and different letters become visible at different rotation angles. Unfortunately, such captchas may not be easy to generate or to identify.
In short, traditional captcha breaking can be divided into two steps: segmentation and recognition. Accordingly, the traditional captcha enhancement methods mentioned above mainly focus on anti-segmentation and anti-recognition. Anti-segmentation chiefly includes overlapping characters, solid backgrounds, occluding lines, complex backgrounds and so on; anti-recognition includes rotation, distortion, waving, varied font sizes, styles and colors, and so on. However, the recent development of deep learning-based methods threatens the security of text-based captchas. The design of text captchas therefore has to be enhanced in a specific way, which is the motivation of this paper.
In this paper, VC is applied to design and enhance text-based captchas because of two of its features: the randomness of each encoding process in visual cryptography (VC) and the recognizability of the restored image by the naked human eye.
VC [13][14][15] for a (k, n)-threshold encodes a binary secret image into several shadow images, also known as shares or shadows, each of which is random because the encoding process is random. When any k or more shadow images are collected, the secret image is restored by the human eye without any computation (stacking corresponds to a Boolean OR operation only), although some contrast is lost. When fewer than k shadow images are collected, the secret image cannot be restored even with high-performance computing devices and technologies [15,16]. Within VC research, random grids (RG)-based VC [17][18][19] is more feasible, since it requires neither pixel expansion nor basic matrix design. VC can be applied to key management, password transmission [20], identity authentication [21,22], access control, digital watermarking [23][24][25][26], and distributive applications [27][28][29].
Since each shadow image is random, the randomness is preserved to some extent in the restored secret image. This paper applies the randomness of VC and its recognizability by the naked human eye to design and enhance traditional text-based captchas: the randomness is used to resist automatic recognition, while visual recognizability serves human users. Random changes in the text-based captcha may invalidate a deep learning-based attack, which is the key idea of this paper.
In summary, we apply VC to design and enhance traditional text-based captchas by utilizing the randomness of each VC encoding process and the recognizability of the result by the naked human eye. Experimental results with two typical deep learning-based attack models indicate the effectiveness of the designed method. When our VC-enhanced text-based captcha (VCETC) is built on top of a traditional text-based captcha, the recognition rate of the attack models decreases.
The rest of the paper is organized as follows. Section 2 introduces some preliminaries for the designed method. Section 3 presents the designed method in detail. Section 4 focuses on experimental results. Finally, Section 5 concludes the paper.

Preliminaries
In this section, we will introduce some preliminaries for the designed method. We summarize the main notations adopted in the paper in Table 1.

Notations                      Descriptions
(k, n)                         Threshold
0 (resp. 1)                    A white (resp. black) pixel
S                              The secret binary image
S′                             The restored binary secret image
AS_0 (resp. AS_1)              The area of all the white (resp. black) pixels in S
b̄                              A bit-wise complementary operation on a binary pixel b
⊗                              Stacking (OR) operation
⊕                              Boolean XOR operation
SC_1, SC_2, · · · , SC_n       Shadow images generated by VSS schemes
t                              Number of shadow images collected in the recovery phase
α                              Contrast of the restored secret image by stacking recovery
P(x)                           The probability that an event x occurs
A_1 (A_2)                      A model trained with traditional captchas by the first (second) deep learning-based breaking method
B_1 (B_2)                      A model trained with captchas generated by our method by the first (second) deep learning-based breaking method

One Typical Text Captcha Generation Method
A typical enhanced text captcha generation method [1,6] generally combines occluding lines, overlapping characters, English letters and Arabic numerals, complex backgrounds, varied font sizes and colors, rotation, distortion and waving. The results of one traditional text captcha generation method are illustrated in Figure 1. Any existing traditional text captcha generation method can serve as the input of our design; in this paper we use this method as an example. For comparison, the CNN-based breaking methods are trained separately on the captchas generated by the typical method [6] and on those generated by our method.

One Typical Deep Learning Breaking Method
A convolutional neural network (CNN) is a kind of deep learning network that has achieved excellent results in many practical applications, such as image target recognition. One typical CNN architecture [6] is presented in Figure 2. A CNN architecture is divided into a series of blocks. The first kind of block is composed of two types of layers, a convolution layer and a pooling layer; the second is the fully connected layer, which is connected to the output layer through an activation function. The pooling layer applies a pooling operation to the output of the preceding convolution layer. CNNs are still layered networks, but the functions and forms of the layers differ from those of classical networks. CNNs are specifically designed for image recognition: each image is divided into compact topological parts, each of which is processed with filters that search for specific patterns. The first CNN used in this paper includes two convolution layers and two pooling layers.
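The block structure described above (convolution and pooling blocks followed by a fully connected layer and an output activation) can be sketched as a forward pass in plain NumPy. This is only an illustrative sketch, not the network of [6]: the 28 × 28 input, the kernel sizes, the channel counts and the 36 output classes are hypothetical choices, and the weights are random rather than trained.

```python
import numpy as np

def conv2d(x, kernels):
    """Valid cross-correlation. x: (C, H, W); kernels: (F, C, kh, kw)."""
    f, c, kh, kw = kernels.shape
    _, h, w = x.shape
    out = np.zeros((f, h - kh + 1, w - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + kh, j:j + kw]                 # (C, kh, kw)
            out[:, i, j] = np.tensordot(kernels, patch, axes=3)
    return out

def max_pool2x2(x):
    """Non-overlapping 2 x 2 max pooling on a (C, H, W) array."""
    c, h, w = x.shape
    x = x[:, :h - h % 2, :w - w % 2]                         # drop odd border
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(image, params):
    """Two convolution + pooling blocks, then a fully connected layer."""
    x = image[np.newaxis, :, :]                              # one input channel
    x = max_pool2x2(relu(conv2d(x, params["conv1"])))
    x = max_pool2x2(relu(conv2d(x, params["conv2"])))
    flat = x.reshape(-1)
    return softmax(params["fc_w"] @ flat + params["fc_b"])

# Hypothetical sizes: 28 -> conv 24 -> pool 12 -> conv 8 -> pool 4.
rng = np.random.default_rng(0)
params = {
    "conv1": rng.normal(0.0, 0.1, (4, 1, 5, 5)),
    "conv2": rng.normal(0.0, 0.1, (8, 4, 5, 5)),
    "fc_w": rng.normal(0.0, 0.1, (36, 8 * 4 * 4)),          # 36 = 26 letters + 10 digits
    "fc_b": np.zeros(36),
}
probs = forward(rng.random((28, 28)), params)                # class probabilities
```

A real breaking model would be trained with a deep learning framework and would predict every character of the captcha; the sketch only shows how the data flows through the layer types named in the text.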

Another Typical Deep Learning Breaking Method
Another typical deep learning breaking method [5] first learns a captcha synthesizer, which generates synthetic captchas that are used to train a base solver. The base solver is then refined with a few clean captchas to obtain the final, fine-tuned solver. A classic CNN, LeNet-5 [9], is applied. The breaking method performs well on 33 captcha schemes, including 11 schemes used by 32 of the top-50 popular websites, such as Wikipedia, eBay, Microsoft and Google. The significance of Reference [5] is that it achieves high performance with only a small set of training captchas. We use the default parameters of Reference [5] in the experiments of this paper.

VC for (k, n)-threshold
Herein, the symbols ⊕ and ⊗ denote the Boolean XOR and OR operations, and $\bar{b}$ denotes the bit-wise complement of a binary bit $b$. A binary secret image $S$ of size $H \times W$ is split into $n$ shadow images, denoted by $SC_1, SC_2, \cdots, SC_n$. The restored binary secret image $S'$ is obtained by superposing any $t$ ($2 \le t \le n$, $t \in \mathbb{Z}^+$) shadow images.
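$AS_0$ (resp., $AS_1$) is the white (resp., black) area of $S$, that is, $AS_0 = \{(h, w) \mid S(h, w) = 0, 1 \le h \le H, 1 \le w \le W\}$ and $AS_1 = \{(h, w) \mid S(h, w) = 1, 1 \le h \le H, 1 \le w \le W\}$. For any pixel $s$ of $S$, the probability that the pixel is transparent or white (0) is written $P(s = 0)$, and the probability that it is opaque or black (1) is written $P(s = 1)$. In addition, $P(s = 0) = 1 - P(s = 1)$. The image quality of the restored secret image $S'$ is generally evaluated by the contrast, denoted by $\alpha$, as follows [19]: $$\alpha = \frac{P_0 - P_1}{1 + P_1} = \frac{P(S'[AS_0] = 0) - P(S'[AS_1] = 0)}{1 + P(S'[AS_1] = 0)},$$ where $P_1$ is the probability that a pixel in the black area of $S$ is wrongly restored (as white) and $P_0$ is the probability that a pixel in the white area of $S$ is correctly restored.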
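As a rough illustration, the contrast can be estimated from a secret image and a restored image by counting pixels. The sketch below assumes the contrast definition of [19], α = (P_0 − P_1) / (1 + P_1), with P_0 the fraction of white secret pixels restored as white and P_1 the fraction of black secret pixels restored as white; the function name `contrast` is hypothetical.

```python
import numpy as np

def contrast(secret, restored):
    """Empirical contrast alpha = (P0 - P1) / (1 + P1); 0 = white, 1 = black.

    Assumes the secret contains at least one white and one black pixel.
    """
    secret = np.asarray(secret)
    restored = np.asarray(restored)
    p0 = np.mean(restored[secret == 0] == 0)   # white area restored as white
    p1 = np.mean(restored[secret == 1] == 0)   # black area wrongly restored as white
    return (p0 - p1) / (1.0 + p1)

secret = [[0, 0], [1, 1]]
print(contrast(secret, secret))                # lossless restoration -> 1.0
print(contrast(secret, [[0, 1], [1, 1]]))      # half of the white area lost -> 0.5
```

An all-black restored image gives α = 0, which matches the reading that the restored image then carries no information about the secret.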
Contrast is one of the typical metrics [30] for evaluating the image quality of the restored secret image and is adopted in this paper. Contrast determines how well human eyes can recognize the restored secret image. For the clarity corresponding to different contrast values [31], please refer to Figure 3. The contrast lies in [0,1] and should be as large as possible to achieve high image quality: $\alpha = 0$ indicates that $S'$ has no relation to $S$, and $\alpha = 1$ indicates that $S' = S$, that is, $S$ is losslessly restored. The encoding and restoring phases of one typical (2, 2) RG-based VC [19] are reviewed in Algorithm 1.
Step 5: Randomly rearrange $b_1, b_2$ to $SC_1(h, w), SC_2(h, w)$.
Step 6: Output the 2 shadow images $SC_1$ and $SC_2$.
Step 4 guarantees that $S(h, w) = b_1 \oplus b_2$ in Step 3. In Step 5, so that all shadow images are treated equally, the two temporary bits $b_1, b_2$ are randomly rearranged into the corresponding bits of the two shadow images.
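In the restoring phase, $S'(h, w) = SC_1(h, w) \otimes SC_2(h, w) = b_1 \otimes b_2$. Therefore, if $S(h, w) = 1$, then $b_1 \ne b_2$ and the restored pixel is black. If $S(h, w) = 0$, then $b_1 = b_2$ and the restored pixel is black or white with equal probability, because $b_1$ is generated randomly.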
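The (2, 2) behaviour described above can be sketched in a few lines of NumPy. This is a minimal sketch based only on the properties stated in the text (a random bit b_1, the relation S(h, w) = b_1 ⊕ b_2, a random rearrangement of the two bits and restoration by OR); the function names `encode_2_2` and `stack` are hypothetical and the code is not a transcription of Algorithm 1.

```python
import numpy as np

def encode_2_2(secret, rng):
    """Split a binary image (0 = white, 1 = black) into two random shadow images."""
    secret = np.asarray(secret, dtype=np.uint8)
    b1 = rng.integers(0, 2, size=secret.shape, dtype=np.uint8)   # random bit per pixel
    b2 = secret ^ b1                                             # so that S = b1 XOR b2
    swap = rng.integers(0, 2, size=secret.shape).astype(bool)    # random rearrangement
    sc1 = np.where(swap, b2, b1)
    sc2 = np.where(swap, b1, b2)
    return sc1, sc2

def stack(*shadows):
    """Restoring phase: superposing shadow images is a pixel-wise Boolean OR."""
    out = np.zeros_like(shadows[0])
    for s in shadows:
        out |= s
    return out

rng = np.random.default_rng(0)
secret = np.zeros((64, 64), dtype=np.uint8)
secret[:, 32:] = 1                                               # right half is black
sc1, sc2 = encode_2_2(secret, rng)
restored = stack(sc1, sc2)
print(restored[secret == 1].min())    # black pixels are always restored as black
print(restored[secret == 0].mean())   # about half of the white pixels turn black
```

Each shadow image alone is a uniformly random bit pattern, so a new encoding of the same secret yields a different restored image every time.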
After the first k bits are encoded, the last n − k bits can be specially designed to construct VCETC for a (k, n) threshold with comfortable features [32,33].
Based on the above analyses, one RG-based VC for (k, n)-threshold [33] can be derived in Algorithm 2, which will be adopted in our design.
Step 4: Randomly rearrange $b_1, b_2, \cdots, b_n$ to $SC_1(h, w), SC_2(h, w), \cdots, SC_n(h, w)$.
Step 5: Output the $n$ shadow images $SC_1, SC_2, \cdots, SC_n$.
In Step 2, the randomness of each encoding process results in the randomness of each pixel in the shadow images and thus in the randomness of the restored secret image, which is utilized in our designed VCETC to invalidate deep learning-based attacks. In general, randomness is effective at resisting machine learning.
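A (k, n) variant can be sketched along the same lines. Two points in the sketch are assumptions rather than details taken from [33]: the first k bits are generated so that their XOR equals the secret pixel (a direct extension of the (2, 2) case), and the remaining n − k bits simply repeat the first k bits cyclically, as a placeholder for the special design mentioned above. The names `encode_k_n` and `stack` are hypothetical.

```python
import numpy as np

def encode_k_n(secret, k, n, rng):
    """Return an (n, H, W) array of shadow images for a binary secret (1 = black)."""
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= n")
    secret = np.asarray(secret, dtype=np.uint8)
    h, w = secret.shape
    bits = np.empty((n, h, w), dtype=np.uint8)
    # First k - 1 bits are random; the k-th makes the XOR of the first k equal S.
    bits[:k - 1] = rng.integers(0, 2, size=(k - 1, h, w), dtype=np.uint8)
    bits[k - 1] = secret ^ np.bitwise_xor.reduce(bits[:k - 1], axis=0)
    # ASSUMPTION: fill the last n - k bits by repeating the first k bits.
    for i in range(k, n):
        bits[i] = bits[i - k]
    # Randomly rearrange the n bits of every pixel among the n shadow images.
    order = np.argsort(rng.random((n, h, w)), axis=0)
    return np.take_along_axis(bits, order, axis=0)

def stack(shadows):
    """Superpose shadow images with a pixel-wise Boolean OR."""
    return np.bitwise_or.reduce(shadows, axis=0)

rng = np.random.default_rng(0)
secret = (np.arange(64).reshape(8, 8) % 3 == 0).astype(np.uint8)
shadows = encode_k_n(secret, k=2, n=4, rng=rng)
restored = stack(shadows[[0, 2, 3]])          # stack any t >= k shadow images
```

Because both the bit values and their assignment to shadow images are redrawn on every call, two encodings of the same secret produce different restored images, which is the property the designed method relies on.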

The Designed Method
The architecture of the designed VC-enhanced text-based captcha (VCETC) is exhibited in Figure 4, and its algorithmic steps are given in Algorithm 3.

Algorithm 3: The designed VCETC
Input: The text content to appear in the captcha; a pre-existing traditional captcha generation method; a VC candidate pool, that is, {VC_c, k, n, t}, for c = 1, 2, · · · , C.
Output: The VC-enhanced text-based captcha S′.
Step 1: Use the pre-existing traditional captcha generation method to generate a temporary captcha, denoted by S_0, from the text content.
Step 2: Convert S_0 into a binary image with an automatic threshold to obtain S_1.
Step 3: Randomly pick one VC method from the VC candidate pool. Use the VC method to encode S_1 to obtain n shadow images SC_1, SC_2, · · · , SC_n.
Step 4: Randomly choose t shadow images, denoted by SC_i1, SC_i2, · · · , SC_it, from all the n shadow images, to restore S′ by the superposing (⊗) operation, where t ≥ k.
Step 5: Output the VC-enhanced text-based captcha S′.

The basic idea of the designed method is further analyzed as follows:
• A pre-existing traditional captcha generation method is taken as input, and our method improves on its output. Thus, any other pre-existing traditional text-based captcha generation method can also be used as input; our method is an enhancement built on a pre-existing generation method rather than a redesign.
• In Step 3, the randomness of the selected VC method is applied to the captcha S′. In addition, the random selection of the VC method and of t further increases the randomness of the captcha S′.
• The VC candidate pool, that is, {VC_c, k, n, t}, for c = 1, 2, · · · , C, can be set up by screening possible VC schemes and their parameters k, n, t for those whose contrast value is in [0.14, 0.36], where 0.14 is derived from the clarity levels in Figure 3 for human recognition and 0.36 is given by our experiments.
• Alternatively, we can first encode the text captcha with the VC method and then process the temporary captcha with the pre-existing traditional captcha generation method.
• In Step 4, we directly stack the selected t shadow images. We can further increase the randomness by performing dynamic stacking from random angles with random velocities, like the gif in Reference [34].
• Some other text-based applications can apply our method as well.
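The steps of Algorithm 3 can be sketched end to end as follows. Several parts are stand-ins and should be read as assumptions: Step 1 is replaced by a synthetic grayscale image, the automatic threshold of Step 2 is approximated by the mean intensity, `rg_encode` is a hypothetical RG-based encoder whose fill rule for the last n − k bits is a placeholder, and the pool entries are example parameters that have not been screened by contrast.

```python
import numpy as np

def binarize(gray):
    """Step 2 stand-in: threshold at the mean intensity (1 = black, 0 = white)."""
    gray = np.asarray(gray, dtype=float)
    return (gray < gray.mean()).astype(np.uint8)

def rg_encode(secret, k, n, rng):
    """Hypothetical RG-based (k, n) encoder returning an (n, H, W) array."""
    h, w = secret.shape
    bits = np.empty((n, h, w), dtype=np.uint8)
    bits[:k - 1] = rng.integers(0, 2, size=(k - 1, h, w), dtype=np.uint8)
    bits[k - 1] = secret ^ np.bitwise_xor.reduce(bits[:k - 1], axis=0)
    for i in range(k, n):                        # ASSUMPTION: repeat the first k bits
        bits[i] = bits[i - k]
    order = np.argsort(rng.random((n, h, w)), axis=0)
    return np.take_along_axis(bits, order, axis=0)

def vcetc(gray_captcha, pool, rng):
    """Steps 2 to 5 of the designed method applied to a temporary captcha S_0."""
    s1 = binarize(gray_captcha)                              # Step 2
    encode, k, n, t = pool[int(rng.integers(len(pool)))]     # Step 3: pick a VC method
    shadows = encode(s1, k, n, rng)
    chosen = rng.choice(n, size=t, replace=False)            # Step 4: pick t shadows
    return np.bitwise_or.reduce(shadows[chosen], axis=0)     # Steps 4-5: stack, output

# Step 1 stand-in: a dark bar on a light background instead of a real captcha.
rng = np.random.default_rng(1)
s0 = np.full((40, 120), 255.0)
s0[10:30, 20:100] = 0.0
pool = [(rg_encode, 2, 2, 2), (rg_encode, 2, 3, 2), (rg_encode, 2, 3, 3)]
captcha = vcetc(s0, pool, rng)                   # binary VC-enhanced image
```

In practice `s0` would come from the existing captcha generator, so the enhancement can be added as a post-processing stage without changing that generator.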

Experiments and Comparisons
In this section, experiments are performed to illustrate the effectiveness of the designed method. In the following experiments, we adopt the traditional typical captcha generation method [6], a VC scheme and the deep learning-based breaking methods described in Section 2.
To show the effectiveness and advantage of our method, the two CNN-based breaking methods are trained separately on the captchas generated by the traditional typical method [6] and on those generated by our method.

Breaking Traditionally Generated Captchas by Deep Learning Way
Some captchas generated by the traditional captcha generation method are illustrated in Figure 5.

The First Deep Learning Way
The training set contains 100,000 captchas and the testing set contains 10,000 captchas, all generated by the traditional captcha generation method. First, a model, denoted by $A_1$, is trained on the training set by the first deep learning-based breaking method; then $A_1$ is used to recognize the testing captchas. The success rate on these captchas is 95%, which indicates that the traditional captchas are easily broken by the deep learning-based breaking method.

The Second Deep Learning Way
The training set contains 500 captchas and the testing set contains 4540 captchas, all generated by the traditional captcha generation method. First, a model, denoted by $A_2$, is trained on the training set by the second deep learning-based breaking method; then $A_2$ is used to recognize the testing captchas. The success rate on these captchas is 71.52%, which indicates that the traditional captchas are also easily broken by this method. Note that the second deep learning way uses only 500 captchas for training, far fewer than the 100,000 used by the first.

Our Designed Captcha Test by Deep Learning Way
This part proceeds in four steps. First, we illustrate the captchas output by our method. Second, we use $A_i$, $i = 1, 2$, to recognize the testing captchas generated by our method. Third, we add captchas generated by our method to the training set to obtain a second model, denoted by $B_i$, $i = 1, 2$, which is then used to recognize the testing captchas generated by our method. Finally, we evaluate the recognition rates of the traditional captchas and of our designed captchas with the naked human eye.
Some captchas generated by the proposed method are illustrated in Figure 6, where k = 2, n = 5, t = 3; most of them can be recognized with the naked human eye.

The First Deep Learning Way
We use $A_1$ to recognize 10,000 testing captchas generated by our method. Then, 8000 captchas generated by our method are added to the training set to obtain the second model, denoted by $B_1$, which is used to recognize 2000 testing captchas generated by our method.
When $A_1$ is applied to the captchas in Figure 6, the success rate is 0. When 8000 captchas generated by our method are added to the training set to train $B_1$, the training process is as presented in Figure 7. According to Figure 7, the loss does not decrease consistently and the training recognition rate stays at about 8.7%. Thus, with the motivating rate set to 80%, training does not converge once VCETCs are added to the training process, and the training recognition rate only reaches about 8.7%.

The Second Deep Learning Way
Five hundred captchas generated by our method are used as the training set to obtain the second model, denoted by $B_2$, which is used to recognize 4540 testing captchas generated by our method. The success rate in recognizing the captchas generated by our method is 53.83%.
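To make the size of this effect concrete, the two success rates reported for the second breaking method (71.52% on traditional captchas and 53.83% on captchas generated by our method, both with 500 training and 4540 testing captchas) can be compared directly. The short calculation below only restates these reported figures as an absolute and a relative decrease.

```latex
\[
\Delta = 71.52\% - 53.83\% = 17.69 \text{ percentage points},
\qquad
\frac{\Delta}{71.52\%} = \frac{17.69}{71.52} \approx 24.7\%.
\]
```

In other words, under the same training budget, the second attack model recognizes roughly one quarter fewer captchas once they are VC-enhanced.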

The Subjective Recognized Rates with Human Naked Eyes
Finally, the subjective recognition rates of the traditional captchas and of our designed captchas with the naked human eye are presented in Table 2. In the testing experiments, we invited 20 male volunteers (students aged 22 to 35 with no prior experience of the two types of captchas) to take part in the subjective evaluation. Fifty captchas with the same content were randomly picked from the traditional captchas and from our designed captchas, respectively. Every volunteer tested all 50 captchas independently. After each captcha was viewed and recognized, we recorded the recognition result and the recognition speed to obtain Table 2. We find that, although the success rate decreases, user experience is only slightly affected.

Brief Summary of the Experiments
A brief summary of the experiments is given in Table 3. With our designed VCETC, the recognition rates decrease, and the size of the decrease differs between the two deep learning ways. In addition, we present the experimental time complexity of the proposed framework in Table 4. The testing environment is as follows: Windows 7, Intel Core i7 CPU, 16 GB DDR4 RAM, 1 TB 7200 rpm hard disk, MATLAB 2018a. To generate each captcha, the proposed method needs 0.0521 s more than the traditional captcha generation method, which is acceptable. Based on the above experiments, we conclude and analyze as follows.
• Owing to the randomness of each encoding process in VC and its recognizability by the naked human eye, our designed VCETC can to some degree enhance traditional captchas against some deep learning-based ways, even when VCETCs are used as the training set.
• Owing to their recognizability by the naked human eye, VCETCs are suitable for human users.
• According to the subjective test, our designed VCETC only slightly affects user experience. It also lowers storage needs: a binary captcha requires less storage space and transmission bandwidth than a color one.

Use-Case Scenario
A use-case scenario of our designed VCETC is as follows. For a website that already serves static text captchas such as those in Figure 5, our method can be added directly to the generation process of the original text captchas to produce enhanced text captchas such as those in Figure 6. In this way, the designed VCETC can be deployed without replacing the existing generator.

Conclusions
This paper designed a new visual cryptography (VC)-enhanced text-based captcha (VCETC), in which the randomness of each encoding process in VC and its recognizability by the naked human eye are used to counter the automatic recognition of text-based captchas by some deep learning-based ways. Experiments validate that the recognition rate decreases when our designed VCETC is used, while the user experience remains acceptable. In the future, we will (1) test more typical and improved deep learning-based ways, (2) explore better ways of applying VC to captchas, and (3) apply VC to other fields, such as enhancing text-based identification to resist automatic recognition and keyword-based secret information to resist automatic monitoring.
More specifically, we may further extend our work in the following ways.
• Many practically oriented programs solve captchas in order to circumvent the human participation expected by a website without relying on CNNs, such as "Universal Share Downloader" (USD), which uses plugins and direct optical character recognition (OCR) to recognize some typical captchas. Owing to the randomness of each encoding process in VC, our method may enhance text captchas against such programs as well.
• To further improve our method, we may use recommendation mechanisms to recommend text-based captchas suited to the user's characteristics and profile [36] and to individual differences in cognitive processing [37].
• We will provide additional information and discussion to elaborate on the use-case scenario and on how we envision including recommender systems.
• Our method can incorporate many dynamic mechanisms to further improve its performance.
• Other recent attempts to improve text-based captchas have been proposed in the scientific literature as well. We will compare our method with more state-of-the-art enhancement methods.