Document Liveness Challenge Dataset (DLC-2021)

Various government and commercial services, including, but not limited to, e-government, fintech, banking, and sharing economy services, widely use smartphones to simplify service access and user authorization. Many organizations involved in these areas use identity document analysis systems in order to improve user personal-data-input processes. The tasks of such systems include not only ID document data recognition and extraction but also fraud prevention by detecting document forgery or by checking whether the document is genuine. Modern systems of this kind are often expected to operate in unconstrained environments. A significant amount of research has been published on the topic of mobile ID document analysis, but the main difficulty for such research is the lack of public datasets, since the subject is protected by security requirements. In this paper, we present the DLC-2021 dataset, which consists of 1424 video clips captured in a wide range of real-world conditions and focused on tasks relating to ID document forensics. The novelty of the dataset is that it contains video clips of color laminated mock ID documents, color unlaminated copies, grayscale unlaminated copies, and screen recaptures of the documents. The proposed dataset complies with the GDPR because it contains images of synthetic IDs with generated owner photos and artificial personal information. For the presented dataset, benchmark baselines are provided for tasks such as screen recapture detection and glare detection. The data presented are openly available on Zenodo.


Introduction
The growing popularity of mobile services increases the risk of financial and other losses from fraudulent user actions. To reduce the number of illegal actions and comply with the law when using mobile services, users are often required to present their identity documents. In the case of remote access via a mobile device, this means receiving and analyzing identity (ID) document images. ID document recognition systems [1,2] are widely used to obtain and check users' personal information in many applications. At the same time, despite a large number of publications on the topic of ID document recognition, due to legal and ethical restrictions, researchers are constrained by a lack of open datasets [3] that can be used to reproduce and compare results. The absence of open datasets for ID document fraud prevention research inspired us to create a new dataset, called DLC-2021 [4][5][6]. It can be used to establish an evaluation methodology and set up baselines for document image recapture detection, document photocopy detection, and document lamination detection methods.

Overview
The GDPR [7] and other local laws prohibit the creation of datasets with real ID images. Thus, researchers began to use artificially generated ID document images for open dataset creation [8][9][10][11][12]. As far as we know, printed mock documents are used only in the MIDV family of datasets, of which MIDV-500 was the first [8]. This dataset contained 500 video clips of 50 identity documents, with 10 clips per document type. The identity documents were of different types, and were mostly "sample" or "specimen" documents that could be found in WikiMedia and were distributed under public copyright licenses. The conditions represented in MIDV-500 thus had some diversity regarding the background and the positioning of the document in relation to the mobile capturing process; however, they did not include variation in lighting conditions, or significant projective distortions. MIDV-2019 [9] was later published as an extension of MIDV-500. It contained video clips captured under very low lighting conditions and with higher projective distortions. The dataset was also supplemented with photos and scanned images of the same document types to represent the typical input for server-side identity document analysis systems. MIDV-2020 [10] was published recently to provide variability in the text fields, faces, and signatures, while retaining the realism of the dataset. The MIDV-2020 dataset consists of 1000 different physical documents (100 documents per type), all with unique, artificially generated faces, signatures, and text field data. Each physical document was photographed and scanned, and for each a video clip was captured using a smartphone. The ground truth includes ideal text field values, and the geometrical position of documents and faces in each photo, scan, and video clip frame (with 10 frames-per-second annotation). MIDV-LAIT [11] contains video clips of ID documents with textual fields in Perso-Arabic, Thai, and Indian scripts.
When using mobile-based ID document recognition systems, the most technically simple and accessible attack methods are different types of rebroadcast attacks [13]. For the DLC-2021 dataset we shot mock documents from the MIDV-2020 collection as originals (Figure 1a) and modeled those types of attacks that remain realistic when using mock documents: capturing a color printed copy of a document without lamination (Figure 1b), capturing a gray printed unlaminated copy of a document (Figure 1c), and capturing a displayed image of a document (Figure 1d). Thus, all images in the MIDV family of datasets [8][9][10][11] can be considered as images of genuine documents and can be used as negative samples for document fraud detectors.
Document-specific methods for detecting document recapture are based on the latest advances in deep learning. The algorithm proposed in [25] takes advantage of both metric learning and image forensic techniques. The authors considered practical domain generalization problems, such as variations in printing/imaging devices, substrates, recapturing channels, and document types, using a private dataset. The texture and reflectance characteristics of the bronzing region are used as discriminative features to detect a recaptured certificate document in [26]. The dataset used in that study is available upon request.
Thus, for research in the field of document recapture prevention, new specialized open datasets captured with smartphones are required.

DLC-2021 Dataset Description
The set of 10 ID document types for DLC-2021 (Table 1) coincides with the set of document types in the MIDV-2020 dataset. For each type of document, eight examples of physical documents were taken. For selected physical documents, color and gray paper hard copies were made by printing without lamination. All color copies and some of the gray copies were cut to fit the original document page shape.
While preparing the DLC-2021 dataset, we focused on video capture. On the one hand, the video stream allows for the analysis of changes over time, which provides much more information for assessing liveness. On the other hand, video frames usually contain compression artifacts that can significantly affect the performance of analysis algorithms.
In general, DLC-2021 follows the folder and file structure of MIDV-2020, except for clip names. In DLC-2021, each clip name consists of the two-digit document template number extended with a two-letter video type code (Table 2) and a four-digit serial number. An Apple iPhone XR and a Samsung S10 were used for video capturing, as in MIDV-2020. Video clips were shot with a wide-angle camera (Table 3) using a standard smartphone camera application. To make the videos more varied, we used two different frame resolutions (1080 × 1920, 2160 × 3840) and two different frame rates (30, 60 fps) for shooting video clips. Table 4 summarizes the number of video clips by type. Each clip was shot vertically and was at least five seconds long. Frames were extracted at 10 frames per second using ffmpeg version n4.4 with default parameters, and for the first 50 extracted frames the document position was manually annotated. The annotation file for each clip followed the MIDV-2020 JSON format [10] and was readable with VGG Image Annotator (v2.0.11) [29].
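The exact ffmpeg invocation and output naming scheme are not specified in the text; a minimal sketch of the 10 fps frame extraction, assuming default parameters and PNG output (the `extract_frames` helper and file naming are illustrative, not part of the dataset tooling), could look as follows:

```python
import subprocess
from pathlib import Path

def extract_frames(clip_path, out_dir, fps=10, run=False):
    """Build (and optionally run) an ffmpeg command that extracts
    frames at a fixed rate, mirroring the 10 fps sampling used
    for the DLC-2021 annotation."""
    cmd = [
        "ffmpeg",
        "-i", str(clip_path),                 # input video clip
        "-vf", f"fps={fps}",                  # resample to the target frame rate
        str(Path(out_dir) / "%04d.png"),      # zero-padded frame file names
    ]
    if run:
        Path(out_dir).mkdir(parents=True, exist_ok=True)
        subprocess.run(cmd, check=True)
    return cmd
```

With `run=True`, the helper writes one image per extracted frame into `out_dir`; with the default `run=False`, it only returns the command for inspection.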

Paper Document Shooting
We captured video with the "original" documents and printed copies under different lighting conditions, such as natural daylight, bright light with deep shadows, artificial light, colored light, low light, and flashlight. The color characteristics of document images varied significantly under different lighting and capture conditions (Figure 2).
Low or uneven lighting, as well as white balance correction algorithms inappropriate for the lighting conditions, dramatically affects color reproduction and complicates the process of distinguishing color documents from gray copies (Figure 3) without specialized color correction algorithms, such as that in [30].
To achieve greater realism of the video, various document occlusions were made on some of the clips (Figure 4), such as holding the document with fingers, and a brightly colored object in the document area. In the task of detecting gray copies, such partial occlusion can create additional difficulties, as it can lead to an increase in the color diversity of pixels in the document area.
Since ID documents are used regularly, manufacturers protect them from dirt, creases, and other damage by using a special protective coating or lamination. Such a coating preserves the integrity of documents from various environmental influences for a long time and also significantly complicates attempts to change the content of the document, such as replacing a photo. However, laminated documents easily introduce reflection and saturation phenomena, especially when a strong illuminant such as a flash, a fluorescent lamp, or even the sun lights the document during the video-capturing process. Figure 5 shows some images extracted from a video captured with a smartphone. Strong reflections on the smooth surface of a laminated document can partially or totally hide its content, making it impossible to analyze the pictures or to extract the text. In addition, the shape and size of the area of reflection may vary depending on the orientation of the document relative to the smartphone lens. On the one hand, these variations are challenging for detection, segmentation, and recognition algorithms. On the other hand, the analysis of the shape and consistency of changes in highlights and scene geometry can serve as an important indicator of the liveness of a document. For example, exploiting the camera flashlight during the capture process creates semi-controlled lighting conditions in which laminated and unlaminated documents can, in some cases, be differentiated more robustly (Figure 6).

Screens Shooting
For screen recapture, we used two office desktops and two notebook LCD monitors. Figure 7 shows samples from the template image and video for the original and screen-recaptured cases. It should be noted that the documents themselves may have a complex textured page background, for example, when document-protection technologies such as guilloche are used. Another interesting case is textured scene objects, or even an LCD screen behind the document. In such cases, moiré and other recapture artifacts can also occur outside the document zone when the original document is captured with a digital camera.
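The moiré patterns mentioned above manifest as strong periodic high-frequency components in the image spectrum. As a toy illustration (not a method from the paper; the `high_frequency_energy_ratio` function and its threshold radius are assumptions for demonstration only), one can measure the share of spectral energy outside a low-frequency disc:

```python
import numpy as np

def high_frequency_energy_ratio(gray, radius_frac=0.25):
    """Crude moire cue: fraction of spectral energy outside a central
    low-frequency disc. Recaptured LCD images tend to contain strong
    periodic high-frequency components from the screen pixel grid."""
    f = np.fft.fftshift(np.fft.fft2(gray.astype(np.float64)))
    power = np.abs(f) ** 2
    h, w = gray.shape
    yy, xx = np.ogrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)            # distance from DC
    low = r <= radius_frac * min(h, w) / 2          # low-frequency disc
    total = power.sum()
    return float(power[~low].sum() / total) if total > 0 else 0.0
```

A flat image scores near 0, while a fine stripe pattern (a stand-in for moiré) pushes the ratio towards 1; real detectors, of course, need far more robust features.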

Experimental Baselines
While the main goal of this paper is to present the document liveness challenge dataset DLC-2021, the following sections present several benchmarks using DLC-2021 in order to provide baselines for future research involving the dataset. As the baseline method we chose Convolutional Neural Networks (CNNs), because CNNs show state-of-the-art results in image classification tasks. In our experiments we used the Keras (2.6.0) library from the TensorFlow (2.8.0) [31] framework and the Scikit-learn (1.2.0) library [32]. Scripts, instructions, and pre-trained models to reproduce our experiments can be downloaded from [4].

Screen Recapture Detection
For screen recapture detection, we used a classification CNN based on the ResNet-50 architecture [33], initialized with ImageNet pre-trained weights from the TensorFlow Model Garden. We froze the first 49 layers and reduced the number of outputs of the last softmax layer to 2. For training, we used the binary cross-entropy loss function and the Adam [34] optimizer with a constant learning rate (lr = 0.1).
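A minimal Keras sketch of this setup follows; the input size, the replaced classification head, and the exact frozen-layer boundary are assumptions of the sketch, not a reproduction of the authors' released model (which is available from [4]):

```python
import tensorflow as tf

def build_recapture_detector(weights=None):
    """Sketch of the screen recapture classifier: a ResNet-50 backbone
    with a two-way softmax head. Pass weights="imagenet" to use the
    pre-trained initialization described in the paper."""
    base = tf.keras.applications.ResNet50(
        weights=weights, include_top=False, pooling="avg",
        input_shape=(224, 224, 3))
    # Freeze the early layers; the paper freezes the first 49 layers.
    for layer in base.layers[:49]:
        layer.trainable = False
    # Two softmax outputs replace the original 1000-class head.
    outputs = tf.keras.layers.Dense(2, activation="softmax")(base.output)
    model = tf.keras.Model(base.input, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
        loss="binary_crossentropy",   # loss and lr as stated in the paper
        metrics=["accuracy"])
    return model
```

With `weights=None` the model builds offline, which is convenient for smoke-testing the architecture before downloading the ImageNet weights.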
Table 5 shows the results on the validation dataset for the CNN-based detector and Scikit-learn Dummy Classifier detectors with different strategies: "constant" (generates a constant prediction), "stratified" (generates predictions with respect to the balance of training set classes), and "uniform" (generates predictions uniformly at random). Results for the "stratified" and "uniform" strategies were averaged over 10 runs with different seed values, and the standard deviation values are shown in the table. Most of the false-positive (FP) errors were caused by documents having complex textured backgrounds and compression artifacts, as shown in Figure 8.
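The three dummy strategies can be reproduced directly with Scikit-learn; the toy labels below are illustrative and do not reflect the real class balance of the DLC-2021 splits:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy stand-in for (genuine, recaptured) labels: 80/20 class balance.
X = np.zeros((100, 1))
y = np.array([0] * 80 + [1] * 20)

for strategy, kwargs in [
    ("constant", {"constant": 0}),   # always predicts one fixed class
    ("stratified", {}),              # samples labels from the class prior
    ("uniform", {}),                 # predicts classes uniformly at random
]:
    clf = DummyClassifier(strategy=strategy, random_state=0, **kwargs)
    clf.fit(X, y)
    print(strategy, clf.score(X, y))
```

On such an imbalanced toy set, the "constant" baseline already reaches the majority-class accuracy, which is why dummy baselines are a useful sanity check for the CNN results.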

Unlaminated Color Copy Detection
The presence of glare is the most evident feature of laminated documents. The unlaminated color copy detector classifies document images that were projectively rectified using the frame markup and then scaled down. The ResNet-50-based CNN detector showed a steady trend of overfitting, so a simpler architecture, presented in Table 6, was used. The CNN-based detector was trained on gray images scaled down to 76 × 76 with the binary cross-entropy loss function and the Adam optimizer (learning rate = 0.05). Early stopping and data augmentation (brightness distortion within the range [0.9, 1.1]) were used to avoid overfitting.
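An illustrative Keras sketch of this training setup is given below; the layer configuration is a stand-in (the actual architecture is specified in Table 6), while the input size, loss, learning rate, early stopping, and brightness range follow the text:

```python
import tensorflow as tf

def build_copy_detector(input_size=76):
    """Small CNN for 76x76 grayscale inputs with a sigmoid output for
    binary cross-entropy. The conv/pool stack here is illustrative only;
    see Table 6 of the paper for the actual architecture."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input((input_size, input_size, 1)),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.05),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Overfitting countermeasures as described in the paper:
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", restore_best_weights=True)
augment = tf.keras.preprocessing.image.ImageDataGenerator(
    brightness_range=[0.9, 1.1])   # brightness distortion range from the text
```

The `early_stop` callback and `augment` generator would be passed to `model.fit` during training.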
The training dataset was collected from manually labeled MIDV-500 and MIDV-2020 images and contained 29,564 positive and 7544 negative samples. The validation dataset was collected from manually labeled DLC-2021 clip images (types or and cc) and contained 34,607 positive and 3388 negative samples. Table 7 shows the results on the validation dataset for the CNN-based and Scikit-learn Dummy Classifier detectors.

Gray Copy Detection
Projectively rectified document images were used for classification. Positive samples in the training set were collected from gray copy clips of Azerbaijani passports, Finnish ID cards, and Serbian passports. Negative samples in the training set were obtained from MIDV-2020. The training set contained 3492 positive and 1000 negative samples. The validation set contained gray copy clips for all other document types and original document clips from DLC-2021 (10,473 positive and 16,264 negative samples).
All experiments with ResNet-50-like models (similar to Section 4.1) and simpler CNN models (similar to Section 4.2) failed: the models either did not train at all or overfitted. One reason for this result is that CNNs are sensitive to intensity gradient features but tend to ignore color features. Since the development of a more sophisticated CNN architecture is beyond the scope of this article, as a simple baseline we evaluated the Scikit-learn Dummy Classifier on the validation dataset, as shown in Table 8.
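To make the point about color features concrete, a trivial hand-crafted color cue can separate gray copies from color originals in the easy cases. The `channel_spread_score` function below is a toy illustration introduced here, not a method from the paper, and ignores the lighting and white-balance complications discussed in Section 3:

```python
import numpy as np

def channel_spread_score(rgb):
    """Toy color cue: mean per-pixel spread between the max and min RGB
    channels. A gray copy yields nearly equal channels everywhere, hence
    a score near 0; a color original scores noticeably higher."""
    rgb = rgb.astype(np.float64)
    spread = rgb.max(axis=-1) - rgb.min(axis=-1)
    return float(spread.mean())
```

Such explicit color statistics could complement a CNN that otherwise latches onto intensity gradients.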

Conclusions
In this paper, we presented the DLC-2021 dataset, containing video clips of mock "real" identity documents from the MIDV-2020 collection and three types of popular rebroadcast attacks: capturing a color printed copy of a document without lamination, capturing a gray printed unlaminated copy of a document, and capturing a displayed image of a document. Video was captured using modern smartphones at different video quality settings, and a wide range of real-world capturing conditions was simulated. Selected video frames were accompanied by geometric markup of the outer borders of the document.
Using mock documents from the MIDV-2020 collection as targets for shooting DLC-2021 video makes it easy to use field values and document geometry markup from MIDV-2020 templates. The prepared open dataset can be used for other ID-recognition tasks:
- Document detection and localization in the image [35][36][37];
- Document type identification [35,37];
- Document layout analysis;
- Detection of faces in document images [38] and the choice of the best photo of the document owner [39];
- Integration of the recognition results [40];
- Video frame quality assessment [41] and the choice of the best frame [42].
As the videos were captured with two different smartphones, the DLC-2021 dataset can also be used for the analysis of sensor-noise (PRNU)-based methods.
In the future, we plan to expand the DLC dataset with more screen types and devices for shooting, as well as increase the variety of document types.
Regarding ethical AI, the published dataset has no potential to affect the privacy of individuals regarding personal data, since all documents are synthetic mock-ups and comply with the GDPR.
The authors believe that the provided dataset will serve as a valuable resource for ID document recognition and ID document fraud prevention, and lead to more high-quality scientific publications in the field of ID document analysis, as well as in the general field of computer vision.

Acknowledgments: All source images for the MIDV-2020 dataset were obtained from Wikimedia Commons. Author attributions for each source image are listed in the original MIDV-500 description table (ftp://smartengines.com/midv-500/documents.pdf). Face images by Generated Photos (https://generated.photos).