# Effects of JPEG Compression on Vision Transformer Image Classification for Encryption-then-Compression Images

## Abstract

## 1. Introduction

## 2. Preparation

#### 2.1. Vision Transformer

#### 2.2. Previous Classification Method for EtC Images through Encrypted ViT Model

## 3. Evaluation of JPEG-Compression Effects on the Classification Results

#### 3.1. Overview

#### 3.2. Image Encryption

- Step i-1: Divide an input image into main blocks, and further divide each main block into sub blocks.
- Step i-2: Translocate sub blocks within each main block using ${K}_{1}$.
- Step i-3: Rotate and flip each sub block using ${K}_{2}$.
- Step i-4: Apply a negative–positive transformation to each sub block using ${K}_{3}$.
- Step i-5: Normalize all pixels.
- Step i-6: Shuffle the R, G, and B components in each sub block using ${K}_{4}$.
- Step i-7: Translocate main blocks using ${K}_{5}$.
- Step i-8: Integrate all of the sub and main blocks.
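
The division, translocation, and integration steps (i-1, i-2, i-7, and i-8) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of integer seeds in place of the keys ${K}_{1}$ and ${K}_{5}$, the raster ordering of blocks, and the use of one sub-block permutation shared by all main blocks are all choices made here for illustration.

```python
import numpy as np

def divide(x, s_mb, s_sb):
    """Step i-1: (H, W, 3) -> (N_mb, N_sb, S_sb, S_sb, 3), raster order (assumed)."""
    h, w, c = x.shape
    gh, gw = h // s_mb, w // s_mb          # main-block grid
    k = s_mb // s_sb                       # sub blocks per main-block side
    x = x.reshape(gh, k, s_sb, gw, k, s_sb, c)
    x = x.transpose(0, 3, 1, 4, 2, 5, 6)   # (gh, gw, k, k, s_sb, s_sb, c)
    return x.reshape(gh * gw, k * k, s_sb, s_sb, c)

def integrate(x_sb, h, w, s_mb, s_sb):
    """Step i-8: inverse of divide()."""
    gh, gw, k = h // s_mb, w // s_mb, s_mb // s_sb
    c = x_sb.shape[-1]
    x = x_sb.reshape(gh, gw, k, k, s_sb, s_sb, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)   # (gh, k, s_sb, gw, k, s_sb, c)
    return x.reshape(h, w, c)

def translocate(x_sb, key_sub, key_main):
    """Steps i-2 and i-7, with integer seeds standing in for K1 and K5 (assumed)."""
    perm_sub = np.random.default_rng(key_sub).permutation(x_sb.shape[1])
    perm_main = np.random.default_rng(key_main).permutation(x_sb.shape[0])
    out = x_sb[:, perm_sub]                # same sub-block order in every main block
    out = out[perm_main]                   # reorder the main blocks
    return out, perm_sub, perm_main

# Hypothetical 32x32 image with S_mb = 16 and S_sb = 8.
x = np.random.default_rng(0).integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
x_sb = divide(x, 16, 8)
enc_sb, perm_sub, perm_main = translocate(x_sb, key_sub=1, key_main=5)
x_enc = integrate(enc_sb, 32, 32, 16, 8)
```

Because both permutations are invertible, indexing with `np.argsort(perm_main)` and `np.argsort(perm_sub)` restores the original sub-block image.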

- H and W: the height and width of an image.
- $\mathit{x}\in {\{0,1,\cdots ,255\}}^{H\times W\times 3}$: an input image.
- ${S}_{\mathrm{mb}}$ and ${S}_{\mathrm{sb}}$: the main-block and sub-block sizes.
- ${N}_{\mathrm{mb}}$: the number of main blocks.
- ${N}_{\mathrm{sb}}$: the number of sub blocks within each main block.
- ${\mathit{x}}_{\mathbf{mb}}\in {\{0,1,\cdots ,255\}}^{{N}_{\mathrm{mb}}\times {S}_{\mathrm{mb}}\times {S}_{\mathrm{mb}}\times 3}$: an image after main-block division, called a main-block image.
- ${\mathit{x}}_{\mathbf{sb}}\in {\{0,1,\cdots ,255\}}^{{N}_{\mathrm{mb}}\times {N}_{\mathrm{sb}}\times {S}_{\mathrm{sb}}\times {S}_{\mathrm{sb}}\times 3}$: an image after sub-block division, called a sub-block image.
- ${{\mathit{x}}^{\prime}}_{\mathbf{sb}(\mathbf{\gamma})}\in {\{0,1,\cdots ,255\}}^{{N}_{\mathrm{mb}}\times {N}_{\mathrm{sb}}\times {S}_{\mathrm{sb}}\times {S}_{\mathrm{sb}}\times 3}$: an image after the $\gamma $-th operation in sub-block encryption, where $\gamma \in \{1,2,3,4,5\}$.
- ${{\mathit{x}}^{\prime}}_{\mathbf{sb}}\in {\{0,1,\cdots ,255\}}^{{N}_{\mathrm{mb}}\times {N}_{\mathrm{sb}}\times {S}_{\mathrm{sb}}\times {S}_{\mathrm{sb}}\times 3}$: an image after main-block encryption.
- ${{\mathit{x}}^{\prime}}_{\mathbf{mb}}\in {\{0,1,\cdots ,255\}}^{{N}_{\mathrm{mb}}\times {S}_{\mathrm{mb}}\times {S}_{\mathrm{mb}}\times 3}$: an image after sub-block integration.
- ${\mathit{x}}^{\prime}\in {\{0,1,\cdots ,255\}}^{H\times W\times 3}$: an image after main-block integration, i.e., an EtC image.
- ${x}_{\mathrm{sb}}(m,s,h,w,c)$, ${x}_{\mathrm{sb}(\gamma )}^{\prime}(m,s,h,w,c)$, and ${x}_{\mathrm{sb}}^{\prime}(m,s,h,w,c)$: pixel values in ${\mathit{x}}_{\mathbf{sb}}$, ${{\mathit{x}}^{\prime}}_{\mathbf{sb}(\mathbf{\gamma})}$, and ${{\mathit{x}}^{\prime}}_{\mathbf{sb}}$, respectively.
- $m\in \{1,2,\cdots ,{N}_{\mathrm{mb}}\}$: a main-block number.
- $s\in \{1,2,\cdots ,{N}_{\mathrm{sb}}\}$: a sub-block number in the m-th main block.
- $h\in \{1,2,\cdots ,{S}_{\mathrm{sb}}\}$: a position in the height direction in the s-th sub block.
- $w\in \{1,2,\cdots ,{S}_{\mathrm{sb}}\}$: a position in the width direction in the s-th sub block.
- $c\in \{1,2,3\}$: a color-channel number.
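
With the index convention ${x}_{\mathrm{sb}}(m,s,h,w,c)$ above, the per-sub-block operations of steps i-3, i-4, and i-6 might be sketched as below. Several points are assumptions made for this sketch rather than details taken from the paper: the function name, the mapping from keys to random draws, the negative–positive transformation written as $255-p$, and the choice to apply the same operation to sub-block position $s$ in every main block. Step i-5 (normalization) is omitted because its formula is not stated in this section.

```python
import numpy as np

def encrypt_sub_blocks(x_sb, k2, k3, k4):
    """Apply rotation/flip, negative-positive, and channel shuffle per sub block.

    x_sb has shape (N_mb, N_sb, S_sb, S_sb, 3); k2, k3, k4 are integer seeds
    standing in for the keys K2, K3, K4 (an assumption of this sketch).
    """
    n_sb = x_sb.shape[1]
    rng2, rng3, rng4 = (np.random.default_rng(k) for k in (k2, k3, k4))
    rot = rng2.integers(0, 4, size=n_sb)    # number of 90-degree turns
    flip = rng2.integers(0, 2, size=n_sb)   # 1 = mirror along the width axis
    neg = rng3.integers(0, 2, size=n_sb)    # 1 = invert pixel values
    out = np.empty_like(x_sb)
    for s in range(n_sb):
        block = x_sb[:, s]                              # (N_mb, S_sb, S_sb, 3)
        block = np.rot90(block, k=int(rot[s]), axes=(1, 2))
        if flip[s]:
            block = block[:, :, ::-1]
        if neg[s]:
            block = 255 - block                         # assumed form of step i-4
        block = block[..., rng4.permutation(3)]         # shuffle R, G, B
        out[:, s] = block
    return out

# Hypothetical sub-block image: 4 main blocks, 4 sub blocks of 8x8 pixels each.
x_sb = np.random.default_rng(0).integers(0, 256, size=(4, 4, 8, 8, 3), dtype=np.uint8)
y_sb = encrypt_sub_blocks(x_sb, k2=2, k3=3, k4=4)
```

Each operation only rearranges or inverts values inside a sub block, so the set of pixel values in a sub block is preserved up to inversion.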

#### 3.2.1. Sub-Block Translocation

#### 3.2.2. Block Rotation and Block Flipping

#### 3.2.3. Negative–Positive Transformation

#### 3.2.4. Normalization

#### 3.2.5. Color Component Shuffling

#### 3.2.6. Main-Block Translocation

#### 3.3. Model Encryption

- Step m-1: Transform $\mathbf{E}$ to obtain ${\mathbf{E}}_{\mathbf{sb}}\in {\mathbb{R}}^{{N}_{\mathrm{sb}}\times {S}_{\mathrm{sb}}\times {S}_{\mathrm{sb}}\times 3\times D}$.
- Step m-2: Translocate indices in the first dimension of ${\mathbf{E}}_{\mathbf{sb}}$ using ${K}_{1}$.
- Step m-3: Translocate indices in the second and third dimensions of ${\mathbf{E}}_{\mathbf{sb}}$ using ${K}_{2}$.
- Step m-4: Flip or retain the signs of the elements in ${\mathbf{E}}_{\mathbf{sb}}$ using ${K}_{3}$.
- Step m-5: Translocate indices in the fourth dimension of ${\mathbf{E}}_{\mathbf{sb}}$ using ${K}_{4}$.
- Step m-6: Transform ${\mathbf{E}}_{\mathbf{sb}}$ into the original dimension of $\mathbf{E}$ to derive ${\mathbf{E}}^{\prime}\in {\mathbb{R}}^{(3\cdot {S}_{\mathrm{mb}}\cdot {S}_{\mathrm{mb}})\times D}$.
- Step m-7: Translocate rows in ${\mathbf{E}}_{\mathbf{pos}}$ using ${K}_{5}$ to obtain ${\mathbf{E}}_{\mathrm{pos}}^{\prime}\in {\mathbb{R}}^{({N}_{\mathrm{mb}}+1)\times D}$.
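
To illustrate why the model is encrypted with the same keys as the images, the sketch below covers only steps m-1, m-2, m-6, and m-7 and checks that the embedded tokens of a translocated image equal a reordering of the original tokens. It is a simplified illustration under assumptions made here, not the authors' code: the function names are hypothetical, each patch is assumed to be flattened in $(s,h,w,c)$ order, the class-token entry is assumed to be the first row of ${\mathbf{E}}_{\mathbf{pos}}$ and to stay in place, and the permutations stand in for those derived from ${K}_{1}$ and ${K}_{5}$.

```python
import numpy as np

def encrypt_embedding(E, n_sb, s_sb, perm_sub):
    """Steps m-1, m-2, m-6 restricted to the sub-block permutation."""
    d = E.shape[1]
    E_sb = E.reshape(n_sb, s_sb, s_sb, 3, d)   # step m-1
    E_sb = E_sb[perm_sub]                      # step m-2: first dimension
    return E_sb.reshape(-1, d)                 # step m-6

def encrypt_position(E_pos, perm_main):
    """Step m-7: permute patch rows; row 0 (assumed class token) is kept."""
    return np.concatenate([E_pos[:1], E_pos[1:][perm_main]])

rng = np.random.default_rng(0)
n_mb, n_sb, s_sb, d = 4, 4, 8, 5               # small hypothetical sizes
x_sb = rng.random((n_mb, n_sb, s_sb, s_sb, 3)) # plain sub-block image
E = rng.standard_normal((n_sb * s_sb * s_sb * 3, d))
E_pos = rng.standard_normal((n_mb + 1, d))
perm_sub, perm_main = rng.permutation(n_sb), rng.permutation(n_mb)

# Plain tokens: flatten each main block, embed, add position embeddings.
tokens = x_sb.reshape(n_mb, -1) @ E + E_pos[1:]

# Encrypted image (sub-block and main-block translocation) with encrypted model.
x_enc = x_sb[:, perm_sub][perm_main]
E_enc = encrypt_embedding(E, n_sb, s_sb, perm_sub)
E_pos_enc = encrypt_position(E_pos, perm_main)
tokens_enc = x_enc.reshape(n_mb, -1) @ E_enc + E_pos_enc[1:]

# The encrypted tokens are the plain tokens in permuted order.
print(np.allclose(tokens_enc, tokens[perm_main]))
```

In this sketch the permuted rows of $\mathbf{E}$ cancel the pixel reordering inside each patch, so only the order of the patch tokens changes.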

#### 3.3.1. Index Translocation in the First Dimension

#### 3.3.2. Index Translocation in the Second and Third Dimensions

#### 3.3.3. Sign Flipping

#### 3.3.4. Index Translocation in the Fourth Dimension

#### 3.3.5. Row Translocation

#### 3.4. Evaluation Metrics

## 4. Experiments

#### 4.1. Experimental Setup

#### 4.2. Experimental Results

#### 4.3. Discussion

## 5. Conclusions

## References

**Figure 1.** Overview of ViT [21].

**Figure 2.** Block diagram of the previous method [17].

**Figure 3.** Classification flow of evaluation schemes. (**a**) Classification flow for JPEG-compressed EtC images using the encrypted model trained with plain training images (evaluation scheme A, hereafter). (**b**) Classification flow for JPEG-compressed EtC images using the encrypted model trained with JPEG training images (evaluation scheme B, hereafter).

**Figure 8.** Relationship between the divided image and ViT parameter $\mathbf{E}$. Blue dots represent single pixels in the segmented image, and green dots represent the single rows in $\mathbf{E}$ that correspond to the blue dots.

**Figure 12.** Classification accuracy at each quality factor with and without downsampling (${S}_{\mathrm{sb}}=16$).

Average amount of image data [bpp].

| ${S}_{\mathrm{sb}}$ | Transformation type | JPEG, $Q=100$ | JPEG, $Q=95$ | JPEG, $Q=90$ | JPEG, $Q=85$ | JPEG, $Q=80$ | Linear quantization | No compression |
|---|---|---|---|---|---|---|---|---|
| 8 | Common | 4.19 | 2.08 | 1.47 | 1.20 | 1.04 | 3.00 | 24.00 |
| 8 | Independent | 5.50 | 2.80 | 2.01 | 1.64 | 1.42 | | |
| 16 | Common | 2.98 | 1.57 | 1.13 | 0.93 | 0.82 | | |
| 16 | Independent | 3.49 | 1.64 | 1.18 | 0.98 | 0.87 | | |
| No encryption | | 2.92 | 1.54 | 1.10 | 0.91 | 0.80 | | |

Classification accuracy [%] (change rate [%]).

| ${S}_{\mathrm{sb}}$ | Transformation type | JPEG, $Q=100$ | JPEG, $Q=95$ | JPEG, $Q=90$ | JPEG, $Q=85$ | JPEG, $Q=80$ | Linear quantization | No compression |
|---|---|---|---|---|---|---|---|---|
| 8 | Common | 98.83 (0.20) | 98.83 (0.30) | 98.80 (0.46) | 98.75 (0.61) | 98.71 (0.60) | 33.29 (66.70) | 98.89 (0.00) |
| 8 | Independent | 98.45 (0.99) | 98.33 (1.17) | 98.24 (1.27) | 98.00 (1.45) | 97.67 (1.94) | | |
| 16 | Common | 98.87 (0.12) | 98.89 (0.18) | 98.89 (0.17) | 98.85 (0.19) | 98.86 (0.25) | | |
| 16 | Independent | 98.87 (0.10) | 98.86 (0.17) | 98.89 (0.46) | 98.74 (0.57) | 98.66 (0.66) | | |
| No encryption for images and model | | 98.89 (0.08) | 98.89 (0.10) | 98.81 (0.18) | 98.89 (0.25) | 98.90 (0.23) | | 98.89 (-) |

Classification accuracy [%].

| ${S}_{\mathrm{sb}}$ | Transformation type | JPEG compression | Linear quantization |
|---|---|---|---|
| 8 | Common | 98.84 | 88.20 |
| 8 | Independent | 97.94 | |
| 16 | Common | 98.96 | |
| 16 | Independent | 98.80 | |
| No encryption for images and model | | 98.97 | |

