# Privacy-Preserving Semantic Segmentation Using Vision Transformer

## Abstract

## 1. Introduction

- We propose, for the first time, the combined use of encrypted images and models in a semantic segmentation task to protect visually sensitive information in input images.
- We confirm that the proposed method not only achieves the same accuracy as when images are not encrypted but also allows a secret key to be updated easily.

## 2. Related Work

#### 2.1. Privacy-Preserving DNNs

#### 2.2. Learnable Image Encryption for Machine Learning

#### 2.3. Segmentation Transformer

## 3. Proposed Method

#### 3.1. Overview and Threat Model

#### 3.2. Encryption Method

**E** and ${\mathbf{E}}_{\mathbf{pos}}$ are determined by training a model with plain images. Using a trained model ${\psi}_{\theta}$, a segmentation map $y$ is given by

#### 3.2.1. Model Encryption

**E** is transformed with key $K$ after training a model as follows.

- Randomly generate a matrix ${\mathbf{E}}_{\mathbf{enc}}$ with key $K$ as$${\mathbf{E}}_{\mathbf{enc}}=\left[\begin{array}{cccc}{k}_{(1,1)}& {k}_{(1,2)}& \cdots & {k}_{(1,L)}\\ {k}_{(2,1)}& {k}_{(2,2)}& \cdots & {k}_{(2,L)}\\ \vdots & \vdots & \ddots & \vdots \\ {k}_{(L,1)}& {k}_{(L,2)}& \cdots & {k}_{(L,L)}\end{array}\right],\quad {k}_{(i,j)}\in \mathbb{R},\ i,j\in \{1,\cdots ,L\}.$$
- Multiply ${\mathbf{E}}_{\mathbf{enc}}$ and **E** to obtain $\widehat{\mathbf{E}}$ as$$\widehat{\mathbf{E}}={\mathbf{E}}_{\mathbf{enc}}\mathbf{E},\quad \widehat{\mathbf{E}}\in {\mathbb{R}}^{L\times D}. \tag{3}$$
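
The two steps above can be sketched in NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the use of an integer seed as key $K$, the Gaussian distribution of the entries ${k}_{(i,j)}$, and the toy sizes are all hypothetical choices.

```python
import numpy as np

def generate_key_matrix(key: int, L: int) -> np.ndarray:
    """Draw an L x L real matrix E_enc from a generator seeded with `key`.

    Assumption: the key K is an integer seed and entries are standard normal.
    """
    rng = np.random.default_rng(key)
    return rng.standard_normal((L, L))

def encrypt_embedding(E: np.ndarray, E_enc: np.ndarray) -> np.ndarray:
    """Return E_hat = E_enc E, which keeps the L x D shape of E."""
    return E_enc @ E

L, D = 12, 8  # toy sizes chosen for the example
E = np.random.default_rng(0).standard_normal((L, D))  # stand-in for a trained E
E_enc = generate_key_matrix(key=1234, L=L)
E_hat = encrypt_embedding(E, E_enc)
```

Because the generator is seeded, the same key always reproduces the same ${\mathbf{E}}_{\mathbf{enc}}$, so only the key needs to be stored in this sketch.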

#### 3.2.2. Example of ${\mathbf{E}}_{\mathbf{enc}}$

- Generate a random integer vector with a length of $L$ by using a random generator with a seed value as$${l}_{enc}=[{l}_{e}(1),{l}_{e}(2),\dots ,{l}_{e}(i),\dots ,{l}_{e}(L)],\quad {l}_{e}(i)\in \{1,2,\dots ,L\}.$$

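
A seeded index vector of this kind can be sketched as follows. Two things here are assumptions that the text above does not state: that the entries of ${l}_{enc}$ are distinct (that is, a permutation of $1,\dots ,L$), and that the vector is turned into a matrix by placing a single 1 in each row. The function names are hypothetical.

```python
import numpy as np

def random_index_vector(seed: int, L: int) -> np.ndarray:
    """Seeded vector l_enc with entries in {1, ..., L}.

    Assumption: entries are distinct, so l_enc is a permutation.
    """
    rng = np.random.default_rng(seed)
    return rng.permutation(L) + 1  # shift 0-based indices to 1..L

def permutation_matrix(l_enc: np.ndarray) -> np.ndarray:
    """Hypothetical construction: row i has a single 1 in column l_enc[i]."""
    L = len(l_enc)
    P = np.zeros((L, L))
    P[np.arange(L), l_enc - 1] = 1.0
    return P

l_enc = random_index_vector(seed=42, L=6)
E_enc = permutation_matrix(l_enc)
```

Under these assumptions the resulting matrix is orthogonal, so its inverse is simply its transpose.
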
#### 3.2.3. Test Image Encryption

- Divide a test (query) image tensor $x\in {\mathbb{R}}^{h\times w\times c}$ into blocks with a size of $p\times p$ such that $B=\{{B}_{1},\cdots ,{B}_{N}\}$.
- Flatten each block ${B}_{i}$ into a vector ${b}_{i}$ such that$${b}_{i}=[{b}_{i}\left(1\right),\cdots ,{b}_{i}\left(L\right)],$$
- Generate an encrypted vector $\widehat{{b}_{i}}$ by multiplying ${b}_{i}$ by ${\widehat{\mathbf{E}}}_{\mathbf{enc}}$ as$$\widehat{{b}_{i}}={b}_{i}{\widehat{\mathbf{E}}}_{\mathbf{enc}},\phantom{\rule{4pt}{0ex}}\widehat{{b}_{i}}\in {\mathbb{R}}^{L},$$
- Concatenate the encrypted vectors into an encrypted test image $\widehat{x}$.
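
The four steps above can be sketched in NumPy as below. This is a sketch under assumptions, not the authors' code: the helper names are hypothetical, $L=p\cdot p\cdot c$ is assumed, and $h$ and $w$ are assumed to be multiples of $p$. The demo additionally assumes that ${\widehat{\mathbf{E}}}_{\mathbf{enc}}$ is the inverse of ${\mathbf{E}}_{\mathbf{enc}}$, which is the choice that makes an encrypted block multiplied by $\widehat{\mathbf{E}}$ equal a plain block multiplied by **E**.

```python
import numpy as np

def to_blocks(x: np.ndarray, p: int) -> np.ndarray:
    """Steps 1-2: split an (h, w, c) image into N flattened blocks of length p*p*c."""
    h, w, c = x.shape
    assert h % p == 0 and w % p == 0, "sketch assumes h and w are multiples of p"
    blocks = x.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return blocks.reshape(-1, p * p * c)

def from_blocks(b: np.ndarray, shape: tuple, p: int) -> np.ndarray:
    """Step 4: put N flattened blocks back into an (h, w, c) image."""
    h, w, c = shape
    blocks = b.reshape(h // p, w // p, p, p, c).transpose(0, 2, 1, 3, 4)
    return blocks.reshape(h, w, c)

def encrypt_image(x: np.ndarray, p: int, E_hat_enc: np.ndarray) -> np.ndarray:
    """Step 3 applied to every block: b_hat_i = b_i @ E_hat_enc."""
    b_hat = to_blocks(x, p) @ E_hat_enc
    return from_blocks(b_hat, x.shape, p)

# Toy demo with made-up sizes and random stand-ins for x, E, and E_enc.
rng = np.random.default_rng(0)
p, c, D = 4, 3, 8
L = p * p * c
x = rng.random((8, 8, c))
E = rng.standard_normal((L, D))
E_enc = rng.standard_normal((L, L))
E_hat = E_enc @ E                  # encrypted embedding from Section 3.2.1
E_hat_enc = np.linalg.inv(E_enc)   # assumption: inverse of E_enc
x_hat = encrypt_image(x, p, E_hat_enc)

# Embeddings of encrypted blocks under E_hat match plain blocks under E.
same = np.allclose(to_blocks(x_hat, p) @ E_hat, to_blocks(x, p) @ E, atol=1e-6)
```

In this sketch the encrypted image differs from the plain one, yet the encrypted model sees the same patch embeddings, which is consistent with the unchanged accuracy reported for the correct key.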

#### 3.3. Requirements of Proposed Method

- Semantic segmentation can be carried out by using visually protected input images without sensitive information.
- No network modification is required.
- A high accuracy, which is close to that of using plain images, can be maintained.
- Keys are easily updated.

## 4. Experimental Results

#### 4.1. Setup

#### 4.2. Semantic Segmentation Performance

#### 4.3. Comparison with Conventional Methods

#### 4.4. Robustness against Attacks

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

**Figure 2.** Architecture of segmentation transformer [17].

**Figure 4.** Example of encrypted images ($p\times p=16\times 16$). Zoom-ins of red-framed regions are shown on the right side of each image. The red boxes represent sensitive information such as license plates.

**Figure 6.** Mean IoU ($mIoU$) values of protected models with 50 randomly generated keys. Boxes span from the first to the third quartile, referred to as ${Q}_{1}$ and ${Q}_{3}$, and whiskers show the maximum and minimum values in the range $[{Q}_{1}-1.5({Q}_{3}-{Q}_{1}),{Q}_{3}+1.5({Q}_{3}-{Q}_{1})]$. The band inside each box indicates the median. Outliers are indicated as dots. Blue lines represent each baseline.
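
The box-plot convention described in the Figure 6 caption can be made concrete with a short sketch. The function name and the sample values are hypothetical (the numbers are made up and are not results from the paper), and quartiles are computed with NumPy's default linear interpolation, which may differ slightly from the plotting tool used for the figure.

```python
import numpy as np

def box_plot_stats(values):
    """Quartiles, whiskers, and outliers under the 1.5 * IQR rule."""
    v = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = v[(v >= lo) & (v <= hi)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside.min(),   # smallest value inside the range
        "whisker_high": inside.max(),  # largest value inside the range
        "outliers": v[(v < lo) | (v > hi)],
    }

# Made-up mIoU values, one per key, for illustration only.
miou = [0.60, 0.62, 0.63, 0.64, 0.65, 0.66, 0.90]
stats = box_plot_stats(miou)
```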

Dataset | Selected Decoder | Baseline | Correct ($K$) | No-Enc | Random (${K}^{\prime}$) |
---|---|---|---|---|---|
Cityscapes | Naïve | 0.6490 | 0.6490 | 0.0674 | 0.0718 |
Cityscapes | MLA | 0.6386 | 0.6386 | 0.0792 | 0.0743 |
Cityscapes | PUP | 0.7039 | 0.7039 | 0.1135 | 0.1137 |
ADE20K | Naïve | 0.3710 | 0.3710 | 0.0023 | 0.0024 |
ADE20K | MLA | 0.4370 | 0.4370 | 0.0030 | 0.0029 |
ADE20K | PUP | 0.4383 | 0.4383 | 0.0048 | 0.0050 |

**Table 2.** Accuracy (mIoU) of conventional method with encrypted images [15]. Network: fully convolutional network (FCN). Baseline: 0.5966.

Block size | SHF, Correct ($K$) | SHF, No-enc | SHF, Random (${K}^{\prime}$) | NP, Correct ($K$) | NP, No-enc | NP, Random (${K}^{\prime}$) | FFX, Correct ($K$) | FFX, No-enc | FFX, Random (${K}^{\prime}$) |
---|---|---|---|---|---|---|---|---|---|
4 | 0.4731 | 0.4536 | 0.3671 | 0.4706 | 0.3359 | 0.1505 | 0.3823 | 0.0157 | 0.0012 |
16 | 0.2214 | 0.1994 | 0.1150 | 0.3439 | 0.2114 | 0.0832 | 0.2611 | 0.0007 | 0.0079 |

