1. Introduction
Single image deblurring is one of the most classic yet challenging research topics in the field of image restoration; it aims to restore a latent sharp image from a blurred image. Recently, deep-learning-based methods have achieved remarkable success in image deblurring by training models on large-scale synthetic deblurring datasets (e.g., GoPro [1], DVD [2], and REDS [3]). These datasets synthesize blurred images by averaging consecutive sharp frames sampled from videos, based on the premise that motion blur can be seen as the accumulation of movements occurring during the camera exposure duration [1]. Motivated by this, many methods have made significant progress in estimating a sequence of sharp frames from an observed blurred image, a task also known as blur decomposition [4].
Due to the complex and ill-posed nature of blur decomposition [5], existing methods [4,5,6] face significant challenges, and there is substantial room for improvement in generating visually pleasing, high-quality frames. In addition, most of them are designed to restore a fixed number of frames using supervised learning. This limits their flexibility and applicability, since producing a different number of frames requires adjusting the network architecture or training procedure. One practical way to restore a flexible number of frames is to first extract a fixed number of sharp frames with a blur decomposition method and then apply video interpolation to these frames. However, this approach may not be optimal, since inaccurate blur decomposition can degrade the quality of the interpolated frames.
In this work, we propose an Arbitrary Time Blur Decomposition Using Critic-Guided Triple Generative Adversarial Network (ABDGAN) (this article is based on Chapter 5 of the first author's Ph.D. thesis [7]). This approach restores a sharp frame with an arbitrary time index from a single blurred image, a task we refer to as arbitrary time blur decomposition. One of the main challenges of this problem is the lack of ground-truth (GT) images for every continuous time within the exposure duration in existing synthetic datasets [1]. Recent synthetic deblurring datasets (e.g., GoPro [1]) use a range of sharp frames to synthesize each blurred image. This means that only a limited number of timestamps are available for each blurred image, with no GT images for all continuous time codes over the exposure time. Consequently, when models are trained in a supervised fashion on these datasets, they may not be able to effectively restore images at timestamps that are not present in the training set [8].
As a departure from previous blur decomposition methods that rely on supervised learning, we propose a semisupervised framework. To this end, we adopt the TripleGAN framework [9], which consists of three players: a generator, a discriminator, and a classifier. For the blur decomposition task, we modify the roles of the three players and the objective functions for the generator and the label predictor. Specifically, our ABDGAN plays a min–max game among three players: a time-conditional deblurring network G, a discriminator D, and a label predictor C. Our G takes a time code and a blurred image as inputs and restores the corresponding sharp moment occurring within the exposure duration. Meanwhile, D estimates the probability of whether given images are real or fake. Concurrently, C predicts the time code when the blurred image and the latent sharp image are jointly given. Since the training (real) data do not include sharp frames for every continuous time, C is trained not only on real images but also on the sharp images generated by G. However, in our framework, a naïve adoption of TripleGAN [9] often suffers from unstable training of C due to the distribution discrepancy problem [10] between real and fake images. This arises especially in the early training phase, when the images restored by G may not match the real data distribution well enough, making it difficult for C to correctly predict the time codes for both real and fake images.
To mitigate this, unlike the original TripleGAN [9], which directly uses the fake samples obtained by the generator for training the classifier, our D assists C in filtering out unrealistic fake samples. To this end, we propose a critic-guided (CG) loss, which allows C to train on reliable fake images generated by G based on feedback from the critic D.
On the other hand, recovering the temporal ordering of frames from a single blurred image is a highly ill-posed problem, because the averaging that causes motion blur destroys the temporal ordering of the instant frames [5]. To address the challenge of frame-order recovery, most existing methods [5,6,11,12] utilize the pairwise ordering-invariant (POI) loss [5]. The POI loss is invariant to the order because it uses the average of two temporally symmetric frames and the absolute value of their difference. This lets the network choose which of the symmetric frames to generate during training [5]. As a result, this loss function facilitates stable network convergence by preventing temporal shuffling and ensuring temporal consistency among predicted frames. However, it is suboptimal because pixel-level consistency is not guaranteed; each pixel in a predicted image can end up corresponding to a different GT frame.
To address this problem, we propose a pairwise order-consistency (POC) loss that alleviates the pixel-level inconsistency inherent in the existing POI loss [5]. Our POC loss shares similarities with the POI loss in that it also involves temporally symmetric frames. However, whereas the POI loss only implicitly matches pairs of estimated frames and GT frames, our POC loss explicitly determines these pairs. Specifically, the proposed POC loss first determines whether the temporal order of the predicted sharp images aligns with the GT order or its reverse. This preliminary step determines which GT image each predicted image should be compared against. We then enforce this assignment across all pixels, so that every pixel of a predicted image consistently matches the corresponding pixel of the same ground-truth frame.
Figure 1 exemplifies the superiority of our model compared with previous methods [5,13]. Unlike existing methods, our model can restore highly accurate dynamic motion from a blurred image. Moreover, our model can extract sharp sequences at any desired frame rate, while competing methods are constrained to restoring a predetermined number of frames.
Our main contributions can be summarized as follows.
We propose Arbitrary Time Blur Decomposition Using Critic-Guided TripleGAN (ABDGAN), a semisupervised learning approach, to extract an arbitrary sharp moment as a function of a blurred image and a continuous time code.
We introduce a critic-guided (CG) loss, which addresses the issue of training instability, especially in the early stages, by guiding the label predictor to learn from trustworthy fake images with the assistance of the discriminator.
We introduce a pairwise order-consistency (POC) loss, designed to guarantee that every pixel in a predicted image consistently matches the corresponding pixel in a specific ground-truth frame.
Our extensive experiments demonstrate that our method surpasses existing methods in restoring high-quality frames at the GT frame rates, and consistently produces superior visual quality at arbitrary time codes.
The remainder of this paper is organized as follows. In Section 2, we review previous works on image deblurring. Section 3 presents the details of our proposed ABDGAN. Section 4 analyzes the experimental results of the proposed method. Section 5 discusses the limitations of our approach. Finally, in Section 6, we present the conclusions and future works.
3. Proposed Method
Let b and t represent a single blurred image and a temporal index within the normalized exposure time, respectively. Our goal is to restore a specific sharp moment conditioned on b and t. One of the major challenges of this goal is predicting the sharp moment for any continuous time code, particularly when the training data contain very few ground-truth images over the continuous time axis. To overcome this, the proposed ABDGAN utilizes semisupervised learning by leveraging both labeled and unlabeled data. Here, the labeled data are sampled from the real-data distribution and explicitly contain a ground-truth sharp image corresponding to each b and t. In contrast, the unlabeled data have no ground-truth sharp image. By leveraging both labeled and unlabeled data, the proposed method aims to predict an accurate sharp moment for any continuous time code, despite the scarcity of ground-truth sharp images in the training dataset.
Learning Arbitrary Time Blur Decomposition based on TripleGANs. Inspired by TripleGANs [9,10], which achieved successful results in conditional image synthesis in a semisupervised manner, our proposed ABDGAN introduces a new strategy. It plays a min–max game among three players: a time-conditional deblurring network G, a discriminator D, and a time-code predictor C, as depicted in Figure 2. As mentioned earlier, one of our major goals is to train G to predict any sharp moment corresponding to an arbitrary time code and a blurred image. To achieve this, D provides adversarial feedback that drives G to restore realistic sharp images. Simultaneously, the main role of C is to provide precise feedback so that G generates the temporal sharp moment that accurately corresponds to the input time code among the latent sharp motions within the blurred image.
Concretely, given a pair of b and t, the proposed time-conditional deblurring network G outputs the corresponding sharp frame. As illustrated in Figure 2, D receives a pair of b and a sharp image s as input, where s is either a real sharp image or a sharp image restored by G. During training, D learns to predict whether the input comes from the real-data distribution or the fake-data distribution. Structurally, we adopt a UNet discriminator [31] for D's architecture, which consists of an encoder that outputs a per-image critic score and a decoder that outputs a per-pixel critic score. Given a pair of b and s as input, where s is again either a real or a restored sharp image, our C is trained to accurately predict the corresponding temporal code, as depicted in Figure 2. Since C is trained on fake images as well as real images, it can provide adequate feedback to G so that the restored sharp moment aligns accurately with an arbitrary time code. Considering that image restoration is a pixel-by-pixel dense prediction task [32,33,34], we employ a UNet-based architecture [35] for C to provide per-pixel feedback on t for G. We define the temporal code map as a two-dimensional matrix filled with the value t at every pixel coordinate. Given an input pair of b and s, C fuses them via channel-wise concatenation and outputs a pixel-wise time-code map.
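As a small illustration of this input fusion and of the temporal code map, the following PyTorch-style sketch builds C's concatenated input and a constant time-code map; the function names and tensor shapes are assumptions made for this example, not the paper's notation.

```python
import torch

def predictor_input(blurred, sharp):
    """Channel-wise concatenation of the blurred and sharp images as C's input (sketch).
    blurred, sharp: tensors of shape (N, 3, H, W); returns (N, 6, H, W)."""
    return torch.cat([blurred, sharp], dim=1)

def time_code_map(t, height, width):
    """Constant temporal code map: a 2D map filled with t at every pixel (sketch).
    t: time codes of shape (N,); returns (N, 1, H, W)."""
    return t.view(-1, 1, 1, 1).expand(-1, 1, height, width)
```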
Pairwise-order consistency (POC) loss. To train our G more effectively with the labeled data, we propose the POC loss. Unlike the conventional POI loss [5] employed in previous studies [5,6,11,12], the proposed POC loss enforces stronger constraints on the temporal order of predicted frames, which leads to significant improvements in the accuracy and visual quality of the predicted frames compared to the existing POI loss.
Critic-guided (CG) loss. As mentioned earlier, the distribution discrepancy problem is one of the crucial challenges in training a TripleGAN-based framework [10]. The limited amount of labeled data may not be sufficient for G to learn to restore a sharp moment when the input temporal code is absent from the training data. To overcome this, our C is trained not only on labeled sharp images but also on fake sharp images restored by G from unlabeled data. However, especially in the early training phase, a distribution discrepancy can arise between real and fake images, which poses a challenge for C since it must predict correct time codes for both. To address this, we propose the CG loss, which optimizes C using realistic fake images by leveraging the decision made by D.
Table of notation. To ensure clarity and consistency, Table 2 provides a concise summary of the notation used throughout this paper. Unless stated otherwise, this notation is used consistently.
In the following, we explain the proposed pairwise-order consistency (POC) loss in Section 3.1 and the critic-guided (CG) loss in Section 3.2. The entire training procedure is described in Section 3.3.
3.1. Pairwise-Order Consistency Loss
In Appendix A, we describe the limitations of the existing POI loss [5]. Based on this analysis, we introduce our POC loss, which is designed to overcome these shortcomings. Consider a labeled sample consisting of a blurred image b and a pair of GT frames that are temporally symmetric about the central frame, together with their corresponding time codes (without loss of generality, we take t to be the earlier of the two). From this sample, G produces the two predicted frames conditioned on the two time codes. The proposed POC loss first compares the distance between the GT frame at time t and each of the two predicted frames to assess whether the temporal order of the predicted sharp images aligns with the GT order or its reverse. If the GT frame at time t is closer to the predicted frame conditioned on t than to the opposite time-symmetric predicted frame, the predicted frames are correctly aligned with the GT temporal order of sharp frames; in this case, we directly minimize the sum of the individual distances between each predicted frame and its corresponding GT frame. Conversely, if the GT frame at time t is closer to the opposite predicted frame, the predicted frames are aligned with the reverse of the GT order; in this case, we minimize the sum of the distances between each predicted frame and the GT frame of the opposite time code. Our POC loss marks a departure from the existing POI loss [5] (Equation (A1)): it introduces stricter constraints to ensure that every pixel in a predicted image aligns consistently with the same ground-truth (GT) frame, which substantially improves the accuracy and reliability of frame prediction.
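To make the pairing rule concrete, the following PyTorch-style sketch shows one way the POC loss could be computed; the tensor names, the use of an L1 distance, and the per-sample ordering test are assumptions for illustration rather than the paper's exact formulation.

```python
import torch

def poc_loss(pred_t, pred_sym, gt_t, gt_sym):
    """Sketch of a pairwise order-consistency (POC) loss.

    pred_t, pred_sym: frames predicted by G for a time code and its symmetric counterpart.
    gt_t, gt_sym:     ground-truth frames for the same two time codes.
    All tensors are assumed to have shape (N, C, H, W).
    """
    l1 = lambda a, b: (a - b).abs().mean(dim=(1, 2, 3))  # per-sample L1 distance

    # Decide, per sample, whether the predicted pair follows the GT order
    # or its reverse, by checking which prediction gt_t is closer to.
    forward_order = l1(gt_t, pred_t) <= l1(gt_t, pred_sym)

    # Loss when the prediction follows the GT order ...
    loss_fwd = l1(pred_t, gt_t) + l1(pred_sym, gt_sym)
    # ... and when it follows the reverse order.
    loss_rev = l1(pred_t, gt_sym) + l1(pred_sym, gt_t)

    return torch.where(forward_order, loss_fwd, loss_rev).mean()
```

Because the ordering decision is made once per image and then applied to every pixel, all pixels of a predicted frame are matched to the same GT frame, which is the pixel-level consistency the POC loss is designed to enforce.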
3.2. Critic-Guided Loss
The proposed CG loss trains C with trustworthy fake samples selected according to the pixel-wise critic score, i.e., the pixel-wise probability map predicted by D. D is trained to output a probability close to 1 when the input sample is as realistic as a real image, and close to 0 otherwise [29]. If the output probability of D for a fake image generated by G is 0.5, it means that G generates a sharp image for which D cannot distinguish between real and fake [29]. Based on this, we regard fake data for which the output of D is greater than 0.5 as trustworthy (realistic) fake data. Accordingly, a sigmoid-based soft threshold function is applied pixel-wise to the per-pixel critic score; the midpoint of the sigmoid curve corresponds to this threshold, and a steepness parameter controls how sharply the transition occurs. Applying this function to the outputs of D yields a weighting mask. Given blurred images sampled from the dataset and randomly sampled time codes, our CG loss is then a regression loss on the time codes of the fake images restored by G from them, weighted pixel-wise by this mask. As a result of this collaboration with D, our C is naturally liberated from the problem of distribution discrepancy between real and fake samples. Since C learns from realistic fake samples generated with arbitrary time codes, it overcomes the problem of the limited values of t in the training dataset.
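For illustration, the sketch below shows one possible realization of the critic-guided weighting; the function names, the midpoint of 0.5, the steepness value, and the per-pixel L1 regression are assumptions for this example, not the exact definition in Equation (2).

```python
import torch

def critic_weight(per_pixel_critic, midpoint=0.5, steepness=10.0):
    """Soft threshold on the per-pixel critic score (a probability map) from D's decoder.

    Pixels judged more realistic than `midpoint` receive weights close to 1,
    unrealistic pixels receive weights close to 0.
    """
    return torch.sigmoid(steepness * (per_pixel_critic - midpoint))

def cg_loss(pred_time_map, sampled_t, per_pixel_critic):
    """Critic-guided regression loss on fake samples (sketch).

    pred_time_map:    (N, 1, H, W) time-code map predicted by C for a fake frame.
    sampled_t:        (N,) time codes used to generate the fake frames with G.
    per_pixel_critic: (N, 1, H, W) per-pixel critic score of D for the fake frames.
    """
    target = sampled_t.view(-1, 1, 1, 1).expand_as(pred_time_map)  # constant map filled with t
    weight = critic_weight(per_pixel_critic)                       # trust only realistic pixels
    return (weight * (pred_time_map - target).abs()).mean()
```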
3.3. Training Objectives of ABDGAN
Similar to [9,10], the proposed ABDGAN plays a min–max game among the three networks D, C, and G. Algorithm 1 briefly outlines the optimization process of our ABDGAN. In Algorithm 1, M denotes the total number of training pairs, f indicates the integer frame index among the total number of sharp frames F, and the temporal code t within the normalized exposure time is computed from f and F.
Algorithm 1 Entire training procedure of ABDGAN
Input: training data. Initialize: D, C, G, the Adam [36] learning rates, and the batch size n. Set the balancing parameters between losses and the total number of training iterations.
1: for each training iteration do
2:   Update D using Algorithm 2
3:   Update C using Algorithm 3
4:   Update G using Algorithm 4
5: end for
Algorithm 2 Training of the discriminator D
1: Sample a batch of labeled data of size n and a batch of unlabeled data of size n
2: Generate fake samples using G
3: Compute the discriminator loss on the real and fake samples by Equation (3)
4: Update the parameters of D using the gradient of this loss by Adam [36]
Training of the discriminator D. As described in Algorithm 2, the objective of D is to correctly determine whether the given samples are real or fake. For this purpose, D is optimized to maximize the log probability of real samples and minimize the log probability of fake samples. Accordingly, the discriminator objective (Equation (3)) consists of a loss for the real samples and a loss for the fake samples, defined in terms of the per-image critic score measured by the encoder of D and the per-pixel critic score measured by the decoder of D at each pixel coordinate.
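For illustration only, the sketch below uses a standard binary cross-entropy formulation over both the per-image and per-pixel critic scores; it is not a reproduction of Equation (3), and the way the two score types are combined is an assumption.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(enc_real, dec_real, enc_fake, dec_fake):
    """Sketch of a UNet-discriminator loss with per-image and per-pixel terms.

    enc_*: per-image critic logits from D's encoder, shape (N,).
    dec_*: per-pixel critic logits from D's decoder, shape (N, 1, H, W).
    Real samples are pushed toward 1, fake samples toward 0.
    """
    real_loss = (F.binary_cross_entropy_with_logits(enc_real, torch.ones_like(enc_real))
                 + F.binary_cross_entropy_with_logits(dec_real, torch.ones_like(dec_real)))
    fake_loss = (F.binary_cross_entropy_with_logits(enc_fake, torch.zeros_like(enc_fake))
                 + F.binary_cross_entropy_with_logits(dec_fake, torch.zeros_like(dec_fake)))
    return real_loss + fake_loss
```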
Training of the time-code predictor C. The details of the training scheme of C are described in Algorithm 3. One key aspect of our ABDGAN is that C is trained to predict accurate time codes for both labeled and unlabeled data. To this end, the loss for training C is defined as the sum of two regression losses: a loss on labeled data (Equation (4)) and the CG loss on unlabeled data (Equation (2)). The regression loss for labeled data (Equation (4)) is defined using the labeled samples, which can be viewed as real images sampled from the training dataset, and penalizes the distance between the time-code map predicted by C for a real sharp image and the map filled with its ground-truth time code. For unlabeled data, as described in Section 3.2, C is trained using the proposed CG loss (Equation (2)). Both losses depend on the order estimated using G and the labeled data, which allows C to predict a temporal order that is consistent between time codes with ground-truth images and arbitrary symmetric time codes. Overall, the total loss for training C (Equation (5)) is the sum of these two terms.
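As a rough sketch building on the cg_loss example from Section 3.2, the total objective for C could be assembled as follows; the function and argument names and the equal weighting of the two terms are our own assumptions.

```python
def time_predictor_loss(pred_map_real, gt_t, pred_map_fake, sampled_t, per_pixel_critic):
    """Total loss for C (sketch): labeled regression term plus critic-guided term.

    pred_map_real: time-code map predicted by C for a real (labeled) sharp image.
    gt_t:          ground-truth time codes of the labeled images, shape (N,).
    pred_map_fake, sampled_t, per_pixel_critic: inputs to cg_loss (see the Section 3.2 sketch).
    """
    target_real = gt_t.view(-1, 1, 1, 1).expand_as(pred_map_real)
    labeled_term = (pred_map_real - target_real).abs().mean()            # regression on labeled data
    unlabeled_term = cg_loss(pred_map_fake, sampled_t, per_pixel_critic)  # CG loss on fake images
    return labeled_term + unlabeled_term
```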
Algorithm 3 Training of the time-code predictor C
1: Sample a batch of labeled data of size n and a batch of unlabeled data of size n
2: Predict the time-code maps on the labeled data using C
3: Predict the sharp images on the unlabeled data using G
4: Compute the weighting mask by applying the sigmoid soft threshold to the per-pixel critic scores of D
5: Compute the CG loss by Equation (2) and the labeled regression loss by Equation (4)
6: Compute the total loss by Equation (5)
7: Update the parameters of C using the gradient of the total loss by Adam [36]
Training of the generator G. To restore accurate pixel intensities, the POC loss (Equation (1)) is utilized for optimizing G. To encourage G to restore a realistic sharp image according to arbitrary time codes, the loss in Equation (6) is defined using the time codes estimated by C: for the symmetric labeled samples, C predicts time-code maps for the corresponding outputs of G, and the same is done for the outputs of G on randomly sampled time codes. To guarantee that the generated image is as realistic as the real data, an adversarial loss for G is defined (Equation (7)), and our overall adversarial loss (Equation (8)) applies it to the outputs of G for both the symmetric pair of time codes and the randomly sampled codes. Based on the above, the entire objective of G is formulated as the weighted sum of Equations (1), (6), and (8), with balancing weight parameters that are set empirically. Algorithm 4 shows the training scheme of G. As shown in line (2) of Algorithm 4, the predicted deblurring results are first obtained for the symmetric labeled samples; for unlabeled data, the predicted deblurring results are obtained from G in line (3). Then, the predicted time-code maps are obtained using C, as shown in lines (4) and (5) of Algorithm 4. The proposed POC loss is computed to restore more accurate pixel intensities on labeled data, the loss in Equation (6) allows G to restore a realistic sharp image according to arbitrary time codes, and the adversarial loss ensures that the generated images look real for both labeled and unlabeled data.
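As a rough illustration of how the terms could be combined, the sketch below assembles the weighted sum using the poc_loss example from Section 3.1; the individual term implementations, argument names, and default weights are assumptions, since Equations (6)–(9) are not reproduced here.

```python
import torch
import torch.nn.functional as F

def generator_loss(pred_t, pred_sym, gt_t, gt_sym,        # labeled predictions and GT frames
                   time_map_t, time_map_sym, t, t_sym,     # C's predicted maps and target codes
                   enc_fake, dec_fake,                      # D's critic logits for the fakes
                   w_poc=1.0, w_time=1.0, w_adv=1.0):
    """Total generator objective (sketch): weighted sum of three terms."""
    # Pixel-accuracy term on labeled data (see the poc_loss sketch in Section 3.1).
    poc_term = poc_loss(pred_t, pred_sym, gt_t, gt_sym)

    # Time-code term: C's predicted maps should match the conditioning time codes.
    target_t = t.view(-1, 1, 1, 1).expand_as(time_map_t)
    target_sym = t_sym.view(-1, 1, 1, 1).expand_as(time_map_sym)
    time_term = (time_map_t - target_t).abs().mean() + (time_map_sym - target_sym).abs().mean()

    # Adversarial term: push D's per-image and per-pixel scores toward "real".
    adv_term = (F.binary_cross_entropy_with_logits(enc_fake, torch.ones_like(enc_fake))
                + F.binary_cross_entropy_with_logits(dec_fake, torch.ones_like(dec_fake)))

    return w_poc * poc_term + w_time * time_term + w_adv * adv_term
```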
Algorithm 4 Training of the generator G
1: Sample a batch of labeled data of size n and a batch of unlabeled data of size n
2: Predict the sharp images on the labeled data using G
3: Predict the sharp images on the unlabeled data using G
4: Predict the time-code maps on the labeled data using C
5: Predict the time-code maps on the unlabeled data using C
6: Compute the POC loss by Equation (1), the loss on the estimated time codes by Equation (6), and the adversarial loss by Equation (8)
7: Compute the total loss by Equation (9)
8: Update the parameters of G using the gradient of the total loss by Adam [36]
5. Limitations
Figure 9 shows failure cases of our approach. The first row presents the results on a test image sampled from the benchmark GoPro test set [1]. In the second row, we show the outputs when the input blurred image is degraded by defocus blur; the input and ground-truth images are sampled from the recent benchmark defocus blur test set proposed by Lee et al. [47]. Here, the results for the central frames are shown for all models. Despite the significant advancements achieved by our proposed ABDGAN in motion-blur decomposition, we recognize two primary limitations. First, when the input image is severely degraded by substantial motion due to a large amount of local motion and camera shake, our method encounters challenges in accurately restoring a sharp frame, as shown in the first row of Figure 9. Second, since our method is specifically designed for restoring motion-blurred images, we did not account for other types of blur (e.g., defocus blur) that commonly occur in real-world scenarios. As depicted in the second row of Figure 9, our method encounters limitations in such cases.
Based on these limitations, we believe that future research should focus on addressing severe motion blur. Additionally, improving the model's robustness to various blur types, including defocus blur, remains a key direction for broader applicability.
6. Conclusions
In this paper, we proposed ABDGAN, a novel approach for arbitrary time blur decomposition. By incorporating a TripleGAN-based framework, our ABDGAN learns to restore an arbitrary sharp moment latent in a given blurred image even when the training data contain very few ground-truth images for continuous time codes. We also proposed a POC loss that encourages our generator to restore more accurate pixel intensities, and a CG loss that stabilizes training by mitigating the distribution discrepancy between generated and real frames. Extensive experiments on diverse benchmark motion blur datasets demonstrate the superior performance of our ABDGAN compared with recent blur decomposition methods in both quantitative and qualitative evaluations.
The proposed ABDGAN outperforms the best competitor on the GoPro test set, improving PSNR, SSIM, and LPIPS by 16.67%, 9.16%, and 36.61%, respectively. On the B-Aist++ test set, our method provides improvements of 6.99% in PSNR, 2.38% in SSIM, and 17.05% in LPIPS over the best competing method. In conclusion, the proposed ABDGAN restores an arbitrary sharp moment from a single motion-blurred image with accurate, realistic, and visually pleasing quality. We believe that the proposed ABDGAN expands the application scope of image deblurring, which has traditionally focused on restoring a single image, to arbitrary time blur decomposition.
We anticipate that extending our ABDGAN to tackle a broader range of blur types, including defocus blur, will result in a more versatile and comprehensive deblurring solution. Future work will focus on enhancing the model’s capability to handle diverse blur scenarios, thereby improving its applicability and effectiveness in real-world situations.