1. Introduction
A camera unavoidably suffers from noise in low illumination. The corrupted signal degrades the perceptual quality and the efficiency of subsequent processing tasks such as video compression and pattern recognition. Thus, denoising is an important issue in image and video processing.
A common technique to enhance image quality in low-illumination photography is to prolong the exposure time. However, motion blurring occurs in images with a long exposure time, particularly when moving objects are present in the scene, such as during the recording of sports competitions or vehicular collision experiments. Blurring and noise are thus the key challenges in ensuring good video quality in low illumination: denoising is required when the exposure time is short, whereas deblurring is needed when the exposure time is long. Because deblurring is more complex and time consuming than denoising in computer vision, the latter can be regarded as the preferable option.
A short exposure time produces high-frame-rate videos with crisp and fluid imagery. Judder, which appears in normal-frame-rate videos with a short exposure time, is also absent. Fig. 1 compares normal-frame-rate and high-frame-rate videos; two high-frame-rate cameras were used to simultaneously record a waving hand. The normal video frame rate is approximately 25 fps, which is unable to provide clear images of fast-moving objects. Depending on the velocity of motion, a high frame rate, such as 100 fps, 250 fps, or even higher, is necessary.
Fig. 1. Comparison between the 25 fps video and the 250 fps video. (a) The 25 fps video with a blurred waving hand. (b) The 250 fps video with a clear waving hand.
Numerous studies on video denoising have been conducted in recent decades. To a certain extent, denoising methods such as those presented in [1-7] work effectively in low illumination. We take VBM3D [2] as an example. Fig. 2 shows the denoising result of VBM3D for additive white Gaussian noise (AWGN) with standard deviation σn = 30, which represents light noise in low illumination. The result looks excellent compared with the ground truth frame (Fig. 1(b)), except for some details, such as the text on the top left corner.
Fig. 2. Denoising performance of VBM3D for AWGN with standard deviation σn = 30. (a) Noisy frame whose ground truth is Fig. 1(b) (PSNR = 18.59 dB). (b) Denoising result of VBM3D (PSNR = 33.57 dB).
However, the performance degrades dramatically in ultra-low illumination. Fig. 3 shows the denoising result of VBM3D for a video captured in a real noisy environment in ultra-low illumination. The video is captured by a high-frame-rate sensor (Viimagic 9222B). An image in ultra-low illumination has minimal chrominance components; hence, only the luminance part is taken. Fig. 3 shows that VBM3D removes noise to a certain degree, but the visual quality can still be improved. Such improvement is difficult in ultra-low illumination because the useful signal is nearly submerged in noise.
Fig. 3. Denoising performance of VBM3D in ultra-low illumination. (a) Noisy frame. (b) Denoising result of VBM3D.
In this study, we address the video denoising problem in ultra-low illumination. The target video has a stationary background. This type of video has extensive applications, such as in ubiquitous surveillance cameras.
Our work is based on the Kalman filtering framework. The Kalman filter was proposed by Rudolf E. Kalman [8] in 1960. This set of mathematical equations provides an efficient recursive means to estimate the state of a process such that the mean squared error is minimized. The Kalman filter is widely used in denoising and other video processing tasks [9-12]. For video denoising, the algorithm removes noise from a signal by propagating error covariance statistics. In this study, motion is modeled as imaging process noise, so estimating motion determines the denoising capability of the Kalman filter. We employ two characteristics of noisy high-frame-rate videos to obtain reliable motion estimation. The first is the small motion vector: the motion area between two frames with a small motion vector is contained within the motion area between two frames with a large motion vector, whereas flickering noise does not exhibit this feature. The second is the loss of small objects and details in large-scale noise in ultra-low illumination. The movement of such minute objects and details is difficult to detect; we therefore sacrifice them to improve the estimation of the major parts, which benefits overall performance.
The remainder of this paper is organized as follows. Section 2 discusses related works on video denoising. Section 3 presents our Kalman filtering framework. Section 4 proposes a motion estimation scheme for ultra-low illumination. Section 5 discusses the experiments. Finally, Section 6 summarizes the study.
2. Related Work
Video denoising can exploit redundant information from nearby frames; thus, a better denoising capability than that of single-image processing can be expected. Determining how to handle the temporal relationship among frames is the key to video denoising.
In recent years, many algorithms have used clusters of similar 2D patches to implicitly estimate motion information [2,4,13-15]. Similar patches are matched across several frames in spatiotemporal space. In [13-15], weighted averaging over the selected patches was conducted after patch matching to denoise the reference patch. In [4], patch matching was followed by an adaptive thresholding approach, i.e., SURE-LET. In VBM3D [2], a two-step Wiener filtering framework was used to handle the clusters of similar patches.
Meanwhile, some researchers [3,6,16,17] found that 3D patches are more appropriate for video denoising than 2D patches. The former can possibly better characterize motion-related temporal dependency than the latter. In [6] and [16], 3D patches were used as atoms of a sparse dictionary. In [17], a Bayesian framework was proposed to process 3D patch clusters. The basic blocks of VBM4D [3], which is an extension of VBM3D [2], are spatiotemporal 3D volumes that form a 4D group.
Some researchers directly conducted 3D transformations on video data without using 3D patches [5,18-21]. In [18,19], two 3D complex wavelet transforms were proposed. In [5,20], a 2D discrete shearlet transform [21] was extended to have a 3D version.
Motion information is implicitly represented regardless of whether 2D patches, 3D patches, or 3D domain transforms are used for video denoising. However, many researchers have also sought to estimate motion explicitly [7,22-26]. In [22], a block-based multiple-hypothesis motion estimation method was proposed. In [23] and [24], optical flow methods were used to estimate motion. In [25], a hierarchical motion estimation method was discussed; its basic idea is to track matching blocks and filter along the motion trajectory. Although 2D patches were used in [22] and [25], motion information was explicitly obtained. In [26], motion estimation was performed in the wavelet transform domain. In [7], a noise-robust cross-correlation algorithm in the Fourier domain, called the spatiotemporal Gaussian scale mixture (ST-GSM), was proposed for motion estimation. Similarly, we employ the explicit representation approach to estimate motion in the current work.
All of the aforementioned algorithms can be directly applied to high-frame-rate videos. In this study, however, we exploited several new characteristics to improve denoising performance. The details are discussed in Section 4.
3. The Kalman Filtering Framework for Video Denoising
The discrete time instant is denoted as k. The system state at the previous time step is \(\hat{\mathbf{x}}_{k-1}\), the optional control input is \(\mathbf{u}_k\), and the system working process noise is \(\mathbf{w}_k\). A is the state transition model that operates on the state of the previous time step, and B is the control input model that operates on \(\mathbf{u}_k\). The predicted state of the current time step \(\hat{\mathbf{x}}_{k}^{-}\) is \(\hat{\mathbf{x}}_{k}^{-}=\mathbf{A} \hat{\mathbf{x}}_{k-1}+\mathbf{B} \mathbf{u}_{k}+\mathbf{w}_{k}\). The actual measurement \(\mathbf{z}_k\) is given by \(\mathbf{z}_{k}=\mathbf{H} \hat{\mathbf{x}}_{k}+\mathbf{v}_{k}\), where H is the observation model that maps the true state space into the observed space, \(\hat{\mathbf{x}}_{k}\) is the state of the current time step, and \(\mathbf{v}_k\) is the observation noise.
In video imaging systems, a camera directly records the input light rays. Thus, the state transition model is A = I and the observation model is H = I, where I is the identity matrix. No control input is available during video capture; hence, uk = 0. Consequently, the predicted state of the video imaging system is \(\hat{\mathbf{x}}_{k}^{-}=\hat{\mathbf{x}}_{k-1}+\mathbf{w}_{k}\), and the actual measurement is \(\mathbf{z}_{k}=\hat{\mathbf{x}}_{k}+\mathbf{v}_{k}\).
The working process noise w and the observation noise v are independent Gaussian random processes. We model w as caused by motion and v as caused by camera noise. Both noises are assumed to have zero-mean Gaussian distributions with covariances Q and R, respectively, i.e., w ~ N(0,Q) and v ~ N(0,R).
The Kalman filter works in two steps: the priori state prediction and the updated posteriori state estimation. For the video denoising system, the former is given by
\(\hat{\mathbf{x}}_{k}^{-}=\hat{\mathbf{x}}_{k-1}\). (1)
This equation indicates that if process noise is absent (i.e., Q = 0), then the predicted state equals the system state of the previous time step. The updated posteriori state is expressed as follows:
\(\hat{\mathbf{x}}_{k}=\hat{\mathbf{x}}_{k}^{-}+\mathbf{K}_{k}\left(\mathbf{z}_{k}-\hat{\mathbf{x}}_{k}^{-}\right)\), (2)
where Kk is the optimal Kalman gain that minimizes the posteriori error covariance. This gain is computed as follows:
\(\mathbf{K}_{k}=\mathbf{P}_{k}^{-}\left(\mathbf{P}_{k}^{-}+\mathbf{R}\right)^{-1}\), (3)
where \(\mathbf{P}_{k}^{-}\) is the predicted priori estimate covariance. This covariance is computed as follows:
\(\mathbf{P}_{k}^{-}=\mathbf{P}_{k-1}+\mathbf{Q}\), (4)
where Pk-1 is the posteriori estimate covariance of the previous time step. The updated posteriori estimate covariance of the current time step is expressed as follows:
\(\mathbf{P}_{k}=\left(\mathbf{I}-\mathbf{K}_{k}\right) \mathbf{P}_{k}^{-}\). (5)
If the noise covariance of a single image pixel is \(\sigma_{n}^{2}\), then R can be written as \(\mathbf{R}=\sigma_{n}^{2} \mathbf{I}\) for the 2D image matrix. The covariance Q of the imaging process noise w is estimated as \(\mathbf{Q}=\left(\boldsymbol{\Delta}_{k}^{k-1, k}\right)^{2}\), where \(\boldsymbol{\Delta}_{k}^{k-1, k}\) is the motion-caused deviation between the current frame k and the last frame k – 1. The computation of this deviation is further discussed in Section 4.
Regarding initialization, the measurement of the first noisy frame z1 functions as the posteriori state estimate, i.e., \(\hat{\mathbf{x}}_{1}=\mathbf{z}_{1}\). Kalman filtering starts at the second frame. The posteriori estimate covariance of the first frame is set as P1 = R .
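Since the models reduce to identities, the recursion in Eqs. (1)-(5) operates pixelwise, so every matrix becomes a per-pixel scalar that can be vectorized over the frame. The following minimal NumPy sketch illustrates one filtering step under this assumption; the frame arrays, the noise level sigma_n, and the motion deviation delta (computed in Section 4) are placeholder inputs.

```python
import numpy as np

def kalman_step(x_prev, P_prev, z_k, delta, sigma_n):
    """One per-pixel Kalman denoising step, Eqs. (1)-(5).

    x_prev : posteriori estimate of frame k-1 (H x W float array)
    P_prev : posteriori estimate covariance of frame k-1
    z_k    : current noisy measurement
    delta  : motion-caused deviation between frames k-1 and k (Section 4)
    sigma_n: camera noise standard deviation, so R = sigma_n**2
    """
    R = sigma_n ** 2
    Q = delta ** 2                     # process noise from motion, Q = Delta^2
    x_pred = x_prev                    # Eq. (1): priori state prediction
    P_pred = P_prev + Q                # Eq. (4): priori estimate covariance
    K = P_pred / (P_pred + R)          # Eq. (3): optimal Kalman gain
    x_k = x_pred + K * (z_k - x_pred)  # Eq. (2): posteriori state update
    P_k = (1.0 - K) * P_pred           # Eq. (5): posteriori covariance update
    return x_k, P_k, K

# Initialization as described above: x_1 = z_1 and P_1 = R, e.g.,
# x_hat = frames[0].astype(np.float64)
# P = np.full_like(x_hat, sigma_n ** 2)
```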
Moreover, according to Eqs. (3) and (4), a large motion estimate \(\boldsymbol{\Delta}_{k}^{k-1, k}\) yields a large Kalman gain Kk in the motion area. In the extreme case Kk = I, the updated posteriori state reduces to \(\hat{\mathbf{x}}_{k}=\mathbf{z}_{k}\) by Eq. (2). Thus, spatial denoising is required in the motion area to improve denoising performance. In our work, the classical edge-preserving filter, i.e., the bilateral filter [27], is employed. It is calculated as
\(\begin{array}{l} \tilde{x}_{i, k}=\frac{\sum_{s \in N(i)} G_{S}\left(\|s-i\|, \sigma_{S}\right) \cdot G_{I}\left(\left\|z_{s}-z_{i}\right\|, \sigma_{I}\right) z_{s}}{\sum_{s \in N(i)} G_{S}\left(\|s-i\|, \sigma_{S}\right) \cdot G_{I}\left(\left\|z_{s}-z_{i}\right\|, \sigma_{I}\right)} \\ G_{S}\left(\|s-i\|, \sigma_{S}\right)=\frac{1}{\sqrt{2 \pi} \sigma_{S}} \exp \left(-\frac{\|s-i\|^{2}}{2 \sigma_{S}^{2}}\right) \\ G_{I}\left(\left\|z_{s}-z_{i}\right\|, \sigma_{I}\right)=\frac{1}{\sqrt{2 \pi} \sigma_{I}} \exp \left(-\frac{\left\|z_{s}-z_{i}\right\|^{2}}{2 \sigma_{I}^{2}}\right) \end{array}\), (6)
where \(\tilde{x}_{i, k}\) is the spatial bilateral denoising result of frame k at spatial coordinate i, and zs, zi are the pixel values of the noisy frame zk at positions s and i. N(i) is a (2r + 1) × (2r + 1) block centered at i, and s ranges over the coordinates of N(i), i.e., \(s_u \in [i_u - r, i_u + r]\) in the horizontal direction u and \(s_v \in [i_v - r, i_v + r]\) in the vertical direction v. Two Gaussian kernels are used in the bilateral filter. The first, the spatial distance kernel GS(·) with standard deviation σS, is the same as in the Gaussian filter. The second, the pixel intensity difference kernel GI(·) with standard deviation σI, preserves edges: a large intensity difference ║zs - zi║ yields a small weight GI, so pixels on different sides of an edge are distinguished.
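For reference, a direct (not speed-optimized) NumPy sketch of Eq. (6) follows. The window radius r and the two standard deviations are assumed parameters; the 1/(√(2π)σ) normalization factors are omitted because they cancel between the numerator and the denominator.

```python
import numpy as np

def bilateral_filter(z, r=2, sigma_s=3.0, sigma_i=5.0):
    """Bilateral spatial denoising, Eq. (6), over a (2r+1) x (2r+1) window."""
    z = z.astype(np.float64)
    H, W = z.shape
    pad = np.pad(z, r, mode="reflect")
    num = np.zeros_like(z)
    den = np.zeros_like(z)
    for du in range(-r, r + 1):
        for dv in range(-r, r + 1):
            shifted = pad[r + du : r + du + H, r + dv : r + dv + W]
            g_s = np.exp(-(du ** 2 + dv ** 2) / (2 * sigma_s ** 2))   # kernel G_S
            g_i = np.exp(-((shifted - z) ** 2) / (2 * sigma_i ** 2))  # kernel G_I
            w = g_s * g_i
            num += w * shifted
            den += w
    return num / den
```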
The spatial denoising result is denoted as \(\tilde{\mathbf{x}}_{k}\). Kalman temporal denoising \(\hat{\mathbf{x}}_{k}\) and bilateral spatial denoising \(\tilde{\mathbf{x}}_{k}\) are mixed by weighted averaging. As the Kalman gain Kk approaches zero, the reliability of the actual measurement zk decreases, whereas the trust in the predicted estimate \(\hat{\mathbf{x}}_{k}^{-}\) increases. Accordingly, we use the Kalman gain Kk, which reflects the degree of motion, as the weight. The final denoising result is thus
\(\mathbf{x}_{k}=\tilde{\mathbf{x}}_{k} \cdot \mathbf{K}_{k}+\hat{\mathbf{x}}_{k} \cdot\left(\mathbf{I}-\mathbf{K}_{k}\right)\). (7)
Accordingly, the predicted priori state estimate in Eq. (1) is revised as \(\hat{\mathbf{x}}_{k}^{-}=\mathbf{x}_{k-1}\).
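Putting the pieces together, one pass of the framework might look like the sketch below, which reuses the two hypothetical functions sketched earlier; the Kalman gain K doubles as the mixing weight of Eq. (7).

```python
def denoise_frame(x_prev, P_prev, z_k, delta, sigma_n):
    """One frame of the full framework: temporal plus spatial denoising."""
    # Temporal Kalman denoising (kalman_step from the sketch above)
    x_hat, P_k, K = kalman_step(x_prev, P_prev, z_k, delta, sigma_n)
    # Spatial denoising of the motion area (bilateral_filter from above)
    x_tilde = bilateral_filter(z_k)
    # Eq. (7): gain-weighted mixing of the two results
    x_k = K * x_tilde + (1.0 - K) * x_hat
    return x_k, P_k  # x_k also serves as the next priori estimate, revised Eq. (1)
```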
Fig. 4 shows the Kalman filtering performance without motion estimation (Q = 0). The video is the same as that in Fig. 3, and a total of 350 frames are used. The noise standard deviation σn is set to 100. In the figure, the still background is clearer than that of VBM3D (Fig. 3(b)). Our Kalman filtering framework suits videos with a fixed background because it exploits all frames by propagating the estimate error covariance Pk; the estimation of the original signal improves as the number of frames increases. By contrast, VBM3D employs only a few adjacent frames. Thus, our framework obtains a better estimation of the background.
Fig. 4. Denoising performance of the Kalman filtering framework without motion estimation.
However, blurring occurs without motion estimation. Nearly no moving object can be observed in Fig. 4, even though a moving man at the left of the visual field is visible in Fig. 3(b). This blurring phenomenon indicates the importance of motion estimation.
4. Motion Estimation for High-Frame-Rate Videos in Ultra-Low Illumination
Our motion estimation method is based on frame differencing. In ultra-low illumination, the noise is so large that the direct frame difference cannot provide a good estimation, which significantly impairs motion estimation [Fig. 5(e) and Fig. 6(b)]. A Gaussian filter is therefore employed to preprocess each noisy frame. In Fig. 5(f) and Fig. 6(c), a large Gaussian kernel with a 20 × 20 window and a standard deviation σG = 5 is used to suppress noise.
Fig. 5. Difference between two successive frames (k – 1 and k). (a) Noise-free frames. (b) Noisy frames of (a) polluted by AWGN with standard deviation σn = 50. (c) Gaussian prefiltered frames of (b) with a 20 × 20 Gaussian kernel (σG = 5). (d) Frame difference of (a), which functions as the ground truth. (e) Frame difference of (b). (f) Frame difference of (c).
Fig. 6. Difference between two spaced frames (frames k – 5 and k). (a) Frame difference of the noise-free frames. (b) Frame difference of the noisy frames. (c) Frame difference of the Gaussian prefiltered frames.
Although the influence of noise is decreased to a considerable extent by the large Gaussian kernel, several undesired patches still hinder reliable motion estimation. We further improve our method by applying the small motion vector characteristic and by intentionally sacrificing small changing patches.
We first define the small motion vector. Suppose the length of an object along the motion direction is L, and its movement distance between two successive frames is M; both are measured in pixels. Let M = αL. We define a video to have the small motion vector characteristic when α ≤ 0.5, i.e., the interframe motion distance is less than half of the object size; this is a rough definition. In reality, several moving objects with different lengths and velocities may coexist in a scene, yet a user often concentrates on only one moving object within a period of time. We call the movement of this object of greatest concern the main movement. To simplify the problem, we consider only the main movement for evaluation. Under this condition, the selection of the main movement is subjective. For example, suppose a slowly walking man and a rushing car appear in the same scene. The movement of the man satisfies α ≤ 0.5, whereas the movement of the car does not. If the man is highlighted, the video can be considered to have the small motion vector characteristic; if the car is selected, it cannot. When α ≤ 0.1, the small motion vector characteristic can be considered obvious.
Small motion vector of high-frame-rate videos
With large-scale noise, motion estimation becomes more difficult for a small motion vector than for a large one, as shown in Fig. 5. The video in the figure records a moving man at 250 fps. Nearly no motion information can be obtained from the difference of the noisy frames [Fig. 5(e)], and even the Gaussian prefilter loses its effect [Fig. 5(f)]. However, a large motion vector is easier to detect than a small one (Fig. 6). Frame k in Fig. 6 is the same as frame k in Fig. 5, but the interval in Fig. 6 is five frames.
Fig. 5 and Fig. 6 also show another feature of high-frame-rate videos: based on the same reference frame, the motion area between two close frames is nearly contained within the motion area between two relatively far frames. This feature results from the small motion vector, since close frames imply a small motion area. Fig. 7 further illustrates this feature with several typical motion modes. The major part of the motion area conforms to the feature; only some minor parts at the edge of the motion area (in red) violate the rule. When the frame rate is high, this containment rule is reliable. Flickering noise, however, does not satisfy it.
Fig. 7. Motion area containment for several typical motion modes. Only the small red regions violate the rule. (a) Translation. (b) Rotation. (c) Scale variation.
Neglecting small changing patches in ultra-low illumination
Most details and small objects are submerged in noise in ultra-low illumination. From the perspective of frequency-domain analysis, these details and small objects constitute the high-frequency part of the image. Noise also resides in the high-frequency part, so the two overlap. The only content that can be easily recognized in ultra-low illumination is the main structure of the image, which lies in the low-frequency part. From the perspective of principal component analysis, the main structure is the principal component of the image and represents its main feature. A change in the main structure is easily perceived, whereas a change in the details is difficult to detect because the change caused by noise seriously interferes with the change caused by motion. This also explains why motion estimation is difficult for the small motion vector in Fig. 5.
In this study, we do not attempt to detect small motions; instead, we neglect small changing patches. As an extreme example, detecting a fly in ultra-low illumination is pointless. Our objective is to sacrifice the minor parts to improve the major parts. Thus, the red regions in Fig. 7 that violate the containment rule can also be ignored.
Our motion estimation scheme is based on the two aforementioned assumptions. Suppose the last denoised frame is xk-1 and the current noisy frame is zk. The future adjacent frames zk+i (i = 1,2,⋯) are used to assist motion segmentation. A large motion area spanning N1 frames is first segmented with the help of N2 extra frames. Segmentation relies on the containment rule of small motion vectors. Then, motion estimation for each pair of successive images within the N1 frames is performed in the segmented area. Subsequently, large motion area segmentation is conducted between frames k – 1 + N1 and k – 1 + 2N1, as shown in Fig. 8.
Fig. 8. Extra N2 frames assist the large motion area segmentation of N1 frames.
The motion area is initially segmented as a black-and-white map by hard thresholding as follows:
\(\mathbf{B}_{k}^{k-1, k-1+i}=\left\{\begin{array}{l} 1,\left|G\left(\mathbf{x}_{k-1}\right)-G\left(\mathbf{z}_{k-1+i}\right)\right|>t \\ 0,\left|G\left(\mathbf{x}_{k-1}\right)-G\left(\mathbf{z}_{k-1+i}\right)\right| \leq t \end{array}\right.\), (8)
where i = N1, N1 + 1,⋯, N1 + N2; G(·) is the Gaussian prefilter operation, and t is the threshold below which intensity differences are ignored. For the intensity range 0 to 255, we fix t at 5 because the human eye can hardly discriminate an intensity difference below 5.
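A sketch of Eq. (8), assuming SciPy's gaussian_filter as the prefilter G(·); truncate=2.0 at σG = 5 yields a window of roughly the 20 × 20 size used in our experiments.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def motion_map(x_ref, z_frame, sigma_g=5.0, t=5.0):
    """Binary motion map by hard thresholding, Eq. (8).

    x_ref  : denoised reference frame x_{k-1}
    z_frame: noisy frame z_{k-1+i}
    """
    g_x = gaussian_filter(x_ref.astype(np.float64), sigma_g, truncate=2.0)
    g_z = gaussian_filter(z_frame.astype(np.float64), sigma_g, truncate=2.0)
    return (np.abs(g_x - g_z) > t).astype(np.uint8)
```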
The refinement of the motion area between frames k – 1 and k – 1 + N1 is obtained with an AND operation over the other N2 adjacent motion segmentation maps, according to the containment rule of small motion vectors:
\(\hat{\mathbf{B}}_{k}^{k-1, k-1+N_{1}}=\mathbf{B}_{k}^{k-1, k-1+N_{1}} \cap \cdots \cap \mathbf{B}_{k}^{k-1, k-1+i} \cap \cdots \cap \mathbf{B}_{k}^{k-1, k-1+N_{1}+N_{2}}\), (9)
where i = N1, N1 + 1,⋯, N1 + N2.
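The refinement of Eq. (9) then amounts to a logical AND across the candidate maps. A sketch under the same assumptions, with future_frames a placeholder container holding z_{k-1+i} at index i:

```python
def refine_map(x_ref, future_frames, N1, N2, sigma_g=5.0, t=5.0):
    """AND the maps B_k^{k-1,k-1+i} for i = N1, ..., N1+N2, Eq. (9)."""
    # motion_map is the Eq. (8) sketch above; maps are 0/1 uint8 arrays
    B = motion_map(x_ref, future_frames[N1], sigma_g, t)
    for i in range(N1 + 1, N1 + N2 + 1):
        B &= motion_map(x_ref, future_frames[i], sigma_g, t)
    return B
```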
Several small changing patches still exist after the AND operation. We ignore these remaining patches by
\(\tilde{\mathbf{B}}_{k}^{k-1, k-1+N_{1}}=\left\{\begin{array}{l} 1, \text { Area }\left(\text { Conn } \operatorname{Region}\left(\hat{\mathbf{B}}_{k}^{k-1, k-1+N_{1}}\right)\right)>S \\ 0, \text { Area }\left(\text { Conn } \operatorname{Region}\left(\hat{\mathbf{B}}_{k}^{k-1, k-1+N_{1}}\right)\right) \leq S \end{array}\right.\), (10)
where ConnRegion(·) extracts the connected regions of \(\hat{\mathbf{B}}_{k}^{k-1, k-1+N_{1}}\). The area of each connected region, measured as its number of pixels, is then calculated with Area(·). If the area is smaller than S, then the region is set to 0 and thus neglected.
We adopt the classical connected region detection algorithm in [28] for the ConnRegion(·) operation. The algorithm is based on 8-connectivity and proceeds as follows. First, the black-and-white binary image, whose pixel values are 0 or 1, is scanned pixel by pixel from left to right and top to bottom. Let x(m,n) ∈ {0,1} denote the pixel currently being processed at position (m,n). For each white pixel (x = 1), inspect its left, top-left, top, and top-right neighbors. If all neighbor labels are 0, a new label is assigned to position (m,n) in a label map L. If one or more neighbors have nonzero labels, the smallest of these labels is assigned to L(m,n), and the other nonzero neighbor labels are recorded as equivalent to it. After the scan, the equivalence relations among labels are resolved according to reflexivity, symmetry, and transitivity, and each group of equivalent labels is replaced by its smallest member. For example, if label 1 is equivalent to label 2 and label 2 is equivalent to label 6, then label 1 is equivalent to label 6 by transitivity, and all three labels are set to 1. After this resolution, the labels are renumbered consecutively with natural numbers, and the original label map is rewritten with the new labels. Pixels that share the same label belong to the same connected region.
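As an illustration, the sketch below substitutes SciPy's connected component labeling for the two-pass procedure just described (the two are equivalent under 8-connectivity) and suppresses regions smaller than S pixels per Eq. (10).

```python
import numpy as np
from scipy.ndimage import label

def drop_small_regions(B, S=100):
    """Suppress connected regions of B smaller than S pixels, Eq. (10)."""
    labels, n = label(B, structure=np.ones((3, 3)))  # 8-connectivity
    areas = np.bincount(labels.ravel())              # areas[l]: pixels with label l
    keep = areas > S                                 # keep only the large regions
    keep[0] = False                                  # label 0 is the background
    return keep[labels].astype(np.uint8)
```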
Fig. 9 shows the process of our motion segmentation method. The video in Fig. 9 is the same as that in Fig. 5 and Fig. 6. The large motion area spans five frames (N1 = 5), and another four frames are used for segmentation (N2 = 4). N1 and N2 are selected according to the motion vector. In our work, N1 + N2 ≤ 1/α needs to be satisfied. By the definition of the small motion vector, α ≤ 0.5; thus, N1 + N2 = 2 always satisfies this constraint, and N1 and N2 should each be at least 1. If N1 is too small, detecting the motion area under the small motion vector characteristic of high-frame-rate videos is difficult, as in Fig. 5. If N2 is too small, the noise influence cannot be effectively suppressed. We choose N1 and N2 as
\(N_{1} \approx N_{2} \approx 1 /(2 \alpha)\). (11)
Fig. 9. Motion segmentation process.
If the velocity is low or the frame rate is high, then a small motion vector is obtained, and large N1 and N2 can be used to improve the accuracy of the motion segmentation results. The small patch threshold S is set to 100, which is approximately a 10 × 10 patch and is in line with the noise scale.
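Eq. (11) reduces to a one-line heuristic; a sketch, with the motion ratio α of the main movement assumed to be supplied by the user:

```python
def choose_window(alpha):
    """Pick N1 and N2 from the motion ratio alpha, Eq. (11)."""
    n = max(1, round(1.0 / (2.0 * alpha)))  # N1 ~ N2 ~ 1/(2*alpha), at least 1
    return n, n
```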
Finally, the motion estimation for each pair of successive images within the N1 frames is calculated as follows:
\(\boldsymbol{\Delta}_{k}^{k-1+j, k+j}=\tilde{\mathbf{B}}_{k}^{k-1, k-1+N_{1}} \cdot\left|G\left(\mathbf{x}_{k-1+j}\right)-G\left(\mathbf{z}_{k+j}\right)\right|\), (12)
where j = 0,1,⋯, N1 -1.
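Finally, a sketch of Eq. (12) under the same assumptions: the refined map masks the Gaussian-prefiltered frame difference, and the squared result supplies Q in Eq. (4).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def motion_deviation(B_tilde, x_prev, z_next, sigma_g=5.0):
    """Masked motion-caused deviation for one frame pair, Eq. (12)."""
    g_x = gaussian_filter(x_prev.astype(np.float64), sigma_g, truncate=2.0)
    g_z = gaussian_filter(z_next.astype(np.float64), sigma_g, truncate=2.0)
    return B_tilde * np.abs(g_x - g_z)  # Q = deviation**2 in Eq. (4)
```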
5. Experiments and Analysis
The performance of the proposed algorithm is evaluated in this section. The experiments are divided into two parts: the first is on synthetic noisy videos, whereas the second is on real noisy videos captured in ultra-low illumination. The experiments are conducted in the luminance channel of the video because minimal color can be captured in ultra-low illumination. Three state-of-the-art video denoising methods are chosen for comparison: the 2D block matching method VBM3D [2], the 3D domain transformation method 3D shearlets [5], and the explicit motion estimation method ST-GSM [7]. The toolboxes of these methods can be downloaded from the authors' websites [29-31]. The objective criterion peak signal-to-noise ratio (PSNR) is employed for quantitative evaluation and is defined as
\(P S N R=10 \log _{10}\left(\frac{L^{2}}{M S E}\right)\), (13)
where L is the dynamic range of the image; in our experiments, L equals 255 for 8-bit images. MSE is the mean squared error between the original image and the corrupted or denoised image. A PSNR score is computed for each frame of the noisy or restored video, and the final PSNR of a video is the average of the per-frame scores. The denoised pixel intensity range is 0 to 255.
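For completeness, a minimal sketch of Eq. (13) and of the per-video average described above:

```python
import numpy as np

def psnr(ref, test, L=255.0):
    """Peak signal-to-noise ratio of one frame, Eq. (13)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return 10.0 * np.log10(L ** 2 / mse)

def video_psnr(ref_frames, test_frames):
    """Final video score: the average of the per-frame PSNR values."""
    return float(np.mean([psnr(r, t) for r, t in zip(ref_frames, test_frames)]))
```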
5.1 Synthetic video
The noise-free videos, which function as the ground truth, are captured by our high-frame-rate camera (JVC GC-P100). The frame rate is 250 fps, the frame resolution is 640 × 360, and the duration is 250 frames. Three videos are captured, namely, (1) a man moving from right to left (MovR2L), (2) a man moving from far to near (MovF2N), and (3) a waving hand (Waving). The noise is AWGN with standard deviation σn. Three noise levels (σn = 80, 100, and 120) are chosen to simulate large-scale noise in ultra-low illumination.
A common parameter among all methods, including ours, is the noise level estimate; in our experiments, it is set equal to σn. The parameters of the other algorithms are set to their default values. For our algorithm, the Gaussian prefilter is a 20 × 20 kernel with σG = 5, the intensity threshold is t = 5, and the small patch threshold is S = 100. N1 = 6 and N2 = 5 frames are used for motion segmentation. The spatial distance Gaussian kernel of the spatial bilateral filter is 5 × 5 with a standard deviation of 3, and the standard deviation of the pixel intensity difference kernel is 5.
The PSNR comparison in Table 1 indicates that the proposed algorithm is superior to the other methods in the overall evaluation. The processing time is shown in Table 2. The computer configuration is an Intel Core i5-2430M CPU at 2.40 GHz, 8 GB RAM, and 64-bit Windows 7. Table 2 shows that the runtime of our algorithm is shorter than that of the other methods.
Table 1. PSNR (dB) comparison for three large noise levels.
Table 2. Processing time (seconds) comparison for three large noise levels.
In detail, the per-frame PSNR is shown in Fig. 10. Our algorithm outperforms the other methods after an initialization of approximately 50 frames. The advantage of our method can be attributed to its excellent denoising performance on the background: even the outline of a bunch of balloons, shown in the enlarged patch in Fig. 11, is restored. Our robust motion segmentation method ensures good temporal denoising by the Kalman filter. Several segmentation examples are provided in Fig. 12.
Fig. 10. PSNR comparison for the synthetic videos. (a1) to (a3) MovR2L, MovF2N, and Waving with σn = 80. (b1) to (b3) MovR2L, MovF2N, and Waving with σn = 100. (c1) to (c3) MovR2L, MovF2N, and Waving with σn = 120.
Fig. 11. Denoising result for frame 250 of the “MovR2L” video. (a) Noisy input. (b) Ground truth. (c) VBM3D. (d) ST-GSM. (e) 3D shearlets. (f) Proposed method.
Fig. 12. Several examples of our motion segmentation method. The first row is from the MovR2L video, the second row from the MovF2N video, and the third row from the Waving video.
One shortcoming of our algorithm lies in the motion area, where only a few frames can be employed for Kalman filtering, and the spatial bilateral filter alone is insufficient to obtain a good result. When the object moves fast or the frame rate is low, the motion vector and motion area are large, and the denoising performance of our algorithm degrades. The moving leg in Fig. 13 has a small motion vector, so the influence of noise is not obvious. In Fig. 14(a), the hand stops for an instant as it turns back; the motion vector becomes zero, and a good denoising result is achieved. However, the influence of noise appears over a relatively large motion area when the hand is moving, as shown in Fig. 14(b). The peaks in Fig. 10 (a3), (b3), and (c3) result from this instantaneous stop-and-move switch. In the extreme case when the entire visual field is moving, the algorithm degrades to spatial bilateral filtering.
Fig. 13. Denoising result for frame 215 of the “MovF2N” video. (a) Noisy input. (b) Ground truth. (c) VBM3D. (d) ST-GSM. (e) 3D shearlets. (f) Proposed method.
Fig. 14. Denoising result of our algorithm for the “Waving” video. (a) Frame 150. (b) Frame 170.
5.2 Real video
We also test our algorithm on real videos captured in ultra-low illumination by using a high-frame-rate sensor (Viimagic 9222B). The frame rate is 240 fps, and the frame resolution is 720 × 480. The illumination during video capture is approximately 0.01 lux. We increase the global gain of the sensor to obtain a bright image. Three videos are again used for the experiments, namely, (1) a man moving from left to right (MovL2R), (2) a man moving from far to near (MovF2N), and (3) a waving hand (Waving). The MovL2R video is the one shown in Fig. 3.
The noise estimator for all methods is set to 100, and the other parameter settings are the same as those in Subsection 5.1. A sample denoising result is presented in Fig. 15. Our algorithm again produces the clearest background without artifacts, while the moving objects are preserved.
Fig. 15. Denoising results for videos with real noise captured under an illumination of 0.01 lux. (a1) to (a5) MovL2R video. (b1) to (b5) MovF2N video. (c1) to (c5) Waving video.
The experiments prove that our algorithm is suitable for videos with a still background and a small motion vector. They also demonstrate that neglecting small changing patches is valuable for videos with severe noise: good motion estimation is obtained, and the main bodies of moving objects are maintained.
6. Conclusion
In this study, we propose a denoising algorithm for high-frame-rate videos in ultra-low illumination. Kalman filtering is used as the temporal denoising method. Imaging process noise is modeled as a result of motion, whereas observation noise is caused by the camera. The bilateral filter is used to aid denoising in the motion area. Kalman temporal denoising and bilateral spatial denoising are combined by the Kalman gain, which indicates the degree of motion. Kalman filtering provides a good estimation of the background, but motion would be blurred if no motion estimation method were implemented. Reliable motion estimation is therefore the key to an effective Kalman filtering framework.
We exploit two features of high-frame-rate videos in ultra-low illumination for motion segmentation. The first feature is the small motion vector of moving objects in high-frame-rate videos, from which it follows that the motion area between two close frames is contained within the motion area between two relatively far frames (Fig. 7); this containment rule is valid when the motion vector is small. The second feature is that small changing patches can reasonably be neglected in ultra-low illumination because detail changes are lost in noise and only the main structure of the image can be detected. Applying these features yields a robust motion estimation method for high-frame-rate videos in ultra-low illumination. A large motion area between two spaced frames is first segmented with the use of several extra frames under the containment rule and through the neglect of small patches [Eqs. (8)-(10)]. Then, the motion of each pair of successive images is estimated in the segmented areas [Eq. (12)].
The experiments on videos with synthetic and real noise demonstrate that our algorithm performs better than other state-of-the-art methods, particularly in the background. Regarding moving objects, the main body is maintained because of our effective motion segmentation scheme, although the denoising performance is not as good as that in the background. When the motion vector is small, the motion area is also small, and the denoising performance improves. A small motion vector can be achieved when objects moving at a normal velocity are captured with a high-frame-rate camera. Our algorithm is thus suitable for denoising high-frame-rate videos in ultra-low illumination.
References
- K. Dabov, A. Foi, V. Katkovnik and K. Egiazarian, "Image denoising by sparse 3D transform-domain collaborative filtering," IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2080-2095, August, 2007. https://doi.org/10.1109/TIP.2007.901238
- K. Dabov, A. Foi and K. Egiazarian, "Video denoising by sparse 3-D transform-domain collaborative filtering," in Proc. of 15th European Signal Processing Conf., pp. 145-149, September 3-7, 2007.
- M. Maggioni, G. Boracchi, A. Foi and K. Egiazarian, "Video denoising, deblocking and enhancement through separable 4-D nonlocal spatiotemporal transforms," IEEE Transactions on Image Processing, vol. 21, no. 9, pp. 3952-3966, September, 2012. https://doi.org/10.1109/TIP.2012.2199324
- F. Luisier, T. Blu and M. Unser, "SURE-LET for orthonormal wavelet-domain video denoising," IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, no. 6, pp. 913-919, June, 2010. https://doi.org/10.1109/TCSVT.2010.2045819
- P. S. Negi and D. Labate, "3-D discrete shearlet transform and video processing," IEEE Transactions on Image Processing, vol. 21, no. 6, pp. 2944-2954, June, 2012. https://doi.org/10.1109/TIP.2012.2183883
- M. Protter and M. Elad, "Image sequence denoising via sparse and redundant representations," IEEE Transactions on Image Processing, vol. 18, no. 1, pp. 27-35, January, 2009. https://doi.org/10.1109/TIP.2008.2008065
- G. Varghese and Z. Wang, "Video denoising based on a spatiotemporal Gaussian scale mixture model," IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, no. 7, pp. 1032-1040, July, 2010. https://doi.org/10.1109/TCSVT.2010.2051366
- R. E. Kalman, "A new approach to linear filtering and prediction problems," Journal of Basic Engineering, vol. 82, no. 1, pp. 35-45, March, 1960. https://doi.org/10.1115/1.3662552
- M. Kim, D. Park, D. K. Han and H. Ko, "A novel framework for extremely low-light video enhancement," in Proc. of 2014 IEEE Int. Conf. Consumer Electronics, pp. 91-92, January 10-13, 2014.
- F. Conte, A. Germani and G. Iannello, "A Kalman filter approach for denoising and deblurring 3-D microscopy images," IEEE Transactions on Image Processing, vol. 22, no. 12, pp. 5306-5321, December, 2013. https://doi.org/10.1109/TIP.2013.2284873
- M. Biloslavo, G. Ramponi, S. Olivieri and L. Albani, "Joint Kalman-based noise filtering and motion compensated video coding for low bit rate videoconferencing," in Proc. of 2000 Int. Conf. Image Processing, vol. 1, pp. 992-995, September 10-13, 2000.
- R. Dugad and N. Ahuja, "Video denoising by combining Kalman and Wiener estimates," in Proc. of 1999 Int. Conf. Image Processing, vol. 4, pp. 152-156, October 24-28, 1999.
- M. Mahmoudi and G. Sapiro, "Fast image and video denoising via non-local means of similar neighborhoods," IEEE Signal Processing Letters, vol. 12, no. 12, pp. 839-842, December, 2005. https://doi.org/10.1109/LSP.2005.859509
- A. Buades, B. Coll and J. Morel, "Nonlocal image and movie denoising," International Journal of Computer Vision, vol. 76, no. 2, pp. 123-139, February, 2008. https://doi.org/10.1007/s11263-007-0052-1
- Y. Han and R. Chen, "Efficient video denoising based on dynamic nonlocal means," Image and Vision Computing, vol. 30, no. 2, pp. 78-85, February, 2012. https://doi.org/10.1016/j.imavis.2012.01.002
- Y. Kuang, L. Zhang and Z. Yi, "An adaptive rank-sparsity k-svd algorithm for image sequence denoising," Pattern Recognition Letters, vol. 45, pp. 46-54, August, 2014. https://doi.org/10.1016/j.patrec.2014.03.003
- X. Li and Y. Zheng, "Patch-based video processing: a variational Bayesian approach," IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 1, pp. 27-40, January, 2009. https://doi.org/10.1109/TCSVT.2008.2005805
- I. W. Selesnick and K. Y. Li, "Video denoising using 2d and 3d dual-tree complex wavelet transforms," in Proc. of SPIE 5207, Wavelets: Applications in Signal and Image Processing, pp. 607-618, November 14, 2003.
- H. Rabbani and S. Gazor, "Video denoising in three-dimensional complex wavelet domain using a doubly stochastic modelling," IET image processing, vol. 6, no. 9, pp. 1262-1274, December, 2012. https://doi.org/10.1049/iet-ipr.2012.0017
- D. Labate and P. S. Negi, "3D discrete shearlet transform and video denoising," in Proc. of SPIE 8138, Wavelets and Sparsity XIV, pp. 81381Y-81381Y-11, September, 2011.
- G. Easley, D. Labate and W. Lim, "Sparse directional image representations using the discrete shearlet transform," Applied and Computational Harmonic Analysis, vol. 25, no. 1, pp. 25-46, July, 2008. https://doi.org/10.1016/j.acha.2007.09.003
- H. Tan, F. Tian, Y. Qiu, S. Wang and J. Zhang, "Multihypothesis recursive video denoising based on separation of motion state," IET Image Processing, vol. 4, no. 4, pp. 261-268, August, 2010. https://doi.org/10.1049/iet-ipr.2009.0279
- C. Liu and W. T. Freeman, "A high-quality video denoising algorithm based on reliable motion estimation," in Proc. of 2010 European Conf. Computer Vision, vol. 6313, pp. 706-719, September 5-11, 2010.
- T. Portz, L. Zhang and H. Jiang, "High-quality video denoising for motion-based exposure control," in Proc. of 2011 IEEE Int. Conf. Computer Vision Workshops, pp. 9-16, November 6-13, 2011.
- Z. Cong, Z. Gao and X. Zhang, "A practical video denoising method based on hierarchical motion estimation," in Proc. of 2013 IEEE Int. Symposium on Broadband Multimedia Systems and Broadcasting, pp. 1-5, June 5-7, 2013.
- V. Zlokolica, A. Pizurica and W. Philips, "Wavelet-domain video denoising based on reliability measures," IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 8, pp. 993-1007, August, 2006. https://doi.org/10.1109/TCSVT.2006.879994
- C. Tomasi and R. Manduchi, "Bilateral filtering for gray and color images," in Proc. of 1998 6th Int. Conf. on Computer Vision, pp. 839-846, January 4-7, 1998.
- R. C. Gonzalez and R. E. Woods, Digital Image Processing, 2nd Edition, Prentice Hall, New Jersey, 2002.
- VBM3D denoising toolbox [Online], http://www.cs.tut.fi/~foi/GCF-BM3D/.
- ST-GSM denoising toolbox [Online], https://ece.uwaterloo.ca/~z70wang/research/stgsm/.
- Shearlet denoising toolbox [Online], http://www.math.uh.edu/~dlabate/software.html.