# 1. Introduction

3D video technologies now offer users a more realistic viewing experience. Since most 3DTVs and 3D movies currently on the market use stereoscopic displays, most available 3D content is in the stereoscopic format. However, stereoscopic displays provide a narrow viewing angle and require glasses for watching 3D content. To solve these problems, auto-stereoscopic displays and FTV (Free-viewpoint TV) have been studied. Auto-stereoscopic displays provide 3D depth perception without glasses by simultaneously presenting a number of slightly different views. FTV enables users to watch a scene from any arbitrary viewpoint. However, both approaches require more spectrum bandwidth for transmission, large data storage, and a costly multi-camera setup [1-2].

In general, an auto-stereoscopic multi-view 3D display system requires multi-view images as input. There are three methods of acquiring them. First, we can capture multi-view images by using as many cameras as the number of required views; however, synchronizing and calibrating multiple cameras is very difficult. The second option is to use a camera system that acquires a color image and its corresponding depth map, and then to synthesize virtual intermediate views from the acquired data. The last option is to estimate the disparity from a stereo pair acquired by two color cameras and synthesize multi-view images from it [3].

MPEG regards FTV as the most challenging 3D media service and started international standardization activities in 2002. The 3DV (3D Video) group in MPEG is working on a standard that can serve a variety of 3D display formats. 3DV is a new framework that includes a coded representation of multi-view video and depth information to support the generation of high-quality intermediate views. Consequently, depth estimation and view synthesis are two critical processes in 3DV, and we therefore need a high-quality view synthesis algorithm. A limited number of camera images can be used to generate multi-view images with the DIBR (depth image based rendering) algorithm [4-5].

Depth image-based rendering (DIBR) is one of the most popular techniques for rendering virtual views. A color image and its associated per-pixel depth map are used for 3D warping, which is based on the principles of projective geometry. However, extracting an accurate disparity or depth map is time-consuming and very difficult. Moreover, boundary noise and holes appear in the synthesized image because of occlusion and inaccurate disparity. Boundary noise arises from the boundary mismatch between the depth and texture images during the 3D warping process, and it usually causes visible defects in the generated virtual view. In addition, common-holes are created while synthesizing a virtual view. In general, common-holes are recovered using the most similar adjacent regions in the warped reference view, but they are difficult to recover completely. Many researchers have worked on these problems, yet the resulting visual quality is not satisfactory. Therefore, we need a new hole filling process that achieves high performance [6].

To fill the common-hole, linear interpolation and inpainting methods have been suggested. Linear interpolation linearly blends the pixel values at opposite ends of the common-hole area. This process requires little time, but the quality of the filled common-hole is insufficient. Inpainting was originally used to recover damaged parts of an image by estimating values from the given color information [7]. VSRS (View Synthesis Reference Software) version 3.5 of MPEG 3DV has already tried to resolve these problems, offering three choices: a pixel-based inpainting method [8], a depth-based weighted average method [9], and an interpolation method using a bilateral filter [10]. However, these produce a blurring effect, so their visual quality is still unsatisfactory. To address such problems, texture synthesis based inpainting methods have been proposed. One such method, proposed by Criminisi, is exemplar-based inpainting [11]. It prioritizes the pixels in the region to be recovered and finds the area most similar to the patch around the highest-priority pixel. Y. J. Kim used this method directly to fill common-holes, while Daribo additionally took depth information into account [12, 13]. However, all of the methods mentioned above tend to erroneously use the object's information to fill the common-holes. Moreover, because they do not consider other frames, they may produce a flickering defect when applied to successive frames.

In this paper, we propose a more effective algorithm that enhances the quality of the synthesized view by properly filling common-holes. First, to remove boundary noise, we find occlusion regions in the synthesized view and merge them into the common-hole region. Then, we determine the pixel values of common-holes using both a spiral weighted average algorithm and a gradient searching algorithm. The spiral weighted average algorithm preserves object boundaries relatively well by exploiting depth information, but it produces a color spreading effect around the common-hole regions. To solve this problem, we use the gradient searching algorithm, which keeps high-frequency components and thus preserves details in the synthesized view. We combine the two algorithms with an adaptive weighting factor to effectively exploit the strengths of each in the virtual view synthesis process. We also use a probability mask to remove the flickering defect.

This paper is organized as follows. Section 2 presents the 3D warping process and problems. The proposed algorithm is presented in Section 3. Then, the performance of the proposed algorithm is demonstrated in Section 4. Finally, Section 5 concludes this paper.

# 2. 3D Warping and Problems

## 2.1 Virtual viewpoint image synthesis using 3D warping

3D warping determines the world coordinates corresponding to a given image using the intrinsic and extrinsic camera parameters, and then generates the desired viewpoint image by re-projecting into 2D space using the virtual viewpoint camera parameters. For 3D warping, the 2D image coordinate system must be converted to the 3D world coordinate system; to reach the 3D world coordinate system, the given 2D image coordinates are first converted to the 3D camera coordinate system. The world and camera coordinate systems are both 3D, and conversion between the two can be achieved through rotation and translation as shown in Fig. 1. These two operations are defined as the camera's extrinsic parameters.

**Fig. 1.**Transformation of the world coordinate to the camera coordinate

The conversion of the camera coordinate system to the 2D image coordinate system can be explained through the camera's geometry as shown in Fig. 2. Fig. 2(a) shows the 3D model of the pin-hole camera, and Fig. 2(b) shows the 2D model. Generally, light passes through the pin-hole in a straight line and forms a reversed image at the −f location on the Z-axis, where f is the focal length of the camera. For analysis, however, the image plane is moved to the focal length +f on the Z-axis.

**Fig. 2.**Geometrical structure of the pin-hole camera; (a) 3D and (b) 2D

The projection of the object’s 3D coordinates of the camera coordinate system on the image plane can be explained using similar triangles that are formed using the focal length and the coordinates of the object as shown in Fig. 2(b). This conversion process is defined as the intrinsic parameter.

Using the extrinsic and intrinsic parameters of the camera, the 3D coordinates in the world coordinate system can be converted to the 2D coordinates in the image plane as shown in Eq. (1).
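
The body of Eq. (1) is not reproduced in the text. Based on the variable definitions that follow, it is the standard perspective projection (the homogeneous scale factor s is our notation):

```latex
s \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
  = K \left[ R \,\middle|\, T \right]
    \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
\tag{1}
```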

where x, y represent the 2D coordinates on the image plane, K is the intrinsic parameter matrix of the camera, R is the rotation matrix of the camera, T is the translation vector of the camera, and X, Y, Z are the 3D coordinates in the world coordinate system. K[R|T] is defined as the projection matrix.

Through the inverse operation of the matrix in Eq. (1), the 2D coordinates can be converted to the world coordinate system as shown in Eq. (2). At this time, the disparity information D from Eq. (3) is needed to find the real depth value Z.
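
The bodies of Eqs. (2) and (3) are omitted in the text. Plausible reconstructions, consistent with the definitions that follow, are given below: Eq. (2) is the inverse of the projection in Eq. (1), and Eq. (3) is the depth-from-disparity conversion commonly used for MPEG test material (assumed here; the scale factor s and 8-bit disparity normalization are our assumptions):

```latex
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}
  = R^{-1} \left( s\, K^{-1} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} - T \right)
\tag{2}
```

```latex
Z(i,j) = \cfrac{1}{\cfrac{D(i,j)}{255}\left(\cfrac{1}{MinZ} - \cfrac{1}{MaxZ}\right) + \cfrac{1}{MaxZ}}
\tag{3}
```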

where Z(i, j) and D(i, j) are the depth and disparity values at the (i, j) coordinate within the image, and MinZ and MaxZ represent the minimum and maximum values of Z, respectively.

To produce the virtual viewpoint image, the intrinsic and extrinsic parameters of a virtual camera located at the virtual viewpoint need to be defined. In general, the intrinsic parameters are determined by the camera's internal structure, so the intrinsic parameters of the reference viewpoint camera can be reused for the virtual viewpoint camera. The extrinsic parameters are used after being converted to the location of the virtual viewpoint camera. The converted 3D world coordinates and the virtual viewpoint camera parameters are applied to Eq. (1) to obtain Eq. (4), which re-projects to the image coordinates of the virtual viewpoint.
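
The body of Eq. (4) is not reproduced in the text; applying the virtual camera's parameters to the projection of Eq. (1) gives the following reconstruction (s is again a homogeneous scale factor):

```latex
s \begin{bmatrix} x_v \\ y_v \\ 1 \end{bmatrix}
  = K_V \left[ R_V \,\middle|\, T_V \right]
    \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
\tag{4}
```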

where xv, yv represent the 2D coordinates of the virtual viewpoint image, and KV, RV, TV are the intrinsic parameter matrix, rotation matrix and translation vector of the virtual viewpoint camera, respectively [14].

Fig. 3 shows an example of a virtual viewpoint image produced by 3D warping. As shown in Fig. 3(b), regions that were occluded in the reference viewpoint image appear as empty regions in the generated virtual viewpoint image. These are the common-hole regions.

**Fig. 3.**Generation of a virtual view-point image; (a) reference color and depth images and (b) generated virtual viewpoint image

## 2.2 Problems of the generated virtual view

### 2.2.1 Boundary noise

As a result of 3D warping in the view synthesis process, boundary noise remains because of the mismatch between the boundaries of the depth map and the texture image in a given reference view. An example of boundary noise is shown in Fig. 4; the remaining boundary components are easy to see inside the circle. The boundary noise looks like an after-image of an object and occurs around object boundaries.

**Fig. 4.**Boundary noises

### 2.2.2 Common-hole

Common-holes created while synthesizing a virtual view result in visible defects. In general, common-holes are recovered using the most similar adjacent regions in the synthesized reference views, but they are difficult to recover completely. An example of a common-hole is shown in Fig. 5.

**Fig. 5.**Common-holes

### 2.2.3 Flickering defect

Existing algorithms do not use any information from the other frames. Consequently, we may have a flickering defect around the filled common-hole regions when they are applied to the successive frames in a video content.

# 3. Proposed Algorithm

In this section, we describe the proposed common-hole filling algorithm. Fig. 6 shows its block diagram. First, we perform 3D warping with a color image and its corresponding depth image. Then, we search for boundary noise around the holes and merge it into the common-hole region. Next, we fill the common-hole region separately using the spiral weighted average and the gradient searching algorithms, and then combine these two results using an adaptive weighting factor. Finally, we remove the flickering defect using a probability mask.

**Fig. 6.**Block diagram of the proposed algorithm

## 3.1 Boundary noise detection

Boundary noise occurs due to the boundary mismatch between the depth and texture images during the 3D warping process, appearing as an object's after-image as shown in Fig. 7(a). If the common-hole is filled without removing this noise, the quality of the filled common-hole is degraded as shown in Fig. 7(b), because the color information of the boundary noise is used to fill the common-hole.

**Fig. 7.**(a) Boundary noise image, (b) hole filling result with boundary noise, and (c) hole filling result after boundary noise is removed

In the proposed algorithm, we first compute the average of a fixed region of the background as shown in Fig. 8. We then compare the nearest pixels to this average and classify a pixel as boundary noise when the absolute difference exceeds a threshold. To handle cases where boundary noise appears continuously, if a pixel is determined to be boundary noise, the next pixel is compared as well to detect a continuous run of noise. The detected boundary noise is merged into the common-hole region and removed. Fig. 7(c) shows the image after removing the boundary noise and filling the common-hole with the proposed algorithm. Compared to the image filled without removing the boundary noise, it shows better quality.

**Fig. 8.**Proposed algorithm to remove boundary noise
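
The detection step above can be sketched as follows. This is a simplified, single-channel illustration in which the background is assumed to lie to the right of the hole; the function name, window size, threshold, and maximum run length are illustrative choices, not values from the paper:

```python
import numpy as np

def expand_boundary_noise(gray, hole, bg_window=4, threshold=30.0, max_run=3):
    """Merge detected boundary noise into the common-hole mask.

    gray : 2D float array (texture intensity of the warped view)
    hole : 2D bool array, True where a common-hole pixel lies
    """
    noise = np.zeros_like(hole)
    h, w = gray.shape
    for y in range(h):
        for x in range(w - 1):
            # Right edge of a hole: hole pixel with a non-hole pixel beside it.
            if hole[y, x] and not hole[y, x + 1]:
                # Average a fixed background window beyond the suspect pixels.
                start = min(x + 1 + max_run, w)
                window = gray[y, start:start + bg_window]
                if window.size == 0:
                    continue
                bg_avg = window.mean()
                # Walk outward; flag pixels deviating from the background
                # average, continuing while the noise run is unbroken.
                for k in range(1, max_run + 1):
                    if x + k >= w or hole[y, x + k]:
                        break
                    if abs(gray[y, x + k] - bg_avg) > threshold:
                        noise[y, x + k] = True
                    else:
                        break
    return hole | noise
```

A pixel flagged by this sketch is treated exactly like a common-hole pixel in the subsequent filling steps, so the after-image color never contaminates the fill.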

## 3.2 Common-hole filling algorithm

### 3.2.1 Determining the order of the common-hole filling algorithm considering the background region

Fig. 9 shows the procedure of existing common-hole filling algorithms: they start with the outermost pixels and end at the center as shown in Fig. 9(a). In this case, it is very likely that the object's color information near the hole area is used, lowering the quality of the resulting image as shown in Fig. 9(b). Therefore, the proposed algorithm takes the virtual and reference viewpoints into account and fills the region near the background first. Fig. 10 shows the filling order when the virtual viewpoint is to the right of the reference viewpoint. In this case, the common-hole appears on the right side of the object, so the region to the right of the common-hole is the background, and the common-hole is filled from that side first, as shown in Fig. 10(a). As seen in Fig. 10(b), the proposed algorithm is less influenced by the object's information than the existing algorithms.

**Fig. 9.**(a) Hole filling from outside pixels; (b) results

**Fig. 10.**(a) Proposed order of hole filling; (b) results
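
The background-first ordering can be sketched per scanline as follows; this is a minimal illustration assuming a purely horizontal camera shift (the function and parameter names are ours):

```python
def fill_order(hole_row, virtual_right_of_reference=True):
    """Columns of common-hole pixels in one scanline, background side first.

    If the virtual view lies to the right of the reference view, the
    background borders the hole on its right, so holes are processed
    right-to-left; otherwise left-to-right.
    """
    cols = [i for i, is_hole in enumerate(hole_row) if is_hole]
    return sorted(cols, reverse=virtual_right_of_reference)
```

For example, `fill_order([False, True, True, False])` yields `[2, 1]`, so the hole pixel adjacent to the background is filled first.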

### 3.2.2 Spiral weighted average algorithm

The procedure of the spiral weighted average algorithm in Fig. 11 is as follows [15]. (1) Find the inner boundary of a common-hole region, pick a pixel on it, and start the filling process at this pixel. (2) Assign to this initial common-hole pixel the smallest depth value among its 8-neighbors that do not belong to the common-hole region, then perform the spiral search around the initial common-hole pixel within the search range as shown in Fig. 12. (3) During the spiral search, the texture and depth values of every candidate pixel whose depth differs from that of the initial pixel by less than a threshold are stored, weighted according to the distance between the initial and current pixels. The weighted average of all stored texture and depth values is then assigned as the new texture and depth values of the initial pixel. This process is called spiral weighted averaging and is shown in Eq. (5).

**Fig. 11.**Flow chart of the spiral weighted average algorithm

**Fig. 12.**Spiral searching process
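
The body of Eq. (5) is not reproduced in the text; a reconstruction consistent with the variable definitions given below (the depth threshold θ and the use of W as an inverse-distance weight are our assumptions) is:

```latex
S_T(x,y,t) = \frac{\displaystyle\sum_{(p,q)\in SR,\; |D(p,q)|<\theta} W(p,q)\, T(p,q,t)}
                  {\displaystyle\sum_{(p,q)\in SR,\; |D(p,q)|<\theta} W(p,q)},
\qquad
S_D(x,y,t) = \frac{\displaystyle\sum_{(p,q)\in SR,\; |D(p,q)|<\theta} W(p,q)\, d(p,q,t)}
                  {\displaystyle\sum_{(p,q)\in SR,\; |D(p,q)|<\theta} W(p,q)}
\tag{5}
```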

where T(p,q,t) and d(p,q,t) are the stored texture and depth values of a pixel (p,q) in the search range SR. ST(x,y,t) and SD(x,y,t) are the newly assigned texture and depth values of the hole at (x,y), respectively. D(p,q) is the depth difference between the initial pixel at (x,y) and the current pixel at (p,q), and W(p,q) indicates the weighting factor which is the Euclidian distance between two pixels at (x,y) and (p,q).
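
Steps (1)-(3) above can be sketched compactly as follows, assuming grayscale images and inverse-distance weights; the function names, square-ring spiral approximation, search radius, and depth threshold are illustrative:

```python
import numpy as np

def spiral_offsets(radius):
    # (dy, dx) offsets on expanding square rings, approximating the
    # spiral scan of Fig. 12.
    for r in range(1, radius + 1):
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                if max(abs(dy), abs(dx)) == r:
                    yield dy, dx

def spiral_weighted_average(texture, depth, hole, y, x, radius=5, depth_thr=10.0):
    """Fill the hole pixel (y, x): returns (new_texture, new_depth)."""
    h, w = depth.shape
    # Step (2): assign the smallest non-hole depth among the 8-neighbors.
    neigh = [depth[y + dy, x + dx]
             for dy in (-1, 0, 1) for dx in (-1, 0, 1)
             if (dy or dx) and 0 <= y + dy < h and 0 <= x + dx < w
             and not hole[y + dy, x + dx]]
    if not neigh:
        return texture[y, x], depth[y, x]
    d0 = min(neigh)
    # Step (3): inverse-distance weighted average over depth-consistent
    # candidates in the search range (Eq. (5)).
    num_t = num_d = den = 0.0
    for dy, dx in spiral_offsets(radius):
        p, q = y + dy, x + dx
        if (0 <= p < h and 0 <= q < w and not hole[p, q]
                and abs(depth[p, q] - d0) < depth_thr):
            wgt = 1.0 / np.hypot(dy, dx)
            num_t += wgt * texture[p, q]
            num_d += wgt * depth[p, q]
            den += wgt
    if den == 0.0:
        return texture[y, x], d0
    return num_t / den, num_d / den
```

Because only candidates within the depth threshold of the assigned background depth contribute, foreground colors are excluded and object boundaries are preserved.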

### 3.2.3 Gradient searching algorithm

The spiral weighted average algorithm in the previous section produces a color spreading effect due to its low-pass filtering property. To solve this problem, we use gradient information, which can preserve details in the synthesized view, as follows [15]. (1) Calculate the intensity differences between each of the initial hole's 8-neighbors (yellow region in Fig. 13(a)) and their adjacent pixels in 12 directions (red region in Fig. 13(a)), and select the pixel with the largest difference. (2) Repeat (1) at the pixel chosen in (1), but only in the three similar directions shown in Fig. 13(b). (3) Repeat (2) for all pixels in the predefined search range. (4) Finally, assign to the hole the average value of the pixels determined in steps (1) through (3).

**Fig. 13.**Gradient searching algorithm; (a) step (1) and (b) step (2)
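
A much-simplified sketch of this search follows: it is reduced to the 8 compass directions and a fixed step count for brevity (the paper uses 12 directions with direction-dependent refinement), and all names and parameters are illustrative:

```python
import numpy as np

def gradient_search_fill(texture, hole, y, x, steps=3):
    """Fill hole pixel (y, x) with the average of the strongest-gradient
    pixel found along each search direction (simplified sketch)."""
    dirs = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
            (0, 1), (1, -1), (1, 0), (1, 1)]
    h, w = texture.shape
    picked = []
    for dy, dx in dirs:
        p, q = y + dy, x + dx
        best, best_grad = None, -1.0
        for _ in range(steps):
            p2, q2 = p + dy, q + dx
            # Stop at image borders or when re-entering the hole.
            if not (0 <= p2 < h and 0 <= q2 < w) or hole[p, q]:
                break
            # Intensity difference to the next pixel along this direction.
            grad = abs(texture[p2, q2] - texture[p, q])
            if grad > best_grad:
                best, best_grad = (p, q), grad
            p, q = p2, q2
        if best is not None:
            picked.append(texture[best])
    return sum(picked) / len(picked) if picked else texture[y, x]
```

By averaging only the pixels where the gradient is strongest, the fill retains high-frequency detail rather than smearing it, which is the property the adaptive blending step exploits.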

### 3.2.4 Adaptive weighted blending

The spiral weighted average algorithm uses depth information to preserve object boundaries relatively well, but it causes a color spreading effect in the filled common-hole region. The gradient searching algorithm, on the other hand, uses gradient information and is therefore able to preserve details, i.e., the high-frequency components of the synthesized view. Therefore, by combining the two results we can offset the defects of each algorithm. We use an adaptive weighting factor to properly combine the two results as in Eq. (6) [15].
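
The body of Eq. (6) is not reproduced in the text; a reconstruction consistent with the variable definitions below is the convex combination (the output symbol F_T is our notation; the text does not name it):

```latex
F_T(x,y,t) = g(x,y)\, G_T(x,y,t) + \bigl(1 - g(x,y)\bigr)\, S_T(x,y,t)
\tag{6}
```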

where GT(x, y, t) and ST(x, y, t) are the texture values obtained by the gradient searching algorithm and the spiral weighted average algorithm in the current frame t, respectively, and g(x, y) is the normalized gradient value used as the weighting factor.

### 3.2.5 Removal of temporal inconsistency

Neither the spiral weighted average algorithm nor the gradient searching algorithm uses information from other frames. Consequently, a flickering defect may appear around the filled common-hole regions when they are applied to successive frames of a video. We solve this problem by defining a probability mask as in Eq. (7) [15].

where P(x,y,t) is the probability mask value of the current frame at (x,y), whose initial value is either 0.4 (hole) or 0 (not hole), H is the set of hole pixels, and F(x,y,t) is the final texture value.

Eq. (7) means that if a pixel in the current frame belongs to a former common-hole region, more information from the previous frame is used. Fig. 14 shows an example of a probability mask.

**Fig. 14.**A probability mask; (a) synthesized view with common-hole and (b) probability mask image
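
Since the body of Eq. (7) is not reproduced in the text, the following is only a plausible sketch of the temporal blending it describes; the blending rule and all names are our assumptions. Pixels with a higher mask value borrow more from the previous frame, while non-hole pixels keep their current value:

```python
import numpy as np

def temporal_blend(filled_cur, final_prev, p_mask):
    """Blend the current filled frame with the previous final frame.

    p_mask is 0 for non-hole pixels (keep the current value) and a
    positive weight (e.g. the initial 0.4) where a common-hole was filled.
    """
    return p_mask * final_prev + (1.0 - p_mask) * filled_cur
```

Pulling part of the value from the previous final frame in former hole regions keeps the fill consistent across frames, which suppresses the flicker.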

The flickering defect around the filled common-hole regions is properly reduced by the proposed algorithm, as shown in Fig. 15.

**Fig. 15.**Generated virtual view (6th view in the 0th frame (upper row) and 1st frame (bottom row) of “Book_Arrival”); (a) after hole filling by inpainting algorithm and (b) by the proposed algorithm

# 4. Experimental Results

## 4.1 Extrapolation case

As shown in Fig. 16, we generated a virtual viewpoint image positioned outside the reference views (the extrapolation view synthesis case) with the proposed algorithm. We compared its performance with a bilinear interpolation algorithm, a conventional inpainting algorithm [8], and the common-hole filling algorithms in the VSRS 3.5 alpha version [9, 10].

**Fig. 16.**Extrapolation view synthesis case

We used "Book_Arrival", "Café", "Lovebird1" and "Mobile" as test sequences. Note that all data for each test sequence in the tables that follow are average values over all temporal frames of the sequence, as shown in Table 1. The number under "reference" indicates the position of the reference viewpoint camera, and the number under "virtual" indicates the position of the virtual viewpoint camera.

**Table 1.**Specifications of the test sequences for extrapolation

As shown in Figs. 17 and 18, the proposed algorithm reduced visual incongruity much better than the conventional algorithms. The proposed algorithm clearly outperformed the other algorithms in the zoomed red-marked regions of Figs. 17 and 18.

**Fig. 17.**Generated virtual view (5th view in the 0th frame of “Cafe”); (a) 3D warped image, (b) after hole filling by bilinear interpolation, (c) by in-painting algorithm [8], (d) by VSRS 1 [9], (e) by VSRS 2 [10], and (f) by the proposed algorithm

**Fig. 18.**Generated virtual view (6th view in the 0th frame of “Book_Arrival”); (a) 3D warped image, (b) after hole filling by bilinear interpolation, (c) by in-painting algorithm [8], (d) by VSRS 1 [9], (e) by VSRS 2 [10], and (f) by the proposed algorithm

In Table 2, we also compare the PSNR performance of the algorithms. The proposed algorithm generally performed better than the other algorithms.

**Table 2.**PSNR performance comparison for extrapolation case

## 4.2 Interpolation case

Fig. 19 shows the interpolation view synthesis case. We generated a virtual viewpoint image positioned between the reference views (the interpolation view synthesis case) with the proposed algorithm and compared its performance with the inpainting algorithm and the common-hole filling algorithms in VSRS 3.5 alpha. All data in the table are again average values over all temporal frames of a sequence, as shown in Table 3. Similarly, the numbers under "references" indicate the positions of the reference viewpoint cameras, and the number under "virtual" indicates the position of the virtual viewpoint camera.

**Fig. 19.**Interpolation view synthesis case

**Table 3.**Specifications of the test sequences for interpolation

As shown in Figs. 20 and 21, the proposed algorithm also performed better than the other algorithms in the interpolation case. It clearly outperformed the other algorithms in the zoomed red-marked regions of both Fig. 20 and Fig. 21.

**Fig. 20.**Generated virtual view (3rd view in the 0th frame of “Book_Arrival”); (a) 3D warped image, (b) by in-painting algorithm [8], (c) by VSRS 1 [9], (d) by VSRS 2 [10], and (e) by the proposed algorithm

**Fig. 21.**Generated virtual view (7th view in the 0th frame of “Book_Arrival”); (a) 3D warped image, (b) by in-painting algorithm [8], (c) by VSRS 1 [9], (d) by VSRS 2 [10], and (e) by the proposed algorithm

In Table 4, we also compare the PSNR performance of the proposed algorithm for the interpolation case. Table 4 shows that the proposed algorithm gave better results in general, but the differences are smaller because the common-hole regions in this case are smaller than in the extrapolation case.

**Table 4.**PSNR performance comparison for interpolation case

# 5. Conclusion

In this paper, we proposed an improved algorithm to recover the common-hole regions resulting from the virtual view synthesis process. The proposed algorithm first detects and removes boundary noise. The common-hole regions are then filled using a spiral weighted average algorithm and a gradient searching algorithm, combining the strengths of both. We also reduced the flickering defect around the filled common-hole regions by defining a probability mask.

Experimental results showed that the proposed algorithm enhanced the quality of the synthesized view and reduced the flickering defect much better than the conventional algorithms.