High Efficiency Video Coding (HEVC) [1,2] is a video coding standard that was finalized in January 2013. HEVC is known to provide roughly twice the compression ratio of the previous video coding standard, which makes it promising for the multimedia industry. As the electronics industry pushes large-display products and consumers pursue high-resolution video content, the broadcast community and other key players in the video market have moved quickly to apply the HEVC codec to deliver advanced services. HEVC achieves its high coding efficiency at the cost of increased computational complexity. To apply HEVC in commercial applications, one of the key requirements is fast encoding. To achieve fast encoding, exploiting thread-level parallelism is a widely chosen mechanism, since multi-threading is commonly supported on multi-core computer architectures. HEVC provides several picture partitioning schemes, such as slices, tiles, and wavefront parallel processing (WPP) [5,6,7]. Since slices result in a larger coding loss than tiles, we consider the tile scheme for picture-partitioning parallelism in this paper.
1. Tile-level parallelism
HEVC supports a picture-partitioning structure for parallelism called the tile: a picture is divided into rectangular regions that are encoded independently. There is no dependency between tiles, so several tiles can be encoded in parallel. However, because neighboring blocks across a tile boundary cannot be referenced during encoding, coding efficiency decreases. We have implemented tile-level parallelism on top of the HEVC test model (HM) encoder. Since the HM encoder is a single-threaded codec, it encodes multiple tiles serially and uses a single CABAC engine. We parallelize the encoding of multiple tiles by giving each tile its own independent CABAC engine.
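As a minimal sketch of how a picture might be split into tiles, the following Python snippet computes tile rectangles in units of coding tree units (CTUs). It assumes uniformly spaced tiles and a 64 x 64 CTU size, neither of which is stated above; the function names `uniform_tile_sizes` and `tile_grid` are hypothetical and are not taken from the HM encoder.

```python
def uniform_tile_sizes(pic_size_in_ctus, num_tiles):
    """Sizes (in CTUs) of num_tiles uniformly spaced tile columns or rows."""
    return [
        ((i + 1) * pic_size_in_ctus) // num_tiles - (i * pic_size_in_ctus) // num_tiles
        for i in range(num_tiles)
    ]


def tile_grid(width, height, cols, rows, ctu_size=64):
    """Return (x, y, w, h) rectangles in CTU units for a cols x rows tile grid.

    ctu_size=64 is an assumption for illustration.
    """
    width_in_ctus = -(-width // ctu_size)    # ceiling division
    height_in_ctus = -(-height // ctu_size)
    col_widths = uniform_tile_sizes(width_in_ctus, cols)
    row_heights = uniform_tile_sizes(height_in_ctus, rows)
    tiles = []
    y = 0
    for h in row_heights:
        x = 0
        for w in col_widths:
            tiles.append((x, y, w, h))
            x += w
        y += h
    return tiles


# A Full-HD picture split into a 2 x 2 tile grid.
for rect in tile_grid(1920, 1080, cols=2, rows=2):
    print(rect)
```

Each rectangle can then be handed to its own encoding thread, since no tile depends on another within the same picture.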
2. Frame-level parallelism
Figure 1 illustrates the random access GOP structure defined in . With this GOP structure, the frames in the fourth GOP level can be used for frame-level parallelism. The GOP structure shown in Fig. 1 has a shortcoming: only the GOP level 4 frames can be involved in parallel processing. To improve frame-level parallel scalability, we propose to change the GOP structure as shown in Fig. 2, where the frames in the third GOP level do not reference each other. In this way, we can encode the two frames in GOP level 3 in parallel and then the four frames in GOP level 4 in parallel. We call this scheme frame-level parallelism of GOP level 3&4; Fig. 3 illustrates it.
Fig. 1. HEVC random access GOP structure proposed in 
Fig. 2. The GOP structure proposed in this paper to improve frame-level parallelism
Fig. 3. Frame-level parallelism of GOP level 3&4
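The grouping of frames by GOP level can be sketched as follows. This is an illustration only: it assumes a GOP of eight frames with a dyadic hierarchy (consistent with the two level-3 frames and four level-4 frames described above), numbers the frames 1 to 8 by display position within the GOP, and uses the hypothetical helper names `gop_level` and `frames_by_level`. Whether the frames in one level can really be encoded together depends on the reference structure, as in the proposed GOP structure of Fig. 2.

```python
from collections import defaultdict


def gop_level(pos_in_gop, gop_size=8):
    """GOP level (1 = encoded first) of the frame at position 1..gop_size.

    Assumes gop_size is a power of two and a dyadic hierarchy.
    """
    num_levels = gop_size.bit_length()                      # 8 -> 4 levels
    trailing_zeros = (pos_in_gop & -pos_in_gop).bit_length() - 1
    return num_levels - trailing_zeros


def frames_by_level(gop_size=8):
    """Map each GOP level to the frame positions that belong to it."""
    levels = defaultdict(list)
    for pos in range(1, gop_size + 1):
        levels[gop_level(pos, gop_size)].append(pos)
    return dict(levels)


groups = frames_by_level()
for level in sorted(groups):
    print("level", level, "frames", groups[level])
```

With these assumptions, level 3 contains positions 2 and 6 and level 4 contains positions 1, 3, 5, and 7, which are the candidate groups for frame-level parallel encoding.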
For the previous video coding standard, frame-level parallelism, slice-level parallelism, and approaches combining the two levels of parallelism have been proposed. Since the tile is a newly adopted tool for supporting parallel HEVC encoding, we combine tile-level parallelism and frame-level parallelism in HEVC encoding. To the best of our knowledge, this is the first report of combining tile-level and frame-level parallelism for HEVC. In this paper, we implement tile-level parallel encoding [8,9] and frame-level parallel encoding separately, and propose an effective combined approach that adapts the number of tiles to the number of frames encoded in parallel and the number of available cores.
II. Improvement of parallelism
HEVC allows up to 25 tiles for Full-HD video (1920 x 1080) and up to 110 tiles for 4K UHD video (3840 x 2160). Multi-core architectures have improved significantly in recent years, so many systems now have a large number of cores. Employing two or more parallelism schemes improves parallel scalability. Building on our implementations of tile-level parallelism and frame-level parallelism, we combine the two to improve parallel scalability and to utilize multi-core systems more effectively.
1. Combined Approach 1: Fixed number of tiles in a frame
First, we combine tile-level parallelism and frame-level parallelism as shown in Fig. 4. As an example, we set the number of tiles in a frame to four. When the encoder takes the input frames, it creates as many frame-encoding threads as there are frames to be encoded in parallel, and each frame-encoding thread then creates as many worker threads as there are tiles.
Fig. 4. Combined Approach 1: Tile-level parallelism combined with the frame-level parallelism of GOP level 3&4. The number of tiles is the same for all frames, regardless of the frame-level parallelism. In this example, the number of tiles in a frame is four
Combined Approach 1, illustrated in Fig. 4, has a shortcoming. When frame-level parallelism is employed, the number of tiles encoded in parallel increases by a factor equal to the number of frames encoded in parallel. The resulting imbalance in the number of worker threads along the encoding timeline makes it difficult to run the threads efficiently.
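The two-level thread structure of Combined Approach 1 can be sketched as below. This is not our encoder code: Python's `threading` module is swapped in for the Windows threads APIs purely for illustration, `encode_tile` is a hypothetical placeholder that only records which tile it was asked to encode, and Python threads do not provide real CPU parallelism for compute-bound work. The sketch only shows how the number of worker threads grows with the number of frames encoded in parallel.

```python
import threading


def encode_tile(frame_id, tile_id, log, lock):
    # Hypothetical placeholder for real tile encoding: just record the task.
    with lock:
        log.append((frame_id, tile_id))


def encode_frame(frame_id, num_tiles, log, lock):
    # One frame-encoding thread creates one worker thread per tile.
    workers = [
        threading.Thread(target=encode_tile, args=(frame_id, t, log, lock))
        for t in range(num_tiles)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()


def encode_level(frame_ids, num_tiles):
    """Encode all frames of one GOP level in parallel; return the tile tasks run."""
    log, lock = [], threading.Lock()
    frame_threads = [
        threading.Thread(target=encode_frame, args=(f, num_tiles, log, lock))
        for f in frame_ids
    ]
    for t in frame_threads:
        t.start()
    for t in frame_threads:
        t.join()
    return sorted(log)


# Fixed four tiles per frame: one frame gives 4 workers, four frames give 16.
print(len(encode_level([8], num_tiles=4)))
print(len(encode_level([1, 3, 5, 7], num_tiles=4)))
```

With a fixed tile count, the number of tile workers ranges from 4 to 16 across GOP levels in this example, which is the imbalance described above.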
2. Combined Approach 2: Adaptive number of tiles in a frame
To address the problem described in the previous section, we propose to adapt the number of tiles to the number of frames encoded in parallel. Fig. 5 illustrates the proposed method, taking as an example four tiles per frame when frames are encoded serially. With this second approach, we expect less loss in coding efficiency and well-balanced CPU usage.
Fig. 5. Combined Approach 2: Tile-level parallelism combined with the frame-level parallelism of GOP level 3&4. The number of tiles is adjusted according to the number of frames encoded in parallel. In this example, the initial number of tiles in a frame is four
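One possible way to adapt the tile count is sketched below. The exact rule is an assumption made for illustration: it divides the initial tile count by the number of frames encoded in parallel so that the total number of tile workers stays roughly constant, and it ignores the number of available cores, which the proposed method also takes into account. The names `adaptive_tile_count` and `workers_per_level` are hypothetical.

```python
def adaptive_tile_count(initial_tiles, frames_in_parallel):
    """Tiles per frame so that frames_in_parallel * tiles stays near initial_tiles.

    Assumed rule for illustration only; at least one tile per frame is kept.
    """
    return max(1, initial_tiles // frames_in_parallel)


def workers_per_level(initial_tiles, frames_per_level, adaptive):
    """Total tile workers at each GOP level for the fixed or adaptive scheme."""
    result = {}
    for level, frames in frames_per_level.items():
        tiles = adaptive_tile_count(initial_tiles, frames) if adaptive else initial_tiles
        result[level] = frames * tiles
    return result


# Frames encoded in parallel at GOP levels 1 to 4 (frame-level parallelism of level 3&4).
frames_per_level = {1: 1, 2: 1, 3: 2, 4: 4}
print(workers_per_level(4, frames_per_level, adaptive=False))  # fixed tile count
print(workers_per_level(4, frames_per_level, adaptive=True))   # adaptive tile count
```

Under this assumed rule, the fixed scheme runs 4, 4, 8, and 16 tile workers at levels 1 to 4, whereas the adaptive scheme keeps 4 workers at every level and uses fewer tiles per frame at the higher levels.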
III. Experimental Results
1. Test sequences and environments
We implemented the parallel encoding approaches described so far on top of the HEVC reference software. Multithreading is implemented using the Windows threads APIs. Two sets of test sequences are used in the experiments. The first set consists of five Full-HD (1920 x 1080) videos of 100 frames each (Kimono, ParkScene, Cactus, BasketballDrive, and BQTerrace), taken from the HEVC test sequences. The second set consists of five 4K UHD (3840 x 2160) videos of 100 frames each (Jockey, YachtRide, ReadySteadyGo, ShakeNDry, and HoneyBee), taken from the Kvazaar encoder [11,12] test sequences. We used separate platforms according to the resolution of the test sequences. The platform used for the Full-HD test set has one Intel Xeon E5-2690 processor with eight physical cores, whereas the platform for the 4K UHD test set has two Intel Xeon E5-2690 processors with sixteen physical cores in total. For speedup measurement, the sequences encoded serially with one tile per frame are used as the anchor. We select the Main profile and the random access setting of the HEVC common test conditions with two modifications: the GOP structure is changed as shown in Fig. 2 so that the frames in level 3 can be encoded in parallel, and the asymmetric motion partitioning (AMP) coding tool is disabled.
2. Coding efficiency analysis
As described in Section II-1, picture partitioning schemes result in coding loss. However, as shown in Fig. 6, tiles cause less coding loss than slices. As shown in Fig. 7, slices produce longer partition boundaries than tiles, and the coding loss for slices is therefore higher.
Fig. 6. Coding loss comparison for tile and slice encoding (test sequence: BasketballDrive)
Fig. 7. Picture partitioning using slices
When designing an encoder with various forms of parallelism, coding loss and parallel scalability should be carefully considered. The coding efficiency is measured using the Bjøntegaard delta (BD) bitrate as described in . To measure the BD bitrate (BDBR), the test sequences are also encoded with no parallelization and no picture partitioning to serve as the reference. Tables 1 and 2 show the coding losses incurred by encoding with tile-level parallelism. These coding losses occur mainly at tile boundaries, where the coding information of neighboring blocks cannot be used. The coding loss caused by tile partitioning increases when the global motion in the sequence is large, as in Jockey. Since a racehorse runs very fast in Jockey, coding efficiency drops significantly at tile boundaries. Note that for the Jockey sequence, our proposed Combined Approach 2 decreases the coding loss significantly compared to the fixed tile partitioning method. Tables 3 and 4 show the coding losses when we apply adaptive picture partitioning according to the GOP level, as described in Fig. 5. The results show that Combined Approach 2 produces less coding loss than tile-level parallelism.
Table 1. FHD: Tile-level parallelism coding efficiency (Y-BDBR)
Table 2. 4K: Tile-level parallelism coding efficiency (Y-BDBR)
Table 3. FHD: Combined Approach 2 parallelism coding efficiency (Y-BDBR)
Table 4. 4K: Combined Approach 2 parallelism coding efficiency (Y-BDBR)
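For readers unfamiliar with the BD bitrate, the following is a minimal sketch of the usual Bjøntegaard calculation: a cubic polynomial is fitted to the logarithm of the bitrate as a function of PSNR for each curve, and the average difference between the two fits over the common PSNR range is converted to a percentage. It assumes exactly four rate-distortion points per curve, so the cubic passes through all of them. The rate-distortion points in the usage lines are hypothetical placeholders, not measurements from our experiments, and all function names are illustrative.

```python
import math


def _solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def _cubic_through(xs, ys):
    """Coefficients c0..c3 of the cubic passing through four (x, y) points."""
    return _solve([[x ** k for k in range(4)] for x in xs], ys)


def _integral(coeffs, lo, hi):
    """Definite integral over [lo, hi] of the polynomial with the given coefficients."""
    def antiderivative(x):
        return sum(c * x ** (k + 1) / (k + 1) for k, c in enumerate(coeffs))
    return antiderivative(hi) - antiderivative(lo)


def bd_rate(anchor, test):
    """BD bitrate in percent; anchor and test are lists of four (kbps, psnr_db) points."""
    shift = sum(p for _, p in anchor) / len(anchor)  # center PSNR for numerical conditioning
    fits = []
    for curve in (anchor, test):
        psnr = [p - shift for _, p in curve]
        log_rate = [math.log10(r) for r, _ in curve]
        fits.append((_cubic_through(psnr, log_rate), min(psnr), max(psnr)))
    lo = max(f[1] for f in fits)   # common PSNR range of both curves
    hi = min(f[2] for f in fits)
    avg_diff = (_integral(fits[1][0], lo, hi) - _integral(fits[0][0], lo, hi)) / (hi - lo)
    return (10 ** avg_diff - 1) * 100


# Hypothetical RD points (kbps, dB): the test curve needs 10% more bits at each PSNR.
anchor_points = [(1000.0, 32.0), (2000.0, 35.0), (4000.0, 38.0), (8000.0, 41.0)]
test_points = [(rate * 1.1, psnr) for rate, psnr in anchor_points]
print(bd_rate(anchor_points, test_points))
```

A positive BD bitrate means the tested configuration needs more bits than the reference for the same quality, which is how the coding loss in the tables above should be read.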
3. Parallel scalability analysis
Tables 5 to 10 show the speedup achieved by tile-level parallelism, Combined Approach 1, and Combined Approach 2, respectively. The results in Tables 7 to 10 show that the combined parallelism improves parallel scalability significantly.
Table 5. FHD: Tile-level parallelism scalability (Speedup)
Table 6. 4K: Tile-level parallelism scalability (Speedup)
Table 7. FHD: Combined Approach 1 parallelism scalability (Speedup)
Table 8. 4K: Combined Approach 1 parallelism scalability (Speedup)
Table 9. FHD: Combined Approach 2 parallelism scalability (Speedup)
Table 10. 4K: Combined Approach 2 parallelism scalability (Speedup)
When designing parallel encoding, we carefully consider the trade-off between speedup and coding efficiency. From our experimental results, we aggregate the speedup against the coding loss for each parallelization scheme, as shown in Fig. 8. Both Combined Approach 1 and Combined Approach 2 show better parallel scalability than tile-level parallelism. Combined Approach 2 shows better speedup for a given coding loss on both the Full-HD and 4K test sequences. In addition, as shown in Fig. 9, the comparison of speedup between Combined Approach 2 and tile-level parallelism shows a similar pattern regardless of the number of physical cores. However, the speedup scalability differs with the number of available cores: it decreases as the number of cores increases. This decrease is caused by parallelization inefficiencies such as frame synchronization.
Fig. 8. Speedup against BD bitrate
Fig. 9. Speedup against the number of threads and the number of cores
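The relation between speedup and core count discussed above can be made concrete with a small sketch. Speedup is taken as the anchor encoding time divided by the parallel encoding time, and parallel efficiency as the speedup divided by the number of cores; the efficiency measure and the function names are added here for illustration, and the timings are hypothetical placeholders rather than our measured results.

```python
def speedup(anchor_seconds, parallel_seconds):
    """Speedup of a parallel run relative to the serial one-tile-per-frame anchor."""
    return anchor_seconds / parallel_seconds


def parallel_efficiency(anchor_seconds, parallel_seconds, num_cores):
    """Fraction of the ideal linear speedup that is actually achieved."""
    return speedup(anchor_seconds, parallel_seconds) / num_cores


# Hypothetical timings in seconds (not measured data): cores -> encoding time.
anchor_time = 800.0
runs = {4: 250.0, 8: 160.0}
for cores, seconds in runs.items():
    print(cores, speedup(anchor_time, seconds),
          parallel_efficiency(anchor_time, seconds, cores))
```

In this made-up example the speedup still grows from 4 to 8 cores while the efficiency drops, which is the kind of pattern that synchronization overhead between frames would produce.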
In this paper, we present two effective approaches for combining tile-level parallelism and frame-level parallelism. Both approaches provide better parallel scalability than tile-level parallelism alone. The experimental results show that, when combining the two, adapting the number of tiles to the number of frames encoded in parallel and the number of available cores results in a better trade-off between coding loss and parallel scalability for Full-HD and 4K UHD video sequences.