
CNN-based Visual/Auditory Feature Fusion Method with Frame Selection for Classifying Video Events

  • Choe, Giseok (Department of Computer Science and Engineering, Sogang University) ;
  • Lee, Seungbin (Department of Computer Science and Engineering, Sogang University) ;
  • Nang, Jongho (Department of Computer Science and Engineering, Sogang University)
  • Received : 2018.10.03
  • Accepted : 2019.01.17
  • Published : 2019.03.31

Abstract

In recent years, personal videos have been widely shared online owing to the popularity of portable devices such as smartphones and action cameras. A recent report predicted that 80% of Internet traffic would be video content by 2021. Several studies have been conducted on detecting the main events in videos in order to manage large-scale video collections, and they show fairly good performance in certain genres. However, the methods used in previous studies have difficulty detecting events in personal videos because the characteristics and genres of personal videos vary widely. In this study, we found that adding datasets trained from appropriate perspectives improves performance, and that performance also depends on how keyframes are extracted from a video. Accordingly, we select frame segments that can represent a video, considering the characteristics of personal videos. From each frame segment, object, location, food, and audio features are extracted, and a representative vector is generated through a CNN-based recurrent model and a fusion module. In experiments on the LSVC dataset, the proposed method achieved an mAP of 78.4%.


1. Introduction

 In recent years, personal videos have been shared through YouTube or Flickr owing to the widespread use of smartphones and action cameras. A recent report [1] predicted that 80% of Internet traffic would be video content by 2021. Accordingly, content-sharing companies perform video event detection to manage large-scale video collections and provide services to users. However, classifying video events by having humans watch the videos takes considerable time and human resources. To resolve this problem, computer vision researchers have continuously studied methods for automatically classifying the main events in videos. Personal videos may contain severe noise from camera shake and lighting, depending on the shooter's expertise and the performance of the shooting device, and their durations vary. Videos are also more difficult to handle than single images because of the temporal relations between frames. In recent years, studies using deep neural networks (DNNs), which have played a major role in solving various problems in computer vision, have been conducted to analyze videos using complex features.

 Most previous studies have experimented with short videos and analyzed them using object-oriented features extracted from a convolutional neural network (CNN) trained on ImageNet [3]. However, personal videos come from diverse categories and contain various kinds of information, such as place, food, and audio, that can be used to detect the main events. Therefore, this study samples frame segments from videos of varying durations and extracts object, place, food, and audio features from a variety of viewpoints to analyze the main events. In addition, the main events are detected by encoding these sequential features into fixed-length representations. The LSVC 2017 dataset [2], used in the Large-Scale Video Classification Challenge 2017 and one of the largest video datasets in the world, was used for performance evaluation. The proposed method achieved a 9% improvement over the existing single-feature model, reaching 78.4% in mean average precision (mAP).

2. Related Work

 Various approaches have been studied to solve the video event classification problem, one of the main problems in computer vision. In traditional models, features are extracted through a pre-determined indexing method, a classifier is trained on them, and the final output is returned. In contrast, more recent deep learning models train not only the classifier but also the convolutional filters, which are made suitable for classification via a loss function. Fig. 1 shows the structures of the two models. Section 2.1 explains the traditional models, and Section 2.2 gives a detailed explanation of recent DNN-based studies.

 

Fig. 1. Structures of traditional feature and DNN based models

2.1 Traditional Feature-based Research

 Studies prior to DNNs classified videos using hand-crafted features. In text-based research, a study [4] categorized news videos into politics, society, and entertainment using the subtitles displayed in the videos. A study [5] using audio information classified videos into discourse, news, commercial, and sports, as shown in Fig. 2. Studies [6] using visual information analyzed videos by extracting a variety of low-level features from frames, such as color, texture, shot-boundary transitions, objects, and motion. In particular, motion can degrade event detection performance because camera movement in personal videos introduces noise. In one study [7], motion compensation was applied to suppress this noise; in addition, noise around the persons on whom the event is focused was eliminated, and features such as trajectories and histograms of gradients (HoG) were extracted for better performance. These studies have significantly influenced recent studies on DNN-based video analysis.

 

Fig. 2. An example of video classification using audio information

2.2 DNN-based Research

 Traditional feature-based approaches are difficult to apply to personal videos, whose themes are diverse, because they analyze videos based on pre-defined criteria. To solve this problem, studies using CNNs, which brought a significant improvement in the computer vision field, have been conducted. In one study [8], the researchers proposed a method of fusing features between frames after selecting input frames in various ways, as shown in Fig. 3, to handle temporal dependency with a CNN. Although its results were not better than those of [7], it is regarded as a representative study using a CNN and became the reference for subsequent CNN-based research.

 Subsequent studies have taken temporal characteristics into consideration to overcome the difficulty of analyzing sequential frames with a CNN. Fig. 4 shows video analysis methods using neural networks of various structures. These studies have in common that object-oriented features are extracted from each frame by a CNN trained on ImageNet; however, they handle the continuity of the features in different ways. Fig. 4(a) shows a video analysis method [9] using long short-term memory (LSTM), a recurrent model: the feature information of the video is compressed from the features of adjacent frames through the LSTM structure, and the video is then classified. Although its performance was better than that of [8], it was still worse than that of [7]. Fig. 4(b) shows the use of three-dimensional (3D) convolution [11] to overcome a drawback of two-dimensional (2D) convolution, which captures the locality of a single frame well but has difficulty analyzing the continuous structure of video frames. This approach performed better than [9] and [7] because it considers both spatial information and temporal dependency, and its performance improved further when its features were fused with those of [7]. However, training such a complex structure requires a large amount of resources. A study [10] using the structure in Fig. 4(c) outperformed (a) and (b), which classify on the basis of frames only, by adding motion information such as optical flow; however, computing the optical flow also requires a large amount of resources.

 

Fig. 3. An example of the fusing frame method used in [8]

 

Fig. 4. Structures of the DNN-based video analysis method

 Table 1 compares the performances of the studies described in Sections 2.1 and 2.2. The dataset used for training and validation was UCF-101 [12]; a detailed description of the dataset is given in the next section.

Table 1. Previous works for video event classification

 

3. Dataset

 In this study, performance was verified using the LSVC 2017 dataset, which was used in the Large-Scale Video Classification Challenge. UCF-101 has mainly been used in video event classification studies. As presented in Table 2, UCF-101 consists of 13,000 videos covering 101 everyday events, such as "Apply Eye Makeup," "Apply Lipstick," and "Playing Cello." Its average video length is seven seconds, and the videos are trimmed to show only the main event. In contrast, LSVC 2017 [2] consists of 155,000 videos covering 500 events. Its average length is 186 seconds, much longer than that of UCF-101, and the videos are untrimmed originals without editing. It is therefore a more difficult benchmark than UCF-101.

Table 2. Comparison of UCF-101 and LSVC 2017 datasets

 

 The frames shown in Fig. 5 are taken from the LSVC dataset. In Fig. 5, (a) and (b), and likewise (c) and (d), are visually similar frames that nevertheless belong to different events, whereas (b) and (d) belong to the same event but look significantly different. For this type of problem, datasets trained from particular perspectives can be helpful, and the datasets in Table 3 were added for this reason.

Table 3. Added dataset for various perspectives

 

 

Fig. 5. Examples of visually similar frames

4. Fused Feature-based Event Detection System

 Although features can be extracted from videos from various perspectives, existing studies extract features with a CNN trained on object-oriented ImageNet data. Visual information is the most important of the many modalities present in a video. However, because videos contain significant noise from external factors such as the shooting environment, the performance of the shooting device, and the shooter's skill, it is difficult to classify events using only low-level features. In addition, the main event cannot be represented by information from a single point in time. Considering these points, this study designed the structure shown in Fig. 7.

 

Fig. 6. A method for data augmentation

 

Fig. 7. Overview of the proposed model
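 The sketch below is a minimal PyTorch illustration of the overall design in Fig. 7, not the authors' released implementation: per-segment multi-modal features are concatenated (the fusion module), encoded by a recurrent model, and mapped to event scores. The feature dimensions (1,024-d object, 2,048-d place, 2,048-d food, 128-d audio), the hidden size, and the multi-label sigmoid output are assumptions made for illustration.

```python
# Illustrative reconstruction of the Fig. 7 pipeline (assumed dimensions, not the official code).
import torch
import torch.nn as nn

class FusedEventClassifier(nn.Module):
    def __init__(self, feat_dims=(1024, 2048, 2048, 128), hidden=1024, num_events=500):
        super().__init__()
        in_dim = sum(feat_dims)                    # per-segment object/place/food/audio features, concatenated
        self.encoder = nn.LSTM(in_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_events)

    def forward(self, segment_feats):              # (batch, num_segments, sum(feat_dims))
        _, (h_n, _) = self.encoder(segment_feats)  # encode the segment sequence into a fixed-length vector
        return torch.sigmoid(self.classifier(h_n[-1]))   # per-event scores (multi-label assumption)

# Example: a batch of 2 videos, each represented by 8 frame segments
model = FusedEventClassifier()
scores = model(torch.randn(2, 8, 1024 + 2048 + 2048 + 128))
print(scores.shape)    # torch.Size([2, 500])
```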

4.1 Frame Segment Encoder

 A large amount of training data is needed to achieve high performance with a DNN while preventing overfitting. However, collecting data for many everyday problem domains is difficult, and obtaining ground truth is even more so. To overcome this difficulty, data augmentation is applied to increase the amount of training data by modifying the existing data in various ways. Because it introduces varied changes to the input data and effectively prevents overfitting, improving generalization, it has been widely used as a pre-processing step in CNN studies. For images, horizontal flips, random crops, and added noise can be used to increase the number of limited data samples. This study applies the augmentation method shown in Fig. 6 to videos. A video of T seconds is sampled into one frame per second, and each frame is cropped to a fixed size at the four corners and the center; together with their horizontal mirrors, this enlarges the original data tenfold, and the resulting feature values are averaged to reduce the effect of noise. Fig. 8 shows the algorithm of the frame augmentation explained above.

 

Fig. 8. Algorithm of frame augmentation
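 As a concrete illustration, the sketch below generates the five crops and their horizontal mirrors described above and averages the features over the ten views. It is a hedged reconstruction consistent with that description, not necessarily identical to the algorithm in Fig. 8; frames are assumed to be NumPy arrays, and extract_fn is a caller-supplied feature extractor not specified in the paper.

```python
import numpy as np

def ten_crop_views(frame, crop_h, crop_w):
    """Return 10 views of a frame: four corner crops + center crop, plus their horizontal flips."""
    H, W, _ = frame.shape
    tops = [0, 0, H - crop_h, H - crop_h, (H - crop_h) // 2]
    lefts = [0, W - crop_w, 0, W - crop_w, (W - crop_w) // 2]
    crops = [frame[t:t + crop_h, l:l + crop_w] for t, l in zip(tops, lefts)]
    crops += [np.fliplr(c) for c in crops]          # horizontal symmetry doubles the views
    return crops                                    # 10 views in total

def augmented_feature(frame, extract_fn, crop_h=224, crop_w=224):
    """Average CNN features over the 10 views to reduce the effect of noise."""
    views = ten_crop_views(frame, crop_h, crop_w)
    return np.mean([extract_fn(v) for v in views], axis=0)
```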

 Existing studies used all frames of a video because the UCF videos are relatively short. However, the average duration of LSVC videos is 186 seconds, and the videos are untrimmed, so the information in one part of a video's frames may not convey the main event.

 For example, the frames in the first half of a video may contain features that cannot correctly explain the main event, as shown in Fig. 9, while the main event appears in the second-half frames. In a "cooking" event, the first-half frames show ingredients or the cooking process, and the second-half frames show which dish is being cooked. In a "Marriage proposal" video, the first half shows frames resembling "tailgate party," "music concert," or "recital," while the actual proposal appears in the second half. Therefore, a frame segment extraction method was used in which the entire video is divided into fixed-length segments, as shown in Fig. 10, and a frame is sampled at random from each segment.

 

Fig. 9. Example of key frames shown in the latter part of a video

 

Fig. 10. A method of extracting a segment in a video
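 A minimal sketch of this segment sampling is shown below, assuming the video has already been decoded into one frame per second as described earlier; the segment count is an illustrative parameter, not a value fixed by the paper.

```python
import random

def sample_segment_frames(num_frames, num_segments):
    """Split the 1-fps frame indices into equal-length segments and draw one random index per segment."""
    bounds = [round(i * num_frames / num_segments) for i in range(num_segments + 1)]
    return [random.randrange(start, end) if end > start else min(start, num_frames - 1)
            for start, end in zip(bounds[:-1], bounds[1:])]

# e.g. a 186-second video sampled at one frame per second, divided into 8 segments
print(sample_segment_frames(186, 8))
```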

 In contrast with existing studies, features were extracted on the basis of objects, places, food, and audio to utilize the various kinds of information displayed in a video.

4.2 Considering Various Perspectives

 Details of the CNNs used for fused feature extraction in this study are presented in Table 4. Various CNN models were employed, and the perspective of feature extraction was varied through the data they were trained on. For the feature information, the outputs of the last layer before classification were used, and the length of the resulting vector differs by model. BN-Inception [15, 16] and VGG19 [17] trained on ImageNet were used to extract object-based features. To obtain distinguishing representations of places and food, transfer learning was performed by fine-tuning ResNet152 [18], pre-trained on ImageNet, on the Places365 [13] and Food-101 [14] datasets. In addition, videos contain not only various visual features but also audio features. To use audio information for classifying the main events, VGGish [19], which has recently shown superior performance in CNN-based audio feature extraction, was applied after converting the audio signal into a spectrogram.

Table 4. Details of multi-modality feature extraction

 
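 The sketch below illustrates penultimate-layer feature extraction with torchvision as an example. The paper's actual backbones (BN-Inception, VGG19, Places365/Food-101 fine-tuned ResNet152, and VGGish) require their own pretrained weights, which are not reproduced here, so the ImageNet ResNet152 serves only as a stand-in, and the preprocessing shown is the standard ImageNet recipe rather than one specified by the paper.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Standard ImageNet preprocessing (assumed; the paper does not list its exact preprocessing).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def penultimate_extractor(cnn):
    """Remove the final classification layer so the network outputs feature vectors."""
    return nn.Sequential(*list(cnn.children())[:-1], nn.Flatten())

# Stand-in backbone: ImageNet ResNet152. Place/food extractors would use the same trick
# after fine-tuning on Places365 / Food-101; audio would use VGGish on log-mel spectrograms.
object_cnn = penultimate_extractor(models.resnet152(weights="IMAGENET1K_V1")).eval()

@torch.no_grad()
def extract_frame_feature(pil_image):
    x = preprocess(pil_image).unsqueeze(0)   # (1, 3, 224, 224)
    return object_cnn(x).squeeze(0)          # 2048-dimensional frame feature
```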

5. Experiment and analysis

 The purpose of this study is to improve on existing models by extracting object, place, food, and audio CNN representations from one-second frames and designing a DNN model that takes temporal dependency into account to classify the main events of personal videos collected from YouTube or Flickr. The first experiments detect events by looking at only the first 300 frames at most, without separate segment sampling.

 The performance of the basic models using only object-oriented visual information is presented in Table 5. BN-Inception performed better than VGG19: although VGG19 uses 4,096-dimensional features, it performed worse than the 1,024-dimensional BN-Inception features. This indicates that the performance difference is determined by the structure of the neural network and the layer from which the visual features are extracted. Furthermore, NetVLAD [20] performed better than LSTM, indicating that different representations can be produced depending on the structure of the aggregation model.

Table 5. Performance comparison using object-oriented features

 
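 For reference, the sketch below gives a simplified NetVLAD-style aggregation layer of the kind compared against LSTM in Table 5. The cluster count and feature dimension are illustrative, and the original NetVLAD [20] additionally initializes its centroids from k-means, which is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleNetVLAD(nn.Module):
    """Aggregates a variable-length sequence of frame features into one fixed-length descriptor."""
    def __init__(self, dim=1024, num_clusters=64):
        super().__init__()
        self.assign = nn.Linear(dim, num_clusters)         # soft-assignment of each frame to clusters
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x):                                  # x: (batch, T, dim)
        a = torch.softmax(self.assign(x), dim=-1)          # (batch, T, K)
        residuals = x.unsqueeze(2) - self.centroids        # (batch, T, K, dim)
        vlad = (a.unsqueeze(-1) * residuals).sum(dim=1)    # (batch, K, dim) weighted residual sums
        vlad = F.normalize(vlad, dim=-1)                   # intra-cluster normalization
        return F.normalize(vlad.flatten(1), dim=-1)        # (batch, K * dim) final descriptor

pooled = SimpleNetVLAD()(torch.randn(2, 300, 1024))        # e.g. 300 one-second frame features
print(pooled.shape)                                        # torch.Size([2, 65536])
```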

 Table 6 presents the classification performance when a single feature, object, location, food, or audio, is used. Among single features, the object-based visual feature performed best, while location- and food-based features showed lower classification performance. Object-based features performed best because they capture distinguishing characteristics of objects that appear across most events, whereas location and food features are informative only for specific events. When only audio information was used, performance was easily affected by noise; classifying events from audio alone is not easy even for humans, and the results were indeed much worse than those using visual information.

Table 6. Performance comparison using features from various perspectives

 

 The experimental results of the multi-modality features proposed in this study are presented in Table 7. Model 1 uses only object-based visual features. Models 2 and 3 improve performance by adding food or location features to the single feature. As shown by Models 3, 4, and 5, performance improves whenever a feature is added; when all four features are used, performance improves by 4% compared with the object-based single feature.

Table 7. Performance of the proposed model using multi-modality features

 

 However, when the main event of a video is not contained in the first 300 frames, the event may be detected inaccurately due to the lack of information. Thus, features were also extracted by randomly sampling partial sections throughout the video. Table 8 presents the performance before and after sampling. Model 6 shows the performance when events are detected using segment sampling together with the multi-modality features of Model 5. The proposed segment sampling yielded a 2 ~ 3% performance improvement regardless of the method.

 The results show that the performance obtained with only object-based features (0.723) improved to 0.784 through multi-modality features and segment sampling.

Table 8. Performance comparison using the proposed segment sampling

 
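 For completeness, the mAP values reported above can be reproduced from per-class average precision as in the sketch below; scikit-learn is an assumed dependency, as the paper does not specify its evaluation code.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_score):
    """y_true: (num_videos, num_classes) binary labels; y_score: predicted scores per class."""
    per_class_ap = [average_precision_score(y_true[:, c], y_score[:, c])
                    for c in range(y_true.shape[1])
                    if y_true[:, c].any()]        # skip classes with no positive examples
    return float(np.mean(per_class_ap))

# toy example with two classes
labels = np.array([[1, 0], [0, 1], [1, 0]])
scores = np.array([[0.9, 0.2], [0.3, 0.8], [0.6, 0.4]])
print(mean_average_precision(labels, scores))
```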

6. Conclusion and future research

 This study addressed the difficulty of classifying main events by taking into consideration the characteristics of personal videos shot in various domains. The performance of a DNN can vary depending on the training data, parameters, network structure, and feature extraction layer. Nevertheless, performance was improved simply by concatenating features extracted from different perspectives and by applying segment sampling when a video is long.

 We found that adding datasets trained from appropriate perspectives improves performance, and that performance also depends on how keyframes are extracted from a video. Therefore, in future research, optical flow will be added to increase the diversity of perspectives, and keyframe extraction will be considered according to video characteristics.

References

  1. CISCO, "Cisco Visual Networking Index: Forecast and Methodology," Feb 15, 2018; https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/complete-white-paper-c11-481360.html.
  2. Z. Wu, Y. G. Jiang, L. S. Davis, and S.-F. Chang, "LSVC2017: Large-Scale Video Classification Challenge," 2017.
  3. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A Large-scale Hierarchical Image Database," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248-255.
  4. W. Zhu, C. Toklu, and S.-P. Liou, "Automatic News Video Segmentation and Categorization Based on Closed-captioned Text," Urbana, vol. 51, pp. 61801, 2001.
  5. Z. Liu, Y. Wang, and T. Chen, "Audio Feature Extraction and Analysis for Scene Segmentation and Classification," Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, vol. 20, no. 1-2, pp. 61-79, 1998. https://doi.org/10.1023/A:1008066223044
  6. B. T. Truong, and C. Dorai, "Automatic Genre Identification for Content-based Video Categorization," in Proceedings of International Conference on Pattern Recognition, pp. 230-233, 2000.
  7. H. Wang, and C. Schmid, "Action Recognition with Improved Trajectories," in Proc. of IEEE International Conference on Computer Vision, pp. 3551-3558, 2013.
  8. A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale Video Classification with Convolutional Neural Networks," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725-1732, 2014.
  9. J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term Recurrent Convolutional Networks for Visual Recognition and Description," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625-2634, 2015.
  10. L. Wang, Y. Qiao, and X. Tang, "Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4305-4314, 2015.
  11. D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning Spatiotemporal Features with 3d Convolutional Networks," in Proc. of the IEEE International Conference on Computer Vision, pp. 4489-4497, 2015.
  12. K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A Dataset of 101 Human Actions Classes from Videos in the Wild," arXiv preprint arXiv:1212.0402, 2012.
  13. B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, "Places: A 10 Million Image Database for Scene Recognition," IEEE transactions on pattern analysis and machine intelligence, 2017.
  14. L. Bossard, M. Guillaumin, and L. Van Gool, "Food-101-mining Discriminative Components with Random Forests," in European Conference on Computer Vision, pp. 446-461, 2014.
  15. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going Deeper with Convolutions," in Proc. of International Conference on Computer Vision and Pattern Recognition, 2015.
  16. S. Ioffe, and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," arXiv preprint arXiv:1502.03167, 2015.
  17. K. Simonyan, and A. Zisserman, "Very Deep Convolutional Networks for Large-scale Image Recognition," arXiv preprint arXiv:1409.1556, 2014.
  18. K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
  19. S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, and B. Seybold, "CNN Architectures for Large-scale Audio Classification," in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 131-135, 2017.
  20. R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, "NetVLAD: CNN Architecture for Weakly Supervised Place Recognition," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5297-5307, 2016.
