Search | Korea Science

Improving Transformer with Dynamic Convolution and Shortcut for Video-Text Retrieval

Liu, Zhi;Cai, Jincen;Zhang, Mengmeng
- KSII Transactions on Internet and Information Systems (TIIS), v.16 no.7, pp.2407-2424, 2022
Recently, the Transformer has made great progress in video retrieval tasks due to its high representation capability. In a Transformer, the cascaded self-attention modules are capable of capturing long-distance feature dependencies. However, local feature details are likely to deteriorate. In addition, increasing the depth of the structure is likely to produce learning bias in the learned features. In this paper, an improved Transformer structure named TransDCS (Transformer with Dynamic Convolution and Shortcut) is proposed. A Multi-head Conv-Self-Attention module is introduced to model local dependencies and improve the efficiency of local feature extraction. Meanwhile, an augmented shortcuts module based on a dual identity matrix is applied to enhance the conduction of input features and mitigate the learning bias. The proposed model is tested on the MSRVTT, LSMDC and Activity-Net benchmarks, and it surpasses all previous solutions for the video-text retrieval task. For example, on the LSMDC benchmark, a gain of about 2.3% MdR and 6.1% MnR is obtained over recently proposed multimodal-based methods.
https://doi.org/10.3837/tiis.2022.07.016

A Video Stream Retrieval System based on Trend Vectors (경향 벡터 기반 비디오 스트림 검색 시스템)

Lee, Seok-Lyong;Chun, Seok-Ju
- Journal of Korea Multimedia Society, v.10 no.8, pp.1017-1028, 2007
In this paper, we propose an effective method to represent, store, and retrieve video streams efficiently from a video database. We extract features from each video frame, normalize the feature values, and represent them as values in the range [0,1]. In this way, a video frame with f features can be represented by a point in the f-dimensional space $[0,1]^f$, and thus the video stream is represented by a trail of points in the multidimensional space. The video stream is partitioned into video segments based on camera shots, each of which is represented by a trend vector that encapsulates the moving trend of the points in a segment. A video stream query is processed by comparing those trend vectors. We examine our method using a collection of video streams composed of sports, news, documentary, and educational videos. Experimental results show that our trend vector representation reduces the reconstruction error remarkably (by 37% on average) and that retrieval using trend vectors achieves higher precision (2.1 times on average) while maintaining a response time and recall rate similar to those of existing methods.
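The normalization step in the abstract above can be sketched as follows. This is only an illustration: the abstract does not say which normalization the authors use, so per-feature min-max scaling is an assumption, and the function name `normalize_features` and the sample values are hypothetical.

```python
def normalize_features(frames):
    """Min-max scale each feature column into [0, 1] (an assumed scheme).

    `frames` is a list of per-frame feature lists, all of length f.
    The result is the "trail" of points in [0, 1]^f, one point per frame.
    """
    if not frames:
        return []
    columns = list(zip(*frames))          # one tuple per feature
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    trail = []
    for frame in frames:
        point = []
        for value, low, high in zip(frame, lows, highs):
            span = high - low
            # A constant feature carries no information; map it to 0.0.
            point.append((value - low) / span if span else 0.0)
        trail.append(point)
    return trail


# Three frames with f = 2 hypothetical features each.
frames = [[10.0, 200.0], [20.0, 100.0], [30.0, 150.0]]
trail = normalize_features(frames)
```

Each inner list of `trail` is one point of the frame trail; how the trend vector of a segment is then derived from these points is not described in the abstract and is not attempted here.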

Representation and Detection of Video Shot's Features for Emotional Events (감정에 관련된 비디오 셧의 특징 표현 및 검출)

Kang, Hang-Bong;Park, Hyun-Jae
- The KIPS Transactions:PartB, v.11B no.1, pp.53-62, 2004
The processing of emotional information is very important in Human-Computer Interaction (HCI). In particular, it is very important in video information processing to deal with a user's affect. To handle emotional information, it is necessary to represent meaningful features and detect them efficiently. Even though it is not an easy task to detect emotional events from low-level features such as colour and motion, it is possible to detect them using statistical analysis such as Linear Discriminant Analysis (LDA). In this paper, we propose a representation scheme for emotion-related features and a detection method. We experiment with features extracted from video to detect emotional events and obtain desirable results.
https://doi.org/10.3745/KIPSTB.2004.11B.1.053

Video Classification System Based on Similarity Representation Among Sequential Data (순차 데이터간의 유사도 표현에 의한 동영상 분류)

Lee, Hosuk;Yang, Jihoon
- KIPS Transactions on Computer and Communication Systems, v.7 no.1, pp.1-8, 2018
It is not easy to learn simple representations of moving picture data, since such data contain noise and a large amount of information in addition to time-based information. In this study, we propose a similarity representation method between sequential data, together with a deep learning method, which can express such video data abstractly and more simply. The aim is to learn a function that retains maximum information when interpreting the degree of similarity between the image data vectors constituting a moving picture. Experiments on actual data confirm that the proposed method shows better classification performance than existing moving image classification methods.
https://doi.org/10.3745/KTCCS.2018.7.1.1

A Study for Improved Human Action Recognition using Multi-classifiers (비디오 행동 인식을 위하여 다중 판별 결과 융합을 통한 성능 개선에 관한 연구)

Kim, Semin;Ro, Yong Man
- Journal of Broadcast Engineering, v.19 no.2, pp.166-173, 2014
Recently, human action recognition has been developed for various broadcasting and video processing applications. Since a video can consist of various scenes, keypoint approaches have attracted more attention than template-based methods for real applications. Keypoint approaches find regions with motion in a video and build 3-dimensional patches from them. Histogram-based descriptors are then computed from the patches, and a classifier based on a machine learning method is applied to detect actions in the video. However, it is difficult for a single classifier to handle various human actions. To address this problem, approaches using multiple classifiers have been used to detect and recognize objects. We therefore propose a new human action recognition method using decision-level fusion of a support vector machine and sparse representation. The proposed method extracts keypoint-based descriptors from a video and obtains a human action recognition result from each classifier. Weights acquired in the training stage are then applied to fuse the results of the two classifiers. The experimental results in this paper show better performance than a previous fusion method.
https://doi.org/10.5909/JBE.2014.19.2.166
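Decision-level fusion of two classifiers, as mentioned in the abstract above, might look like the sketch below. The abstract only states that trained weights are used to fuse the two results; the weighted-sum rule, the function name `fuse_scores`, the weights, and the per-class scores are all assumptions made for illustration.

```python
def fuse_scores(svm_scores, src_scores, w_svm, w_src):
    """Fuse per-class scores of two classifiers by a weighted sum.

    `svm_scores` and `src_scores` map each action label to a score from
    the SVM and the sparse-representation classifier respectively.
    The weighted sum is an assumed fusion rule, not the paper's.
    """
    fused = {
        label: w_svm * svm_scores[label] + w_src * src_scores[label]
        for label in svm_scores
    }
    # The predicted action is the label with the highest fused score.
    best_label = max(fused, key=fused.get)
    return best_label, fused


# Hypothetical scores for two action classes and equal weights.
svm_scores = {"walk": 0.6, "run": 0.4}
src_scores = {"walk": 0.2, "run": 0.8}
label, fused = fuse_scores(svm_scores, src_scores, w_svm=0.5, w_src=0.5)
```

Here the two classifiers disagree, and the fused scores decide between them; in practice the weights would come from the training stage rather than being fixed by hand.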

Video augmentation technique for human action recognition using genetic algorithm

Nida, Nudrat;Yousaf, Muhammad Haroon;Irtaza, Aun;Velastin, Sergio A.
- ETRI Journal, v.44 no.2, pp.327-338, 2022
Classification models for human action recognition require robust features and large training sets for good generalization. Data augmentation methods are therefore employed on imbalanced training sets to achieve higher accuracy. However, samples generated by data augmentation only reflect existing samples within the training set; their feature representations are less diverse and hence contribute to less precise classification. This paper presents new data augmentation and action representation approaches to grow training sets. The proposed approach is based on two fundamental concepts: virtual video generation for augmentation and representation of the action videos through robust features. Virtual videos are generated from the motion history templates of action videos, which are convolved using a convolutional neural network to generate deep features. Furthermore, guided by the objective function of a genetic algorithm, the spatiotemporal features of different samples are combined to generate the representations of the virtual videos, which are then classified by an extreme learning machine classifier on the MuHAVi-Uncut, iXMAS, and IAVID-1 datasets.
https://doi.org/10.4218/etrij.2019-0510

Object Motion Analysis and Interpretation in Video

Song, Dan;Cho, Mi-Young;Kim, Pan-Koo
- Proceedings of the Korean Information Science Society Conference, 2004.10b, pp.694-696, 2004
As video capabilities have become more sophisticated, object motion analysis and interpretation have become fundamental tasks for computer vision understanding. For that understanding, we first apply a scene-based sum of absolute differences algorithm to motion detection. We then focus on representing the moving objects in the scene using spatio-temporal relations. The video can be explained comprehensively from two aspects: moving object relations and video event intervals.
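A sum of absolute differences (SAD) check between two frames, of the kind the abstract above refers to, can be sketched as below. The abstract gives no details of the authors' procedure, so the frame layout (lists of grayscale rows), the threshold, and the function names are hypothetical.

```python
def sum_of_absolute_differences(frame_a, frame_b):
    """Return the SAD of two equally sized grayscale frames.

    Each frame is a list of rows, each row a list of pixel intensities.
    """
    return sum(
        abs(a - b)
        for row_a, row_b in zip(frame_a, frame_b)
        for a, b in zip(row_a, row_b)
    )


def has_motion(frame_a, frame_b, threshold):
    """Flag motion when the SAD exceeds an assumed, hand-picked threshold."""
    return sum_of_absolute_differences(frame_a, frame_b) > threshold


# Two hypothetical 2x2 frames; two pixels change between them.
previous = [[10, 10], [10, 10]]
current = [[10, 12], [40, 10]]
sad = sum_of_absolute_differences(previous, current)
moving = has_motion(previous, current, threshold=20)
```

A small SAD suggests a static scene, while a large one suggests motion or a scene change; the threshold separating the two would need to be tuned for real footage.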

Semantic Representation of Moving Object in Video Data Using Motion Ontology (Motion Ontology를 이용한 비디오내 객체 움직임의 의미표현)

Shin, Ju-Hyun;Kim, Pan-Koo
- Journal of Korea Multimedia Society, v.10 no.1, pp.117-127, 2007
As the value of multimedia data increases, research on semantic recognition and retrieval of multimedia information is in strong demand. In this paper, we build a motion ontology and adopt it to represent the meaning of moving objects in video data. Referencing the WordNet structure, we extend its semantic meaning based on a reclassification of motion verbs, which are used to represent the semantic meaning of moving objects. The represented information is recorded in OWL/RDF(S). Because we use ontologies, 'Is-A' and 'Equivalent' reasoning over the data can be expected, and semantic representation of the moving objects is possible through ontology-based video annotation. We tested the accuracy of the system by comparing it with a keyword-based system. As a result, we obtained an improvement of approximately 10% in system performance.

Study on the estimation and representation of disparity map for stereo-based video compression/transmission systems (스테레오 기반 비디오 압축/전송 시스템을 위한 시차영상 추정 및 표현에 관한 연구)

Bak Sungchul;Namkung Jae-Chan
- Journal of Broadcast Engineering, v.10 no.4 s.29, pp.576-586, 2005
This paper presents a new method for estimating and representing a disparity map for stereo-based video communication systems. Several pixel-based and block-based algorithms have been proposed to estimate the disparity map. While pixel-based algorithms can achieve high accuracy in computing the disparity map, they require a lot of bits to represent the disparity information. The bit rate can be reduced by a block-based algorithm at the cost of representation accuracy. In this paper, a block enclosing a distinct edge is divided into two regions, and the disparity of each region is set to that of a neighboring block. The proposed algorithm employs accumulated histograms and a neural network to classify the type of a block. Through several experiments, we show that the proposed algorithm is more effective than conventional algorithms in estimating and representing disparity maps.

A Novel Bit Rate Adaptation using Buffer Size Optimization for Video Streaming

Kang, Young-myoung
- International Journal of Internet, Broadcasting and Communication, v.12 no.4, pp.166-172, 2020
Video streaming applications such as YouTube are among the most popular mobile applications. To adjust the video quality to the available network bandwidth, a streaming server provides multiple representations of a video, each with a bit rate that has a different bandwidth requirement. A streaming client uses an adaptive bit rate (ABR) scheme to select a video representation that the network can support. The download behavior of a video streaming client player is governed by several parameters, such as the maximum buffer size. In particular, the size of the maximum playback buffer in the client player can greatly affect the user experience. To tackle this problem, we propose a maximum buffer size optimization based on the available network bandwidth and buffer status. Our simulation study shows that the proposed buffer size optimization scheme successfully mitigates playback stalls while preserving a streaming video quality similar to that of existing ABR schemes.
https://doi.org/10.7236/IJIBC.2020.12.4.166
