• Title/Summary/Keyword: Scene-context

Search results: 73

Multimodal Context Embedding for Scene Graph Generation

  • Jung, Gayoung; Kim, Incheol
    • Journal of Information Processing Systems / v.16 no.6 / pp.1250-1260 / 2020
  • This study proposes a novel deep neural network model that accurately detects objects and their relationships in an image and represents them as a scene graph. The proposed model uses several multimodal features, including linguistic features and visual context features, to detect objects and relationships accurately. In addition, context features are embedded with graph neural networks so that the context feature vector captures the dependencies between two related objects. The study demonstrates the effectiveness of the model through comparative experiments on the Visual Genome benchmark dataset.
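
The graph-embedding step can be pictured as one round of message passing over the detected objects. Below is a minimal sketch in PyTorch; the layer choices, names, and dimensions are assumptions for illustration, not the authors' architecture.

    # Sketch: context embedding via message passing over object nodes
    # (illustrative only, not the paper's implementation).
    import torch
    import torch.nn as nn

    class ContextEmbedding(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.msg = nn.Linear(2 * dim, dim)  # message from an object pair
            self.upd = nn.GRUCell(dim, dim)     # node-state update

        def forward(self, nodes, edges):
            # nodes: (N, dim) object features; edges: list of (i, j) pairs
            agg = torch.zeros_like(nodes)
            for i, j in edges:
                agg[i] = agg[i] + self.msg(torch.cat([nodes[i], nodes[j]]))
            return self.upd(agg, nodes)         # context-aware node features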

Visual Search Model based on Saliency and Scene-Context in Real-World Images

  • Choi, Yoonhyung; Oh, Hyungseok; Myung, Rohae
    • Journal of Korean Institute of Industrial Engineers / v.41 no.4 / pp.389-395 / 2015
  • According to much research in cognitive science, the impact of scene-context on human visual search in real-world images can be as important as that of saliency. This study therefore proposes an Adaptive Control of Thought-Rational (ACT-R) model of visual search in real-world images based on saliency and scene-context. The model uses ACT-R's utility system to describe the influences of saliency and scene-context. It was validated by comparing model data with eye-tracking data from a simple task in which subjects searched for targets in indoor bedroom images. Results show that the model data fit the eye-tracking data well. In conclusion, the proposed modeling method can provide an accurate model of human performance in visual search tasks in real-world images.
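
The abstract does not spell out the utility equation, but ACT-R's utility system can be caricatured as a noisy weighted sum over candidate fixation locations. The weights and noise level below are assumptions, not the paper's fitted values.

    # Toy utility-based fixation choice in the spirit of ACT-R's
    # utility system; weights and noise level are assumptions.
    import random

    def next_fixation(locations, w_sal=0.5, w_ctx=0.5, noise_sd=0.1):
        # locations: dicts with 'saliency' and 'context' scores in [0, 1]
        def utility(loc):
            return (w_sal * loc["saliency"] + w_ctx * loc["context"]
                    + random.gauss(0.0, noise_sd))
        return max(locations, key=utility)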

Construction Site Scene Understanding: A 2D Image Segmentation and Classification

  • Kim, Hongjo; Park, Sungjae; Ha, Sooji; Kim, Hyoungkwan
    • International Conference on Construction Engineering and Project Management / 2015.10a / pp.333-335 / 2015
  • A computer vision-based scene recognition algorithm is proposed for monitoring construction sites. The system analyzes images acquired from a surveillance camera to separate regions and classify them as building, ground, or hole. The mean shift image segmentation algorithm is tested for separating meaningful regions of construction site images. The system would benefit current monitoring practice in that the information extracted from images captures environmental context.
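
For reference, OpenCV exposes a mean-shift filtering step often used as the first stage of such segmentation; the parameter values and file names below are placeholders, not taken from the paper.

    # Mean-shift color filtering with OpenCV, a common precursor to
    # region segmentation; window radii are illustrative placeholders.
    import cv2

    img = cv2.imread("site.jpg")  # hypothetical site image
    # args: source, spatial window radius (sp), color window radius (sr)
    flat = cv2.pyrMeanShiftFiltering(img, 21, 40)
    cv2.imwrite("site_flattened.jpg", flat)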

Multimodal Attention-Based Fusion Model for Context-Aware Emotion Recognition

  • Vo, Minh-Cong; Lee, Guee-Sang
    • International Journal of Contents / v.18 no.3 / pp.11-20 / 2022
  • Human emotion recognition is an exciting topic that has attracted many researchers for a long time. In recent years, there has been increasing interest in exploiting contextual information for emotion recognition. Previous explorations in psychology show that emotional perception is affected by facial expressions as well as by contextual information from the scene, such as human activities, interactions, and body poses. Those explorations initiated a trend in computer vision of exploring the critical role of context by treating it as an additional modality, alongside facial expressions, for inferring emotion. However, contextual information has not been fully exploited. The scene emotion created by the surrounding environment can shape how people perceive emotion. Moreover, additive fusion is impractical in multimodal training, because the modalities do not contribute equally to the final prediction. The purpose of this paper is to contribute to this growing area of research by exploring the effectiveness of the emotional scene gist in the input image for inferring the emotional state of the primary target. The emotional scene gist includes emotion, emotional feelings, and the actions or events that directly trigger emotional reactions in the input image. We also present an attention-based fusion network that combines multimodal features according to their impact on the target emotional state. We demonstrate the effectiveness of the method through a significant improvement on the EMOTIC dataset.
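
Attention-based fusion of this kind can be sketched as a learned softmax weighting over stacked per-modality features. The module below is an illustration only; the modality set, names, and dimensions are assumptions.

    # Minimal attention-weighted fusion of per-modality features
    # (e.g., face, body, scene); an illustrative sketch.
    import torch
    import torch.nn as nn

    class AttentionFusion(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.score = nn.Linear(dim, 1)  # one attention score per modality

        def forward(self, feats):
            # feats: (batch, n_modalities, dim) stacked modality features
            w = torch.softmax(self.score(feats), dim=1)  # (batch, M, 1)
            return (w * feats).sum(dim=1)                # (batch, dim) fused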

PC networked parallel processing system for figures and letters

  • Kitazawa, M.; Sakai, Y.
    • Institute of Control, Robotics and Systems: Conference Proceedings / 1993.10b / pp.277-282 / 1993
  • In understanding concepts, there are two aspects: image and language. This paper discusses what is fundamental in finding proper relations between objects in a scene so that the meaning of the whole scene is represented properly, through experience in both image and language. It is assumed that one of the objects in a scene has letters as objects inside its contour. Because the present system can deal with both figures and letters in a scene, this assumption makes it easy for the system to infer the context of a scene. Several personal computers on a LAN are used to process items in parallel.
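
As a modern analogue of the networked-PC setup, per-object analysis can be farmed out to parallel workers; the task split below (figures versus letter-bearing objects) is an illustrative assumption.

    # Hypothetical modern analogue: process scene objects in parallel
    # workers, as the paper did across networked PCs.
    from concurrent.futures import ProcessPoolExecutor

    def analyze(obj):
        kind = "letters" if obj.get("has_letters") else "figure"
        return (obj["id"], kind)

    if __name__ == "__main__":
        scene_objects = [{"id": 1, "has_letters": True}, {"id": 2}]
        with ProcessPoolExecutor() as pool:
            print(list(pool.map(analyze, scene_objects)))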

Conversation Context Annotation using Speaker Detection

  • Park, Seung-Bo; Kim, Yoo-Won; Jo, Geun-Sik
    • Journal of Korea Multimedia Society / v.12 no.9 / pp.1252-1261 / 2009
  • One notable challenge in video search and summarization is extracting semantics from video content and annotating it with context. Video semantics or context can be obtained by extracting objects and the contexts between them. However, a method that only extracts objects does not express enough of the semantics of a shot or scene, because it does not describe the relations and interactions between objects. To be more effective, after extracting objects, context such as the relations and interactions between them needs to be extracted from conversation situations. This paper studies how to detect the speaker and how to compose conversational context for annotation. We propose methods in which characters are recognized through face recognition, the speaker is detected through mouth motion, and conversation context is extracted using a rule composed of speaker presence, the number of characters, and subtitle presence; finally, the scene context is converted to an XML file and saved.
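
The rule can be pictured as a simple predicate over the three cues, serialized to XML. Element names and the exact rule below are assumptions, since the abstract does not list them.

    # Toy version of the conversation-context rule; element names and
    # the rule itself are assumptions for illustration.
    import xml.etree.ElementTree as ET

    def annotate(speaker_present, n_characters, subtitles_present):
        talking = speaker_present and n_characters >= 2 and subtitles_present
        scene = ET.Element("scene")
        ET.SubElement(scene, "context").text = (
            "conversation" if talking else "non-conversation")
        ET.SubElement(scene, "characters").text = str(n_characters)
        return ET.tostring(scene, encoding="unicode")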

Effects of Object-Background Contextual Consistency on the Allocation of Attention and Memory of the Object

  • Lee, YoonKyoung; Kim, Bia
    • Korean Journal of Cognitive Science / v.24 no.2 / pp.133-171 / 2013
  • The gist of a scene can be identified in less than 100 ms, and a violation of the gist can influence how attention is allocated to the parts of a scene. In other words, people tend to allocate more attention to objects inconsistent with the gist of a scene and to remember them better. To investigate the effects of contextual consistency on attention allocation and object memory, two experiments were conducted. Both experiments used a 3×2 factorial design with scene presentation time (2 s, 5 s, and 10 s) as a between-subject factor and object-background contextual consistency (consistent, inconsistent) as a within-subject factor. In Experiment 1, eye movements were recorded while participants viewed line-drawing scenes. The eye movement patterns differed according to whether the scenes were consistent. Context-inconsistent objects showed earlier initial fixations, longer fixation times, and more frequent returns than context-consistent ones. These results are entirely consistent with previous studies: if an object is identified as inconsistent with the gist of a scene, it attracts attention. Furthermore, inconsistent objects and their locations in the scenes were recalled better than consistent ones. Experiment 2 was identical to Experiment 1 except that a dual-task paradigm was used to reduce the amount of attention available for the objects: participants had to detect the position of a probe appearing every second while viewing the scenes. Nonetheless, the result patterns were the same as in Experiment 1. Even when the attention available for the scene content was reduced, the same effects of contextual inconsistency were observed. These results indicate that object-background contextual consistency strongly influences how attention is allocated and how objects in a scene are remembered.
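
For a within-subject factor like consistency, fixation measures of this kind would typically be compared with a paired test. The sketch below uses invented placeholder values purely to show the shape of such an analysis, not the paper's data.

    # Paired comparison of a fixation measure between consistency
    # conditions; all numbers are invented placeholders.
    from scipy.stats import ttest_rel

    consistent   = [412, 389, 455, 401]   # total fixation time (ms)
    inconsistent = [538, 502, 573, 549]
    t, p = ttest_rel(inconsistent, consistent)
    print(f"t = {t:.2f}, p = {p:.4f}")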

Collaborative Place and Object Recognition in Video using Bidirectional Context Information

  • Kim, Sung-Ho; Kweon, In-So
    • The Journal of Korea Robotics Society / v.1 no.2 / pp.172-179 / 2006
  • In this paper, we present a practical place and object recognition method for guiding visitors in building environments. Recognizing places or objects in the real world can be difficult due to motion blur and camera noise. We present a modeling method based on the bidirectional interaction between places and objects, in which each reinforces the other for robust recognition. The unification of visual context, including scene context, object context, and temporal context, is also presented. The proposed system has been tested for guiding visitors in a large-scale building environment (10 topological places, 80 3D objects).
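
The bidirectional reinforcement can be pictured as two Bayesian updates feeding each other: detected objects reweight the place belief, and the place belief reweights object hypotheses. A schematic sketch, with all probability values invented for illustration:

    # Schematic bidirectional update between place belief and object
    # evidence; probability tables are illustrative assumptions.
    def normalize(d):
        s = sum(d.values())
        return {k: v / s for k, v in d.items()}

    def update_place(place_belief, obj_given_place, detected_obj):
        # P(place | object) is proportional to P(object | place) * P(place)
        return normalize({p: obj_given_place[p].get(detected_obj, 1e-6) * b
                          for p, b in place_belief.items()})

    belief = {"lobby": 0.5, "lab": 0.5}
    likelihood = {"lobby": {"sofa": 0.6}, "lab": {"sofa": 0.05}}
    print(update_place(belief, likelihood, "sofa"))  # shifts toward "lobby"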

A Study on 3D Animation Emotional Lighting Style

  • Cho, Jung-Sung
    • Proceedings of the Korea Contents Association Conference / 2005.11a / pp.153-160 / 2005
  • It is fair to say that the mood expressed in the scenes of a 3D animation is shaped mostly by the setup of 3D CG lighting. In the context of CG, lighting is the process of illuminating digital scenes in an artistic and technical manner so the audience can perceive what the director intends to display on the screen with the appropriate clarity and mood. Lighting makes scenes beautiful and harmonious, as an aesthetics of light and color created and controlled by people. It can also be stylized through symbolic and metaphorical methods to expose the environmental mood we pursue and the story we want to express. The concept of lighting style is thus intimately related to the particular context and art direction of an animation film. Unfortunately, there are no foolproof formulas for lighting a scene. In short, lighting helps define the style of a scene through the elements of the lighting setup, including position, color, intensity, and the area and scope of shadows. At the same time, we must not overlook the artistic aspect: lighting can suggest the overall mood of the animation genre and the style of a scene, such as tranquility, suspense, or high drama.
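
The setup elements listed above can be grouped into one structure for comparing lighting styles across scenes; the field names, types, and sample values here are assumptions, not from the paper.

    # Schematic container for the lighting-setup elements named above;
    # field names and units are assumptions.
    from dataclasses import dataclass

    @dataclass
    class LightSetup:
        position: tuple[float, float, float]  # (x, y, z) in scene units
        color: tuple[float, float, float]     # RGB in the 0-1 range
        intensity: float                      # relative brightness
        shadow_area: float                    # extent/softness of shadows

    key_light = LightSetup((2.0, 3.5, 1.0), (1.0, 0.9, 0.8), 1.2, 0.3)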

Detection of Video Scene Boundaries based on the Local and Global Context Information

  • Kang, Hang-Bong
    • Journal of KIISE: Computing Practices and Letters / v.8 no.6 / pp.778-786 / 2002
  • Scene boundary detection is important in the understanding of semantic structure from video data. However, it is more difficult than shot change detection because scene boundary detection needs to understand semantics in video data well. In this paper, we propose a new approach to scene segmentation using contextual information in video data. The contextual information is divided into two categories: local and global contextual information. The local contextual information refers to the foreground regions' information, background and shot activity. The global contextual information refers to the video shot's environment or its relationship with other video shots. Coherence, interaction and the tempo of video shots are computed as global contextual information. Using the proposed contextual information, we detect scene boundaries. Our proposed approach consists of three consecutive steps: linking, verification, and adjusting. We experimented the proposed approach using TV dramas and movies. The detection accuracy of correct scene boundaries is over than 80%.