• Title/Summary/Keyword: Large multimodal model

Search Result 13, Processing Time 0.019 seconds

Technical Trends in Artificial Intelligence for Robotics Based on Large Language Models (거대언어모델 기반 로봇 인공지능 기술 동향 )

  • J. Lee;S. Park;N.W. Kim;E. Kim;S.K. Ko
    • Electronics and Telecommunications Trends
    • /
    • v.39 no.1
    • /
    • pp.95-105
    • /
    • 2024
  • In natural language processing, large language models such as GPT-4 have recently been in the spotlight. The performance of natural language processing has advanced dramatically driven by an increase in the number of model parameters related to the number of acceptable input tokens and model size. Research on multimodal models that can simultaneously process natural language and image data is being actively conducted. Moreover, natural-language and image-based reasoning capabilities of large language models is being explored in robot artificial intelligence technology. We discuss research and related patent trends in robot task planning and code generation for robot control using large language models.

Multimodal Emotional State Estimation Model for Implementation of Intelligent Exhibition Services (지능형 전시 서비스 구현을 위한 멀티모달 감정 상태 추정 모형)

  • Lee, Kichun;Choi, So Yun;Kim, Jae Kyeong;Ahn, Hyunchul
    • Journal of Intelligence and Information Systems
    • /
    • v.20 no.1
    • /
    • pp.1-14
    • /
    • 2014
  • Both researchers and practitioners are showing an increased interested in interactive exhibition services. Interactive exhibition services are designed to directly respond to visitor responses in real time, so as to fully engage visitors' interest and enhance their satisfaction. In order to install an effective interactive exhibition service, it is essential to adopt intelligent technologies that enable accurate estimation of a visitor's emotional state from responses to exhibited stimulus. Studies undertaken so far have attempted to estimate the human emotional state, most of them doing so by gauging either facial expressions or audio responses. However, the most recent research suggests that, a multimodal approach that uses people's multiple responses simultaneously may lead to better estimation. Given this context, we propose a new multimodal emotional state estimation model that uses various responses including facial expressions, gestures, and movements measured by the Microsoft Kinect Sensor. In order to effectively handle a large amount of sensory data, we propose to use stratified sampling-based MRA (multiple regression analysis) as our estimation method. To validate the usefulness of the proposed model, we collected 602,599 responses and emotional state data with 274 variables from 15 people. When we applied our model to the data set, we found that our model estimated the levels of valence and arousal in the 10~15% error range. Since our proposed model is simple and stable, we expect that it will be applied not only in intelligent exhibition services, but also in other areas such as e-learning and personalized advertising.

Current Status and Direction of Generative Large Language Model Applications in Medicine - Focusing on East Asian Medicine - (생성형 거대언어모델의 의학 적용 현황과 방향 - 동아시아 의학을 중심으로 -)

  • Bongsu Kang;SangYeon Lee;Hyojin Bae;Chang-Eop Kim
    • Journal of Physiology & Pathology in Korean Medicine
    • /
    • v.38 no.2
    • /
    • pp.49-58
    • /
    • 2024
  • The rapid advancement of generative large language models has revolutionized various real-life domains, emphasizing the importance of exploring their applications in healthcare. This study aims to examine how generative large language models are implemented in the medical domain, with the specific objective of searching for the possibility and potential of integration between generative large language models and East Asian medicine. Through a comprehensive current state analysis, we identified limitations in the deployment of generative large language models within East Asian medicine and proposed directions for future research. Our findings highlight the essential need for accumulating and generating structured data to improve the capabilities of generative large language models in East Asian medicine. Additionally, we tackle the issue of hallucination and the necessity for a robust model evaluation framework. Despite these challenges, the application of generative large language models in East Asian medicine has demonstrated promising results. Techniques such as model augmentation, multimodal structures, and knowledge distillation have the potential to significantly enhance accuracy, efficiency, and accessibility. In conclusion, we expect generative large language models to play a pivotal role in facilitating precise diagnostics, personalized treatment in clinical fields, and fostering innovation in education and research within East Asian medicine.

Large Language Models: A Guide for Radiologists

  • Sunkyu Kim;Choong-kun Lee;Seung-seob Kim
    • Korean Journal of Radiology
    • /
    • v.25 no.2
    • /
    • pp.126-133
    • /
    • 2024
  • Large language models (LLMs) have revolutionized the global landscape of technology beyond natural language processing. Owing to their extensive pre-training on vast datasets, contemporary LLMs can handle tasks ranging from general functionalities to domain-specific areas, such as radiology, without additional fine-tuning. General-purpose chatbots based on LLMs can optimize the efficiency of radiologists in terms of their professional work and research endeavors. Importantly, these LLMs are on a trajectory of rapid evolution, wherein challenges such as "hallucination," high training cost, and efficiency issues are addressed, along with the inclusion of multimodal inputs. In this review, we aim to offer conceptual knowledge and actionable guidance to radiologists interested in utilizing LLMs through a succinct overview of the topic and a summary of radiology-specific aspects, from the beginning to potential future directions.

A Deep Learning Based Approach to Recognizing Accompanying Status of Smartphone Users Using Multimodal Data (스마트폰 다종 데이터를 활용한 딥러닝 기반의 사용자 동행 상태 인식)

  • Kim, Kilho;Choi, Sangwoo;Chae, Moon-jung;Park, Heewoong;Lee, Jaehong;Park, Jonghun
    • Journal of Intelligence and Information Systems
    • /
    • v.25 no.1
    • /
    • pp.163-177
    • /
    • 2019
  • As smartphones are getting widely used, human activity recognition (HAR) tasks for recognizing personal activities of smartphone users with multimodal data have been actively studied recently. The research area is expanding from the recognition of the simple body movement of an individual user to the recognition of low-level behavior and high-level behavior. However, HAR tasks for recognizing interaction behavior with other people, such as whether the user is accompanying or communicating with someone else, have gotten less attention so far. And previous research for recognizing interaction behavior has usually depended on audio, Bluetooth, and Wi-Fi sensors, which are vulnerable to privacy issues and require much time to collect enough data. Whereas physical sensors including accelerometer, magnetic field and gyroscope sensors are less vulnerable to privacy issues and can collect a large amount of data within a short time. In this paper, a method for detecting accompanying status based on deep learning model by only using multimodal physical sensor data, such as an accelerometer, magnetic field and gyroscope, was proposed. The accompanying status was defined as a redefinition of a part of the user interaction behavior, including whether the user is accompanying with an acquaintance at a close distance and the user is actively communicating with the acquaintance. A framework based on convolutional neural networks (CNN) and long short-term memory (LSTM) recurrent networks for classifying accompanying and conversation was proposed. First, a data preprocessing method which consists of time synchronization of multimodal data from different physical sensors, data normalization and sequence data generation was introduced. We applied the nearest interpolation to synchronize the time of collected data from different sensors. Normalization was performed for each x, y, z axis value of the sensor data, and the sequence data was generated according to the sliding window method. Then, the sequence data became the input for CNN, where feature maps representing local dependencies of the original sequence are extracted. The CNN consisted of 3 convolutional layers and did not have a pooling layer to maintain the temporal information of the sequence data. Next, LSTM recurrent networks received the feature maps, learned long-term dependencies from them and extracted features. The LSTM recurrent networks consisted of two layers, each with 128 cells. Finally, the extracted features were used for classification by softmax classifier. The loss function of the model was cross entropy function and the weights of the model were randomly initialized on a normal distribution with an average of 0 and a standard deviation of 0.1. The model was trained using adaptive moment estimation (ADAM) optimization algorithm and the mini batch size was set to 128. We applied dropout to input values of the LSTM recurrent networks to prevent overfitting. The initial learning rate was set to 0.001, and it decreased exponentially by 0.99 at the end of each epoch training. An Android smartphone application was developed and released to collect data. We collected smartphone data for a total of 18 subjects. Using the data, the model classified accompanying and conversation by 98.74% and 98.83% accuracy each. Both the F1 score and accuracy of the model were higher than the F1 score and accuracy of the majority vote classifier, support vector machine, and deep recurrent neural network. In the future research, we will focus on more rigorous multimodal sensor data synchronization methods that minimize the time stamp differences. In addition, we will further study transfer learning method that enables transfer of trained models tailored to the training data to the evaluation data that follows a different distribution. It is expected that a model capable of exhibiting robust recognition performance against changes in data that is not considered in the model learning stage will be obtained.

Overcoming the Challenges in the Development and Implementation of Artificial Intelligence in Radiology: A Comprehensive Review of Solutions Beyond Supervised Learning

  • Gil-Sun Hong;Miso Jang;Sunggu Kyung;Kyungjin Cho;Jiheon Jeong;Grace Yoojin Lee;Keewon Shin;Ki Duk Kim;Seung Min Ryu;Joon Beom Seo;Sang Min Lee;Namkug Kim
    • Korean Journal of Radiology
    • /
    • v.24 no.11
    • /
    • pp.1061-1080
    • /
    • 2023
  • Artificial intelligence (AI) in radiology is a rapidly developing field with several prospective clinical studies demonstrating its benefits in clinical practice. In 2022, the Korean Society of Radiology held a forum to discuss the challenges and drawbacks in AI development and implementation. Various barriers hinder the successful application and widespread adoption of AI in radiology, such as limited annotated data, data privacy and security, data heterogeneity, imbalanced data, model interpretability, overfitting, and integration with clinical workflows. In this review, some of the various possible solutions to these challenges are presented and discussed; these include training with longitudinal and multimodal datasets, dense training with multitask learning and multimodal learning, self-supervised contrastive learning, various image modifications and syntheses using generative models, explainable AI, causal learning, federated learning with large data models, and digital twins.

Feasibility Study of Determining the Healing Phase of Achilles Tendon Rupture in Rats Using Optical Coherence Tomography

  • Kim, Young-Sik;Chae, Yu-Gyeong;Jeon, Min Yong;Kim, Dong Kyu;Ahn, Yeh-Chan
    • Journal of the Optical Society of Korea
    • /
    • v.19 no.2
    • /
    • pp.175-181
    • /
    • 2015
  • Optical coherence tomography (OCT) is a noninvasive technique for microscopic investigation of tissue. We thought that the OCT method could be a potential tool for monitoring the healing process of a tendon. In this study we used two rat models, denervated and non-denervated groups, to observe a variety of healing phases of Achilles tendon (AT) injury. We made samples of AT injury lesions, to take OCT images and to make histopathological samples of serial sectional tissue. In an OCT image the denervated rat showed no specific finding, but the non-denervated rat showed a large defect lesion that was scaffolding tissue. OCT findings combined with pathologic findings showed advantages in visualization of tendon microstructure over other imaging modalities such as MRI and US, and OCT is beneficial to making a treatment plan, especially the timing and intensity of rehabilitation. Therefore a multimodal platform using OCT for evaluation of tendon injury may be potentially useful for many applications.

Human Action Recognition Using Pyramid Histograms of Oriented Gradients and Collaborative Multi-task Learning

  • Gao, Zan;Zhang, Hua;Liu, An-An;Xue, Yan-Bing;Xu, Guang-Ping
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.8 no.2
    • /
    • pp.483-503
    • /
    • 2014
  • In this paper, human action recognition using pyramid histograms of oriented gradients and collaborative multi-task learning is proposed. First, we accumulate global activities and construct motion history image (MHI) for both RGB and depth channels respectively to encode the dynamics of one action in different modalities, and then different action descriptors are extracted from depth and RGB MHI to represent global textual and structural characteristics of these actions. Specially, average value in hierarchical block, GIST and pyramid histograms of oriented gradients descriptors are employed to represent human motion. To demonstrate the superiority of the proposed method, we evaluate them by KNN, SVM with linear and RBF kernels, SRC and CRC models on DHA dataset, the well-known dataset for human action recognition. Large scale experimental results show our descriptors are robust, stable and efficient, and outperform the state-of-the-art methods. In addition, we investigate the performance of our descriptors further by combining these descriptors on DHA dataset, and observe that the performances of combined descriptors are much better than just using only sole descriptor. With multimodal features, we also propose a collaborative multi-task learning method for model learning and inference based on transfer learning theory. The main contributions lie in four aspects: 1) the proposed encoding the scheme can filter the stationary part of human body and reduce noise interference; 2) different kind of features and models are assessed, and the neighbor gradients information and pyramid layers are very helpful for representing these actions; 3) The proposed model can fuse the features from different modalities regardless of the sensor types, the ranges of the value, and the dimensions of different features; 4) The latent common knowledge among different modalities can be discovered by transfer learning to boost the performance.

Anomaly Detection Methodology Based on Multimodal Deep Learning (멀티모달 딥 러닝 기반 이상 상황 탐지 방법론)

  • Lee, DongHoon;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.28 no.2
    • /
    • pp.101-125
    • /
    • 2022
  • Recently, with the development of computing technology and the improvement of the cloud environment, deep learning technology has developed, and attempts to apply deep learning to various fields are increasing. A typical example is anomaly detection, which is a technique for identifying values or patterns that deviate from normal data. Among the representative types of anomaly detection, it is very difficult to detect a contextual anomaly that requires understanding of the overall situation. In general, detection of anomalies in image data is performed using a pre-trained model trained on large data. However, since this pre-trained model was created by focusing on object classification of images, there is a limit to be applied to anomaly detection that needs to understand complex situations created by various objects. Therefore, in this study, we newly propose a two-step pre-trained model for detecting abnormal situation. Our methodology performs additional learning from image captioning to understand not only mere objects but also the complicated situation created by them. Specifically, the proposed methodology transfers knowledge of the pre-trained model that has learned object classification with ImageNet data to the image captioning model, and uses the caption that describes the situation represented by the image. Afterwards, the weight obtained by learning the situational characteristics through images and captions is extracted and fine-tuning is performed to generate an anomaly detection model. To evaluate the performance of the proposed methodology, an anomaly detection experiment was performed on 400 situational images and the experimental results showed that the proposed methodology was superior in terms of anomaly detection accuracy and F1-score compared to the existing traditional pre-trained model.