• Title/Summary/Keyword: multimodal input

Search Result 34, Processing Time 0.023 seconds

A Study of Anomaly Detection for ICT Infrastructure using Conditional Multimodal Autoencoder (ICT 인프라 이상탐지를 위한 조건부 멀티모달 오토인코더에 관한 연구)

  • Shin, Byungjin;Lee, Jonghoon;Han, Sangjin;Park, Choong-Shik
    • Journal of Intelligence and Information Systems
    • /
    • v.27 no.3
    • /
    • pp.57-73
    • /
    • 2021
  • Maintenance and prevention of failure through anomaly detection of ICT infrastructure is becoming important. System monitoring data is multidimensional time series data. When we deal with multidimensional time series data, we have difficulty in considering both characteristics of multidimensional data and characteristics of time series data. When dealing with multidimensional data, correlation between variables should be considered. Existing methods such as probability and linear base, distance base, etc. are degraded due to limitations called the curse of dimensions. In addition, time series data is preprocessed by applying sliding window technique and time series decomposition for self-correlation analysis. These techniques are the cause of increasing the dimension of data, so it is necessary to supplement them. The anomaly detection field is an old research field, and statistical methods and regression analysis were used in the early days. Currently, there are active studies to apply machine learning and artificial neural network technology to this field. Statistically based methods are difficult to apply when data is non-homogeneous, and do not detect local outliers well. The regression analysis method compares the predictive value and the actual value after learning the regression formula based on the parametric statistics and it detects abnormality. Anomaly detection using regression analysis has the disadvantage that the performance is lowered when the model is not solid and the noise or outliers of the data are included. There is a restriction that learning data with noise or outliers should be used. The autoencoder using artificial neural networks is learned to output as similar as possible to input data. It has many advantages compared to existing probability and linear model, cluster analysis, and map learning. It can be applied to data that does not satisfy probability distribution or linear assumption. In addition, it is possible to learn non-mapping without label data for teaching. However, there is a limitation of local outlier identification of multidimensional data in anomaly detection, and there is a problem that the dimension of data is greatly increased due to the characteristics of time series data. In this study, we propose a CMAE (Conditional Multimodal Autoencoder) that enhances the performance of anomaly detection by considering local outliers and time series characteristics. First, we applied Multimodal Autoencoder (MAE) to improve the limitations of local outlier identification of multidimensional data. Multimodals are commonly used to learn different types of inputs, such as voice and image. The different modal shares the bottleneck effect of Autoencoder and it learns correlation. In addition, CAE (Conditional Autoencoder) was used to learn the characteristics of time series data effectively without increasing the dimension of data. In general, conditional input mainly uses category variables, but in this study, time was used as a condition to learn periodicity. The CMAE model proposed in this paper was verified by comparing with the Unimodal Autoencoder (UAE) and Multi-modal Autoencoder (MAE). The restoration performance of Autoencoder for 41 variables was confirmed in the proposed model and the comparison model. The restoration performance is different by variables, and the restoration is normally well operated because the loss value is small for Memory, Disk, and Network modals in all three Autoencoder models. The process modal did not show a significant difference in all three models, and the CPU modal showed excellent performance in CMAE. ROC curve was prepared for the evaluation of anomaly detection performance in the proposed model and the comparison model, and AUC, accuracy, precision, recall, and F1-score were compared. In all indicators, the performance was shown in the order of CMAE, MAE, and AE. Especially, the reproduction rate was 0.9828 for CMAE, which can be confirmed to detect almost most of the abnormalities. The accuracy of the model was also improved and 87.12%, and the F1-score was 0.8883, which is considered to be suitable for anomaly detection. In practical aspect, the proposed model has an additional advantage in addition to performance improvement. The use of techniques such as time series decomposition and sliding windows has the disadvantage of managing unnecessary procedures; and their dimensional increase can cause a decrease in the computational speed in inference.The proposed model has characteristics that are easy to apply to practical tasks such as inference speed and model management.

Design of Multimodal User Interface using Speech and Gesture Recognition for Wearable Watch Platform (착용형 단말에서의 음성 인식과 제스처 인식을 융합한 멀티 모달 사용자 인터페이스 설계)

  • Seong, Ki Eun;Park, Yu Jin;Kang, Soon Ju
    • KIISE Transactions on Computing Practices
    • /
    • v.21 no.6
    • /
    • pp.418-423
    • /
    • 2015
  • As the development of technology advances at exceptional speed, the functions of wearable devices become more diverse and complicated, and many users find some of the functions difficult to use. In this paper, the main aim is to provide the user with an interface that is more friendly and easier to use. The speech recognition is easy to use and also easy to insert an input order. However, speech recognition is problematic when using on a wearable device that has limited computing power and battery. The wearable device cannot predict when the user will give an order through speech recognition. This means that while speech recognition must always be activated, because of the battery issue, the time taken waiting for the user to give an order is impractical. In order to solve this problem, we use gesture recognition. This paper describes how to use both speech and gesture recognition as a multimodal interface to increase the user's comfort.

Deep Multimodal MRI Fusion Model for Brain Tumor Grading (뇌 종양 등급 분류를 위한 심층 멀티모달 MRI 통합 모델)

  • Na, In-ye;Park, Hyunjin
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2022.05a
    • /
    • pp.416-418
    • /
    • 2022
  • Glioma is a type of brain tumor that occurs in glial cells and is classified into two types: high hrade hlioma with a poor prognosis and low grade glioma. Magnetic resonance imaging (MRI) as a non-invasive method is widely used in glioma diagnosis research. Studies to obtain complementary information by combining multiple modalities to overcome the incomplete information limitation of single modality are being conducted. In this study, we developed a 3D CNN-based model that applied input-level fusion to MRI of four modalities (T1, T1Gd, T2, T2-FLAIR). The trained model showed classification performance of 0.8926 accuracy, 0.9688 sensitivity, 0.6400 specificity, and 0.9467 AUC on the validation data. Through this, it was confirmed that the grade of glioma was effectively classified by learning the internal relationship between various modalities.

  • PDF

Implementation of Pen-Gesture Recognition System for Multimodal User Interface (멀티모달 사용자 인터페이스를 위한 펜 제스처인식기의 구현)

  • 오준택;이우범;김욱현
    • Proceedings of the IEEK Conference
    • /
    • 2000.11c
    • /
    • pp.121-124
    • /
    • 2000
  • In this paper, we propose a pen gesture recognition system for user interface in multimedia terminal which requires fast processing time and high recognition rate. It is realtime and interaction system between graphic and text module. Text editing in recognition system is performed by pen gesture in graphic module or direct editing in text module, and has all 14 editing functions. The pen gesture recognition is performed by searching classification features that extracted from input strokes at pen gesture model. The pen gesture model has been constructed by classification features, ie, cross number, direction change, direction code number, position relation, distance ratio information about defined 15 types. The proposed recognition system has obtained 98% correct recognition rate and 30msec average processing time in a recognition experiment.

  • PDF

Improving Transformer with Dynamic Convolution and Shortcut for Video-Text Retrieval

  • Liu, Zhi;Cai, Jincen;Zhang, Mengmeng
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.16 no.7
    • /
    • pp.2407-2424
    • /
    • 2022
  • Recently, Transformer has made great progress in video retrieval tasks due to its high representation capability. For the structure of a Transformer, the cascaded self-attention modules are capable of capturing long-distance feature dependencies. However, the local feature details are likely to have deteriorated. In addition, increasing the depth of the structure is likely to produce learning bias in the learned features. In this paper, an improved Transformer structure named TransDCS (Transformer with Dynamic Convolution and Shortcut) is proposed. A Multi-head Conv-Self-Attention module is introduced to model the local dependencies and improve the efficiency of local features extraction. Meanwhile, the augmented shortcuts module based on a dual identity matrix is applied to enhance the conduction of input features, and mitigate the learning bias. The proposed model is tested on MSRVTT, LSMDC and Activity-Net benchmarks, and it surpasses all previous solutions for the video-text retrieval task. For example, on the LSMDC benchmark, a gain of about 2.3% MdR and 6.1% MnR is obtained over recently proposed multimodal-based methods.

Technical Trends in Artificial Intelligence for Robotics Based on Large Language Models (거대언어모델 기반 로봇 인공지능 기술 동향 )

  • J. Lee;S. Park;N.W. Kim;E. Kim;S.K. Ko
    • Electronics and Telecommunications Trends
    • /
    • v.39 no.1
    • /
    • pp.95-105
    • /
    • 2024
  • In natural language processing, large language models such as GPT-4 have recently been in the spotlight. The performance of natural language processing has advanced dramatically driven by an increase in the number of model parameters related to the number of acceptable input tokens and model size. Research on multimodal models that can simultaneously process natural language and image data is being actively conducted. Moreover, natural-language and image-based reasoning capabilities of large language models is being explored in robot artificial intelligence technology. We discuss research and related patent trends in robot task planning and code generation for robot control using large language models.

Design of a Deep Neural Network Model for Image Caption Generation (이미지 캡션 생성을 위한 심층 신경망 모델의 설계)

  • Kim, Dongha;Kim, Incheol
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.6 no.4
    • /
    • pp.203-210
    • /
    • 2017
  • In this paper, we propose an effective neural network model for image caption generation and model transfer. This model is a kind of multi-modal recurrent neural network models. It consists of five distinct layers: a convolution neural network layer for extracting visual information from images, an embedding layer for converting each word into a low dimensional feature, a recurrent neural network layer for learning caption sentence structure, and a multi-modal layer for combining visual and language information. In this model, the recurrent neural network layer is constructed by LSTM units, which are well known to be effective for learning and transferring sequence patterns. Moreover, this model has a unique structure in which the output of the convolution neural network layer is linked not only to the input of the initial state of the recurrent neural network layer but also to the input of the multimodal layer, in order to make use of visual information extracted from the image at each recurrent step for generating the corresponding textual caption. Through various comparative experiments using open data sets such as Flickr8k, Flickr30k, and MSCOCO, we demonstrated the proposed multimodal recurrent neural network model has high performance in terms of caption accuracy and model transfer effect.

Reliability Analysis Using Parametric and Nonparametric Input Modeling Methods (모수적·비모수적 입력모델링 기법을 이용한 신뢰성 해석)

  • Kang, Young-Jin;Hong, Jimin;Lim, O-Kaung;Noh, Yoojeong
    • Journal of the Computational Structural Engineering Institute of Korea
    • /
    • v.30 no.1
    • /
    • pp.87-94
    • /
    • 2017
  • Reliability analysis(RA) and Reliability-based design optimization(RBDO) require statistical modeling of input random variables, which is parametrically or nonparametrically determined based on experimental data. For the parametric method, goodness-of-fit (GOF) test and model selection method are widely used, and a sequential statistical modeling method combining the merits of the two methods has been recently proposed. Kernel density estimation(KDE) is often used as a nonparametric method, and it well describes a distribution function when the number of data is small or a density function has multimodal distribution. Although accurate statistical models are needed to obtain accurate RA and RBDO results, accurate statistical modeling is difficult when the number of data is small. In this study, the accuracy of two statistical modeling methods, SSM and KDE, were compared according to the number of data. Through numerical examples, the RA results using the input models modeled by two methods were compared, and appropriate modeling method was proposed according to the number of data.

A new human-robot interaction method using semantic symbols

  • Park, Sang-Hyun;Hwang, Jung-Hoon;Kwon, Dong-Soo
    • 제어로봇시스템학회:학술대회논문집
    • /
    • 2004.08a
    • /
    • pp.2005-2010
    • /
    • 2004
  • As robots become more prevalent in human daily life, situations requiring interaction between humans and robots will occur more frequently. Therefore, human-robot interaction (HRI) is becoming increasingly important. Although robotics researchers have made many technical developments in their field, intuitive and easy ways for most common users to interact with robots are still lacking. This paper introduces a new approach to enhance human-robot interaction using a semantic symbol language and proposes a method to acquire the intentions of robot users. In the proposed approach, each semantic symbol represents knowledge about either the environment or an action that a robot can perform. Users'intentions are expressed by symbolized multimodal information. To interpret a users'command, a probabilistic approach is used, which is appropriate for interpreting a freestyle user expression or insufficient input information. Therefore, a first-order Markov model is constructed as a probabilistic model, and a questionnaire is conducted to obtain state transition probabilities for this Markov model. Finally, we evaluated our model to show how well it interprets users'commands.

  • PDF

AN IMAGE THRESHOLDING METHOD BASED ON THE TARGET EXTRACTION

  • Zhang, Yunjie;Li, Yi;Gao, Zhijun;Wang, Weina
    • Journal of applied mathematics & informatics
    • /
    • v.26 no.3_4
    • /
    • pp.661-672
    • /
    • 2008
  • In this paper an algorithm, based on extracting a certain target of an image, is proposed that is capable of performing bilevel thresholding of image with multimodal distribution. Each pixel in the image has a membership value which is used to denote the characteristic relationship between the pixel and its belonging region (i.e. the object or background). Using the membership values of image set, a new measurement, which simultaneously measures the measure of fuzziness and the conditional entropy of the image, is calculated. Then, thresholds are found by optimally minimizing calculated measurement. In addition, a fuzzy range is defined to improve the threshold values. The experimental results demonstrate that the proposed approach can select the thresholds automatically and effectively extract the meaningful target from the input image. The resulting image can preserve the object region we target very well.

  • PDF