• Title/Summary/Keyword: High-dimensional datasets


Group Contribution Method and Support Vector Regression based Model for Predicting Physical Properties of Aromatic Compounds (Group Contribution Method 및 Support Vector Regression 기반 모델을 이용한 방향족 화합물 물성치 예측에 관한 연구)

  • Kang, Ha Yeong;Oh, Chang Bo;Won, Yong Sun;Liu, J. Jay;Lee, Chang Jun
    • Journal of the Korean Society of Safety / v.36 no.1 / pp.1-8 / 2021
  • To simulate a process model in the field of chemical engineering, it is very important to identify the physical properties of novel materials as well as existing ones. However, it is difficult to measure physical properties through experiments because of the potential risk and cost involved. To address this, this study aims to develop a property prediction model based on the group contribution method for aromatic chemical compounds containing benzene rings. The benzene rings of aromatic materials have a significant impact on their physical properties. To establish the prediction model, 42 important functional groups that determine the physical properties are considered, and the numbers of functional groups in 147 aromatic chemical compounds are counted to prepare a dataset. Support vector regression is employed to build a prediction model that can handle sparse and high-dimensional data. To verify the efficacy of this approach, the results are compared with those of previous studies. Although the previous studies used different datasets, the comparison indicates improved performance in this study. Moreover, there are few reports on predicting the physical properties of aromatic compounds. This study provides an effective method for estimating the physical properties of unknown chemical compounds and contributes toward reducing the experimental effort required to measure physical properties.
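A minimal sketch of the modeling idea in this abstract, assuming the prepared dataset is a table of functional-group counts per compound: support vector regression (scikit-learn) is fit on the counts to predict a property value. The synthetic data and hyperparameters below are illustrative, not the authors' dataset or settings.

```python
# Sketch: group-contribution features (functional-group counts) -> SVR property model.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(147, 42)).astype(float)  # stand-in counts of 42 functional groups for 147 compounds
y = X @ rng.normal(size=42) + rng.normal(scale=0.5, size=147)  # stand-in property values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# RBF-kernel SVR; C, epsilon and gamma would normally be tuned by cross-validation.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
model.fit(X_train, y_train)
print("test MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```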

Attack Detection and Classification Method Using PCA and LightGBM in MQTT-based IoT Environment (MQTT 기반 IoT 환경에서의 PCA와 LightGBM을 이용한 공격 탐지 및 분류 방안)

  • Lee Ji Gu;Lee Soo Jin;Kim Young Won
    • Convergence Security Journal / v.22 no.4 / pp.17-24 / 2022
  • Recently, machine learning-based cyber attack detection and classification research has been actively conducted, achieving a high level of detection accuracy. However, low-spec IoT devices and large-scale network traffic make it difficult to apply machine learning-based detection models in IoT environments. Therefore, in this paper, we propose an efficient IoT attack detection and classification method using PCA (Principal Component Analysis) and LightGBM (Light Gradient Boosting Machine) on datasets collected in an MQTT (Message Queuing Telemetry Transport) IoT protocol environment, a protocol also used in the defense field. As a result of the experiments, even though the original dataset was reduced to about 15% of its size, the performance was nearly identical to that obtained with the original data. The method also showed the best performance in a comparative evaluation against the four dimensionality reduction techniques selected in this paper.
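A minimal sketch of the proposed pipeline, assuming the MQTT traffic features and attack labels are already available as numeric arrays: PCA reduces the feature dimension (here to roughly 15% of the original), and LightGBM classifies the reduced vectors. All data and hyperparameters below are stand-ins.

```python
# Sketch: PCA dimensionality reduction followed by LightGBM multi-class attack classification.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 40))        # stand-in traffic features
y = rng.integers(0, 5, size=5000)      # stand-in labels: normal + 4 attack types

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=6).fit(scaler.transform(X_train))   # ~15% of the original 40 features
Z_train = pca.transform(scaler.transform(X_train))
Z_test = pca.transform(scaler.transform(X_test))

clf = LGBMClassifier(n_estimators=200, learning_rate=0.1)
clf.fit(Z_train, y_train)
print(classification_report(y_test, clf.predict(Z_test)))
```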

Management of Construction Fields Information Using Low Altitude Close-range Aerial Images (저고도 근접 항공영상을 이용한 현장정보관리)

  • Cho, Young Sun;Lim, No Yeol;Joung, Woo Su;Jung, Sung Heuk;Choi, Seok Keun
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography / v.32 no.5 / pp.551-560 / 2014
  • Compared to other industrial sites, civil construction work not only takes a longer time but also involves complicated processes, such as integrated management, process control, and quality control, until completion. However, it is hard to keep construction sites under control, since numerous issues constantly emerge. This study aims to provide a dataset for comprehensively managing and monitoring the civil construction site, main design, drawings, process, construction cost, and other items in real time, using low-altitude close-range aerial images captured by UAV together with GPS surveying to handle three-dimensional spatial information quickly and accurately. As a result, we could provide up-to-date information for quick decision-making from planning to completion of construction, as well as objective site evaluation based on high-resolution three-dimensional spatial information and drawings. In addition, the present-condition map, longitudinal profile, and cross-sectional views were produced to rapidly provide various datasets, such as earthwork volume tables, specifications, and changes in ground level.

Design of an Arm Gesture Recognition System Using Feature Transformation and Hidden Markov Models (특징 변환과 은닉 마코프 모델을 이용한 팔 제스처 인식 시스템의 설계)

  • Heo, Se-Kyeong;Shin, Ye-Seul;Kim, Hye-Suk;Kim, In-Cheol
    • KIPS Transactions on Software and Data Engineering / v.2 no.10 / pp.723-730 / 2013
  • This paper presents the design of an arm gesture recognition system using the Kinect sensor. A variety of methods have been proposed for gesture recognition, ranging from Dynamic Time Warping (DTW) to Hidden Markov Models (HMM). Our system learns a unique HMM corresponding to each arm gesture from a set of sequential skeleton data. Whenever the same gesture is performed, the trajectory of each joint captured by the Kinect sensor may differ considerably from the previous one, depending on the length and/or orientation of the subject's arm. In order to obtain robust performance independent of these conditions, the proposed system executes a feature transformation in which feature vectors of joint positions are transformed into vectors of angles between joints. To improve the computational efficiency of learning and using HMMs, our system also performs k-means clustering to convert high-dimensional real-valued observation vectors into one-dimensional integer sequences that serve as inputs for discrete HMMs. The dimension reduction and discretization help our system use HMMs efficiently to recognize gestures in real-time environments. Finally, we demonstrate the recognition performance of our system through experiments using two different datasets.
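A minimal sketch of the feature transformation and discretization steps described above, assuming each Kinect frame provides 3-D joint positions: positions are converted to inter-joint angles, and k-means maps each angle vector to a single integer symbol, yielding the one-dimensional sequences a discrete HMM would consume. The joint names, number of clusters, and stand-in data are illustrative.

```python
# Sketch: joint positions -> inter-joint angles -> k-means symbols for a discrete HMM.
import numpy as np
from sklearn.cluster import KMeans

def joint_angle(a, b, c):
    """Angle at joint b (radians) formed by points a-b-c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def frame_to_angles(frame):
    """frame: dict of joint name -> xyz position; returns an angle feature vector."""
    return np.array([
        joint_angle(frame["shoulder"], frame["elbow"], frame["wrist"]),   # elbow flexion
        joint_angle(frame["spine"], frame["shoulder"], frame["elbow"]),   # shoulder angle
    ])

# Stand-in gesture sequence: 50 frames of random joint positions.
rng = np.random.default_rng(0)
frames = [{k: rng.normal(size=3) for k in ("spine", "shoulder", "elbow", "wrist")} for _ in range(50)]
angles = np.stack([frame_to_angles(f) for f in frames])          # (T, 2) real-valued features

# k-means codebook: each frame is mapped to one of K symbols (0..K-1), producing
# the one-dimensional integer sequence consumed by a discrete HMM.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(angles)
symbol_sequence = kmeans.predict(angles)
print(symbol_sequence[:10])
```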

Sentiment Analysis of Korean Reviews Using CNN: Focusing on Morpheme Embedding (CNN을 적용한 한국어 상품평 감성분석: 형태소 임베딩을 중심으로)

  • Park, Hyun-jung;Song, Min-chae;Shin, Kyung-shik
    • Journal of Intelligence and Information Systems / v.24 no.2 / pp.59-83 / 2018
  • With the increasing importance of sentiment analysis for grasping the needs of customers and the public, various types of deep learning models have been actively applied to English texts. In the sentiment analysis of English texts by deep learning, natural language sentences included in training and test datasets are usually converted into sequences of word vectors before being entered into the deep learning models. In this case, word vectors generally refer to vector representations of words obtained by splitting a sentence at space characters. There are several ways to derive word vectors, one of which is Word2Vec, used to produce the 300-dimensional Google word vectors from about 100 billion words of Google News data. These have been widely used in studies of sentiment analysis of reviews from various fields such as restaurants, movies, laptops, and cameras. Unlike in English, the morpheme plays an essential role in sentiment analysis and sentence structure analysis in Korean, which is a typical agglutinative language with developed postpositions and endings. A morpheme can be defined as the smallest meaningful unit of a language, and a word consists of one or more morphemes. For example, for the word '예쁘고', the morphemes are '예쁘(adjective)' and '고(connective ending)'. Reflecting the significance of Korean morphemes, it seems reasonable to adopt the morpheme as the basic unit in Korean sentiment analysis. Therefore, in this study, we use 'morpheme vectors' as input to a deep learning model rather than the 'word vectors' mainly used for English text. A morpheme vector is a vector representation of a morpheme and can be derived by applying an existing word vector derivation mechanism to sentences divided into their constituent morphemes. At this point, several questions arise. What is the desirable range of POS (Part-Of-Speech) tags when deriving morpheme vectors to improve the classification accuracy of a deep learning model? Is it proper to apply a typical word vector model, which primarily relies on the form of words, to Korean, which has a high homonym ratio? Will text preprocessing such as correcting spelling or spacing errors affect the classification accuracy, especially when drawing morpheme vectors from Korean product reviews containing many grammatical mistakes and variations? We seek empirical answers to these fundamental issues, which are likely to be encountered first when applying deep learning models to Korean texts. As a starting point, we summarize these issues as three central research questions. First, which is more effective as the initial input of a deep learning model: morpheme vectors from grammatically correct texts of a domain other than the analysis target, or morpheme vectors from considerably ungrammatical texts of the same domain? Second, what is an appropriate morpheme vector derivation method for Korean with respect to the range of POS tags, homonyms, text preprocessing, and minimum frequency? Third, can we achieve a satisfactory level of classification accuracy when applying deep learning to Korean sentiment analysis? As an approach to these research questions, we generate various types of morpheme vectors reflecting the research questions and then compare the classification accuracy using a non-static CNN (Convolutional Neural Network) model that takes the morpheme vectors as input. As the training and test datasets, Naver Shopping's 17,260 cosmetics product reviews are used.
To derive morpheme vectors, we use data from the same domain as the target and data from another domain: about 2 million cosmetics product reviews from Naver Shopping and 520,000 Naver News articles, which play a role analogous to Google's News data. The six primary sets of morpheme vectors constructed in this study differ in terms of the following three criteria. First, they come from two types of data source: Naver News, with high grammatical correctness, and Naver Shopping's cosmetics product reviews, with low grammatical correctness. Second, they are distinguished by the degree of data preprocessing, namely, only splitting sentences, or additionally correcting spelling and spacing after sentence separation. Third, they vary in the form of input fed into the word vector model: whether the morphemes themselves are entered, or the morphemes with their POS tags attached. The morpheme vectors further vary depending on the considered range of POS tags, the minimum frequency of morphemes included, and the random initialization range. All morpheme vectors are derived through the CBOW (Continuous Bag-Of-Words) model with a context window of 5 and a vector dimension of 300. It seems that utilizing same-domain text even with a lower degree of grammatical correctness, performing spelling and spacing corrections as well as sentence splitting, and incorporating morphemes of all POS tags, including the incomprehensible category, lead to better classification accuracy. The POS tag attachment, which is devised for the high proportion of homonyms in Korean, and the minimum frequency threshold for a morpheme to be included do not seem to have any definite influence on the classification accuracy.
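A minimal sketch of the morpheme-vector derivation described above, assuming review sentences have already been split into morphemes (with or without attached POS tags): CBOW Word2Vec with a context window of 5 and 300-dimensional vectors, here via gensim. The example morphemes and tag format are illustrative, and `vector_size` is the gensim 4.x argument name (older versions use `size`).

```python
# Sketch: training 300-dimensional morpheme vectors with CBOW (window 5) on
# morpheme-tokenized Korean review sentences.
from gensim.models import Word2Vec

# Each sentence is a list of morphemes; POS-tag-attached variants would look
# like "예쁘/Adjective", "고/Ending" (tag set is illustrative).
tokenized_reviews = [
    ["색상", "이", "예쁘", "고", "발림", "도", "좋", "다"],
    ["배송", "이", "빠르", "고", "가격", "도", "저렴", "하", "다"],
]

model = Word2Vec(
    sentences=tokenized_reviews,
    vector_size=300,   # morpheme vector dimension
    window=5,          # context window
    sg=0,              # 0 = CBOW, 1 = skip-gram
    min_count=1,       # minimum morpheme frequency (the study varies this threshold)
    seed=0,
)
vec = model.wv["예쁘"]   # 300-dimensional morpheme vector
print(vec.shape)
```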

CT Based 3-Dimensional Treatment Planning of Intracavitary Brachytherapy for Cancer of the Cervix : Comparison between Dose-Volume Histograms and ICRU Point Doses to the Rectum and Bladder

  • Hashim, Natasha;Jamalludin, Zulaikha;Ung, Ngie Min;Ho, Gwo Fuang;Malik, Rozita Abdul;Ee Phua, Vincent Chee
    • Asian Pacific Journal of Cancer Prevention / v.15 no.13 / pp.5259-5264 / 2014
  • Background: CT based brachytherapy allows 3-dimensional (3D) assessment of organs at risk (OAR) doses with dose volume histograms (DVHs). The purpose of this study was to compare computed tomography (CT) based volumetric calculations and International Commission on Radiation Units and Measurements (ICRU) reference-point estimates of radiation doses to the bladder and rectum in patients with carcinoma of the cervix treated with high-dose-rate (HDR) intracavitary brachytherapy (ICBT). Materials and Methods: Between March 2011 and May 2012, 20 patients were treated with 55 fractions of brachytherapy using tandem and ovoids and underwent post-implant CT scans. The external beam radiotherapy (EBRT) dose was 48.6 Gy in 27 fractions. HDR brachytherapy was delivered to a dose of 21 Gy in three fractions. The ICRU bladder and rectum point doses along with 4 additional rectal points were recorded. The maximum dose ($D_{Max}$) to the rectum was the highest recorded dose at one of these five points. Using the HDRplus 2.6 brachytherapy treatment planning system, the bladder and rectum were retrospectively contoured on the 55 CT datasets. The DVHs for rectum and bladder were calculated, and the minimum doses to the most irradiated 2cc volume of rectum and bladder were recorded ($D_{2cc}$) for all individual fractions. The mean $D_{2cc}$ of rectum was compared to the means of the ICRU rectal point and rectal $D_{Max}$ using the Student's t-test. The mean $D_{2cc}$ of bladder was compared with the mean ICRU bladder point using the same statistical test. The total dose, combining EBRT and HDR brachytherapy, was biologically normalized to the conventional 2 Gy/fraction using the linear-quadratic model (${\alpha}/{\beta}$ value of 10 Gy for the target, 3 Gy for organs at risk). Results: The total prescribed dose was $77.5Gy_{{\alpha}/{\beta}10}$. The mean dose to the rectum was $4.58{\pm}1.22Gy$ for $D_{2cc}$, $3.76{\pm}0.65Gy$ at $D_{ICRU}$ and $4.75{\pm}1.01Gy$ at $D_{Max}$. The mean rectal $D_{2cc}$ dose differed significantly from the mean dose calculated at the ICRU reference point (p<0.005); the mean difference was 0.82 Gy (0.48-1.19Gy). The mean EQD2 was $68.52{\pm}7.24Gy_{{\alpha}/{\beta}3}$ for $D_{2cc}$, $61.71{\pm}2.77Gy_{{\alpha}/{\beta}3}$ at $D_{ICRU}$ and $69.24{\pm}6.02Gy_{{\alpha}/{\beta}3}$ at $D_{Max}$. The mean ratio of $D_{2cc}$ rectum to $D_{ICRU}$ rectum was 1.25 and the mean ratio of $D_{2cc}$ rectum to $D_{Max}$ rectum was 0.98 for all individual fractions. The mean dose to the bladder was $6.00{\pm}1.90Gy$ for $D_{2cc}$ and $5.10{\pm}2.03Gy$ at $D_{ICRU}$. However, the mean $D_{2cc}$ dose did not differ significantly from the mean dose calculated at the ICRU reference point (p=0.307); the mean difference was 0.90 Gy (0.49-1.25Gy). The mean EQD2 was $81.85{\pm}13.03Gy_{{\alpha}/{\beta}3}$ for $D_{2cc}$ and $74.11{\pm}19.39Gy_{{\alpha}/{\beta}3}$ at $D_{ICRU}$. The mean ratio of $D_{2cc}$ bladder to $D_{ICRU}$ bladder was 1.24. In the majority of applications, the maximum dose point was not the ICRU point. On average, the rectum received 77% and the bladder received 92% of the prescribed dose. Conclusions: OAR doses assessed by DVH criteria were higher than ICRU point doses. Our data suggest that the estimated dose to the ICRU bladder point may be a reasonable surrogate for the $D_{2cc}$, and rectal $D_{Max}$ for $D_{2cc}$. However, the dose to the ICRU rectal point does not appear to be a reasonable surrogate for the $D_{2cc}$.
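The EQD2 normalization mentioned in the abstract follows the standard linear-quadratic conversion EQD2 = n·d·(d + α/β)/(2 + α/β). A short worked example using the quoted schedules (EBRT 48.6 Gy in 27 fractions, HDR 21 Gy in 3 fractions) and α/β = 10 Gy for the target reproduces the stated prescription of about 77.5 Gy:

```python
# Worked example: linear-quadratic EQD2 conversion of the quoted EBRT and HDR schedules.
def eqd2(total_dose_gy, n_fractions, alpha_beta_gy):
    d = total_dose_gy / n_fractions                      # dose per fraction
    return total_dose_gy * (d + alpha_beta_gy) / (2 + alpha_beta_gy)

ebrt = eqd2(48.6, 27, 10.0)    # ~47.8 Gy
hdr = eqd2(21.0, 3, 10.0)      # ~29.8 Gy
print(round(ebrt + hdr, 1))    # ~77.5, matching the stated prescribed dose
```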

Transfer Learning using Multiple ConvNet Layers Activation Features with Principal Component Analysis for Image Classification (전이학습 기반 다중 컨볼류션 신경망 레이어의 활성화 특징과 주성분 분석을 이용한 이미지 분류 방법)

  • Byambajav, Batkhuu;Alikhanov, Jumabek;Fang, Yang;Ko, Seunghyun;Jo, Geun Sik
    • Journal of Intelligence and Information Systems / v.24 no.1 / pp.205-225 / 2018
  • The Convolutional Neural Network (ConvNet) is one class of powerful Deep Neural Network that can analyze and learn hierarchies of visual features. The first network of this kind (the Neocognitron) was introduced in the 1980s. At that time, neural networks were not broadly used in either industry or academia because of the shortage of large-scale datasets and low computational power. However, a few decades later, in 2012, Krizhevsky made a breakthrough in the ILSVRC-12 visual recognition competition using a Convolutional Neural Network. That breakthrough revived people's interest in neural networks. The success of the Convolutional Neural Network was achieved through two main factors. The first is the emergence of advanced hardware (GPUs) for sufficient parallel computation. The second is the availability of large-scale datasets such as the ImageNet (ILSVRC) dataset for training. Unfortunately, many new domains are bottlenecked by these factors. For most domains, it is difficult and requires a great deal of effort to gather a large-scale dataset to train a ConvNet. Moreover, even if we have a large-scale dataset, training a ConvNet from scratch requires expensive resources and is time-consuming. These two obstacles can be addressed by using transfer learning. Transfer learning is a method for transferring knowledge from a source domain to a new domain. There are two major transfer learning cases: the first uses the ConvNet as a fixed feature extractor, and the second fine-tunes the ConvNet on a new dataset. In the first case, a pre-trained ConvNet (such as one trained on ImageNet) is used to compute feed-forward activations of the image, and activation features are extracted from specific layers. In the second case, the ConvNet classifier is replaced and retrained on the new dataset, and the weights of the pre-trained network are then fine-tuned with backpropagation. In this paper, we focus on using multiple ConvNet layers as a fixed feature extractor only. However, applying features of high dimensional complexity that are directly extracted from multiple ConvNet layers is still a challenging problem. We observe that features extracted from multiple ConvNet layers capture different characteristics of the image, which means a better representation could be obtained by finding the optimal combination of multiple ConvNet layers. Based on that observation, we propose to employ multiple ConvNet layer representations for transfer learning instead of a single ConvNet layer representation. Overall, our primary pipeline has three steps. First, images from the target task are fed forward through the pre-trained AlexNet, and the activation features of its three fully connected layers are extracted. Second, the activation features of the three layers are concatenated to obtain a multiple ConvNet layer representation, since it carries more information about the image. When the three fully connected layer features are concatenated, the resulting image representation has 9,192 (4096+4096+1000) dimensions. However, features extracted from multiple ConvNet layers are redundant and noisy since they come from the same ConvNet. Thus, in a third step, we use Principal Component Analysis (PCA) to select salient features before the training phase. When salient features are obtained, the classifier can classify images more accurately, and the performance of transfer learning can be improved.
To evaluate the proposed method, experiments are conducted on three standard datasets (Caltech-256, VOC07, and SUN397) to compare multiple ConvNet layer representations against a single ConvNet layer representation, using PCA for feature selection and dimension reduction. Our experiments demonstrated the importance of feature selection for the multiple ConvNet layer representation. Moreover, our proposed approach achieved 75.6% accuracy compared to the 73.9% accuracy achieved by the FC7 layer on the Caltech-256 dataset, 73.1% accuracy compared to the 69.2% accuracy achieved by the FC8 layer on the VOC07 dataset, and 52.2% accuracy compared to the 48.7% accuracy achieved by the FC7 layer on the SUN397 dataset. We also showed that our proposed approach achieved superior performance, with 2.8%, 2.1%, and 3.1% accuracy improvements on the Caltech-256, VOC07, and SUN397 datasets respectively, compared to existing work.
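A minimal sketch of the described pipeline, assuming torchvision's pre-trained AlexNet stands in for the network used by the authors: forward hooks collect the activations of the three fully connected layers, which are concatenated into a 9,192-dimensional vector, reduced with PCA, and passed to a simple classifier. The stand-in images, labels, and component count are illustrative.

```python
# Sketch: multi-layer AlexNet activations -> concatenation -> PCA -> linear classifier.
import numpy as np
import torch
from torchvision.models import alexnet
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

model = alexnet(pretrained=True).eval()   # newer torchvision: alexnet(weights="DEFAULT")

acts = {}
def hook(name):
    def _hook(module, inp, out):
        acts[name] = out.detach()
    return _hook

# FC6 / FC7 / FC8 correspond to classifier[1], classifier[4], classifier[6] in torchvision's AlexNet.
for name, idx in (("fc6", 1), ("fc7", 4), ("fc8", 6)):
    model.classifier[idx].register_forward_hook(hook(name))

def extract(batch):  # batch: (N, 3, 224, 224) normalized images
    with torch.no_grad():
        model(batch)
    return torch.cat([acts["fc6"], acts["fc7"], acts["fc8"]], dim=1).numpy()  # (N, 9192)

# Stand-in data: random images and labels, just to show the shapes involved.
images = torch.randn(16, 3, 224, 224)
labels = np.random.randint(0, 4, size=16)
features = extract(images)

reduced = PCA(n_components=8).fit_transform(features)   # keep salient components before training
clf = LinearSVC().fit(reduced, labels)
print(features.shape, reduced.shape)
```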