• Title/Summary/Keyword: vector space model

Optimal supervised LSA method using selective feature dimension reduction (선택적 자질 차원 축소를 이용한 최적의 지도적 LSA 방법)

  • Kim, Jung-Ho;Kim, Myung-Kyu;Cha, Myung-Hoon;In, Joo-Ho;Chae, Soo-Hoan
    • Science of Emotion and Sensibility / v.13 no.1 / pp.47-60 / 2010
  • Most research on classification has used kNN (k-Nearest Neighbor) and SVM (Support Vector Machine), which are known as learning-based models, and the Bayesian classifier and NNA (Neural Network Algorithm), which are known as statistics-based methods. However, these approaches face space and time limitations when classifying the very large number of web pages on today's internet. Moreover, most classification studies use a uni-gram feature representation, which does not capture the real meaning of words well. Korean web page classification has the additional problem that many Korean words have multiple meanings (polysemy). For these reasons, LSA (Latent Semantic Analysis) has been proposed for classification in this setting (large data sets and polysemous words). LSA uses SVD (Singular Value Decomposition), which decomposes the original term-document matrix into three matrices and reduces their dimensions. This yields a new low-dimensional semantic space for representing vectors, which makes classification efficient and makes it possible to analyze the latent meaning of words or documents (or web pages). Although LSA is good at classification, it has a drawback: when SVD reduces the dimensions of the matrix and creates the new semantic space, it considers which dimensions represent the vectors well, not which dimensions discriminate between them well. This is why LSA does not improve classification performance as much as expected. In this paper, we propose a new LSA that selects the optimal dimensions for both discriminating and representing vectors, minimizing this drawback and improving performance. The proposed method shows better and more stable performance than other LSA methods in low-dimensional spaces. In addition, we obtain further improvement in classification by creating and selecting features through stopword removal and statistical weighting of specific values.
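
The SVD step described in this abstract can be written compactly. The block below is the textbook rank-k LSA formulation, not the paper's selective variant; the symbols (A for the m×n term-document matrix, k for the number of retained dimensions, d for a new document's term vector) are notation assumed here for illustration.

```latex
A = U \Sigma V^{\top}, \qquad
A \approx A_k = U_k \Sigma_k V_k^{\top}, \qquad
\hat{d} = \Sigma_k^{-1} U_k^{\top} d
```

Here U_k (m×k) and V_k (n×k) hold the singular vectors for the k largest singular values in the diagonal matrix Σ_k, and d̂ is the document's position in the reduced semantic space. Standard LSA keeps the k dimensions with the largest singular values, which is a choice based on representation alone; the abstract's proposal is to also take discrimination into account when choosing dimensions.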

Incremental Regression based on a Sliding Window for Stream Data Prediction (스트림 데이타 예측을 위한 슬라이딩 윈도우 기반 점진적 회귀분석)

  • Kim, Sung-Hyun;Jin, Long;Ryu, Keun-Ho
    • Journal of KIISE:Databases / v.34 no.6 / pp.483-492 / 2007
  • Conventional time-series prediction techniques use a model generated in the training step and apply it to new input data without any change. If such a model is applied directly to stream data, prediction accuracy decreases. This paper proposes a stream data prediction technique using a sliding window and regression. The technique takes into account that the characteristics of a time series may change over time. It is composed of two steps. The first step executes a fractional process for applying input data to the regression model. The second step updates the model by using its information as new data. Additionally, the model is maintained using only the recent data held in a queue. This approach has two advantages. It keeps only the minimum information of the model in a matrix, so space complexity is reduced. Moreover, it prevents the error rate from increasing by updating the model over time. The accuracy of the proposed method is measured by RME (Relative Mean Error) and RMSE (Root Mean Square Error). The stream data prediction experiments show that the proposed technique, IMQR (Incremental Multiple Quadratic Regression), is more efficient than MLR (Multiple Linear Regression) and SVR (Support Vector Regression).
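
The idea of keeping only recent data in a queue and refitting from a small set of stored sums can be sketched as follows. This is a simplified illustration, not the paper's IMQR: it fits a single-variable straight line rather than a multiple quadratic regression, and the class name `SlidingWindowRegression` and its methods are hypothetical.

```python
from collections import deque


class SlidingWindowRegression:
    """Least-squares line y = a + b*x over the most recent `window` points."""

    def __init__(self, window):
        self.window = window
        self.points = deque()  # queue holding only the recent data
        # Running sums: the minimum information needed to refit the line.
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, x, y):
        """Add a new observation and evict the oldest once the window is full."""
        self.points.append((x, y))
        self._accumulate(x, y, 1.0)
        if len(self.points) > self.window:
            old_x, old_y = self.points.popleft()
            self._accumulate(old_x, old_y, -1.0)

    def _accumulate(self, x, y, sign):
        self.sx += sign * x
        self.sy += sign * y
        self.sxx += sign * x * x
        self.sxy += sign * x * y

    def coefficients(self):
        """Return (a, b) from the closed-form normal equations."""
        n = len(self.points)
        denom = n * self.sxx - self.sx ** 2
        if n < 2 or denom == 0:
            raise ValueError("need at least two distinct x values")
        b = (n * self.sxy - self.sx * self.sy) / denom
        a = (self.sy - b * self.sx) / n
        return a, b

    def predict(self, x):
        a, b = self.coefficients()
        return a + b * x


model = SlidingWindowRegression(window=4)
for x in range(10):          # first regime: y = 2x + 1
    model.update(x, 2 * x + 1)
for x in range(10, 14):      # the stream drifts: y = 100 - x
    model.update(x, 100 - x)
print(model.coefficients())  # fitted only to the last four points
```

Because old points are subtracted from the sums as they leave the queue, memory stays constant in the window size and the fitted line follows the drift instead of averaging over the whole history.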

A Study on Automatic Classification Model of Documents Based on Korean Standard Industrial Classification (한국표준산업분류를 기준으로 한 문서의 자동 분류 모델에 관한 연구)

  • Lee, Jae-Seong;Jun, Seung-Pyo;Yoo, Hyoung Sun
    • Journal of Intelligence and Information Systems / v.24 no.3 / pp.221-241 / 2018
  • As we enter the knowledge society, the importance of information as a new form of capital is being emphasized. Information classification is also becoming more important for the efficient management of exponentially growing digital information. In this study, we sought to automatically classify and provide tailored information that can help companies make decisions on technology commercialization. To that end, we propose a method to classify information based on the Korean Standard Industrial Classification (KSIC), which indicates the business characteristics of enterprises. The classification of information or documents has largely been based on machine learning, but there is not enough training data categorized on the basis of KSIC. Therefore, this study applied a method of calculating the similarity between documents. Specifically, we propose a method and a model that present the most appropriate KSIC code by collecting the explanatory texts of each KSIC code and calculating their similarity to the document to be classified using the vector space model. IPC data were collected and classified by KSIC, and the methodology was then verified by comparing the results with the KSIC-IPC concordance table provided by the Korean Intellectual Property Office. In the verification, the highest agreement was obtained when the LT method, a kind of TF-IDF calculation formula, was applied. In this case, the first-ranked KSIC code matched 53% of the time, and the cumulative match rate through the fifth rank was 76%. This confirms that the KSIC classification of the technology, industry, and market information that SMEs need can be performed more quantitatively and objectively. In addition, the methods and results of this study can serve as basic data to support the qualitative judgment of experts in creating a linkage table between heterogeneous classification systems.
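
Ranking category codes by similarity between their explanatory texts and a target document can be sketched in a few lines. The weighting below (log-scaled term frequency times an IDF-plus-one term) is one common TF-IDF variant chosen for illustration and is not claimed to be the paper's LT formula; the function `rank_codes` and the sample codes and descriptions are hypothetical, not real KSIC entries.

```python
import math
from collections import Counter


def rank_codes(code_texts, document):
    """Rank category codes by cosine similarity between TF-IDF vectors."""
    codes = list(code_texts)
    docs = [code_texts[c].lower().split() for c in codes]
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed weighting: log-scaled TF and IDF plus one (one common variant).
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}

    def vectorize(tokens):
        tf = Counter(t for t in tokens if t in idf)  # ignore unseen terms
        return {t: (1.0 + math.log(c)) * idf[t] for t, c in tf.items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    query = vectorize(document.lower().split())
    scored = [(code, cosine(query, vectorize(doc)))
              for code, doc in zip(codes, docs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical codes and descriptions, invented for this sketch.
code_texts = {
    "A": "crop farming and grain production",
    "B": "software development and programming services",
    "C": "steel manufacturing and metal casting",
}
ranking = rank_codes(code_texts, "custom software programming for metal parts")
print(ranking)
```

Returning the full ranked list rather than only the top code mirrors the abstract's evaluation, which reports agreement at the first rank and cumulatively through the fifth rank.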

A Study on Market Size Estimation Method by Product Group Using Word2Vec Algorithm (Word2Vec을 활용한 제품군별 시장규모 추정 방법에 관한 연구)

  • Jung, Ye Lim;Kim, Ji Hui;Yoo, Hyoung Sun
    • Journal of Intelligence and Information Systems / v.26 no.1 / pp.1-21 / 2020
  • With the rapid development of artificial intelligence technology, various techniques have been developed to extract meaningful information from unstructured text data, which constitutes a large portion of big data. Over the past decades, text mining technologies have been utilized in various industries for practical applications. In the field of business intelligence, they have been employed to discover new market and/or technology opportunities and to support rational decision making by business participants. Market information such as market size, market growth rate, and market share is essential for setting companies' business strategies. There has been continuous demand in various fields for market information at the specific product level. However, such information has generally been provided at the industry level or for broad categories based on classification standards, making it difficult to obtain specific and appropriate information. In this regard, we propose a new methodology that can estimate the market sizes of product groups at more detailed levels than previously offered. We applied the Word2Vec algorithm, a neural network-based semantic word embedding model, to enable automatic market size estimation from individual companies' product information in a bottom-up manner. The overall process is as follows. First, data related to product information is collected, refined, and restructured into a form suitable for the Word2Vec model. Next, the preprocessed data is embedded into a vector space by Word2Vec, and product groups are derived by extracting similar product names based on cosine similarity. Finally, the sales data of the extracted products is summed to estimate the market size of the product groups. As experimental data, product names from Statistics Korea's microdata (345,103 cases) were mapped into a multidimensional vector space by Word2Vec training. We optimized the training parameters and then applied a vector dimension of 300 and a window size of 15 in further experiments. We employed the index words of the Korean Standard Industrial Classification (KSIC) as a product name dataset to cluster product groups more efficiently. Product names similar to the KSIC index words were extracted based on cosine similarity. The market size of the extracted products, treated as one product category, was calculated from individual companies' sales data. The market sizes of 11,654 specific product lines were automatically estimated by the proposed model. For performance verification, the results were compared with the actual market sizes of some items; the Pearson correlation coefficient was 0.513. Our approach has several advantages over previous studies. First, text mining and machine learning techniques were applied to market size estimation for the first time, overcoming the limitations of traditional methods that rely on sampling or require multiple assumptions. In addition, the level of the market category can be adjusted easily and efficiently according to the purpose of information use by changing the cosine similarity threshold. Furthermore, the approach has high potential for practical application, since it can meet unmet needs for detailed market size information in the public and private sectors. Specifically, it can be utilized in technology evaluation and technology commercialization support programs conducted by governmental institutions, as well as in business strategy consulting and market analysis report publishing by private firms. The limitation of our study is that the presented model needs to be improved in terms of accuracy and reliability. The semantic-based word embedding module can be advanced by giving a proper order to the preprocessed dataset or by combining another algorithm, such as Jaccard similarity, with Word2Vec. Also, the product group clustering method can be changed to other types of unsupervised machine learning algorithms. Our group is currently working on subsequent studies, and we expect them to further improve the performance of the basic model conceptually proposed in this study.
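
The grouping and summation steps of this abstract, including the effect of the cosine similarity threshold, can be illustrated with a small sketch. It does not train Word2Vec: the 2-D vectors stand in for trained embeddings (the abstract reports 300 dimensions), and the function `estimate_market_size`, the product names, and the sales figures are all invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def estimate_market_size(index_vec, product_vecs, sales, threshold):
    """Group products similar to an index word and sum their sales."""
    group = sorted(
        name for name, vec in product_vecs.items()
        if cosine(index_vec, vec) >= threshold
    )
    return group, sum(sales[name] for name in group)


# Toy 2-D vectors standing in for trained embeddings (all values invented).
index_vec = (1.0, 0.0)
product_vecs = {
    "led lamp": (0.9, 0.1),
    "desk lamp": (0.8, 0.3),
    "rice cooker": (0.1, 0.9),
}
sales = {"led lamp": 100, "desk lamp": 50, "rice cooker": 70}

broad = estimate_market_size(index_vec, product_vecs, sales, threshold=0.90)
narrow = estimate_market_size(index_vec, product_vecs, sales, threshold=0.95)
print(broad, narrow)
```

Raising the threshold shrinks the product group and therefore the estimated market size, which is the adjustment of category level that the abstract describes.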

A Balanced Cognition-Affect Model of Information Systems Continuance for Mobile Internet Service (모바일 인터넷 서비스를 위한 정보시스템 지속성에 대한 이성과 감성의 조화 모델)

  • Kim, Ki-Eun;Kim, Hee-Woong
    • Science of Emotion and Sensibility / v.11 no.4 / pp.461-480 / 2008
  • There are innumerable studies on technology adoption and usage continuance; most examine cognitive factors, while affective factors, that is, the feelings of users, remain relatively unexplored. Although attitude and user satisfaction are commonly considered in Information Systems (IS) research, they represent only some aspects of feelings. In contrast, researchers in diverse fields have begun to note the importance of feelings in understanding and predicting human behavior. Feelings are expected to be essential particularly in the context of modern applications such as mobile internet (M-internet) services, where users are not simply technology users but also service consumers. Drawing on consumer research, social psychology, and computer science, this study proposes a balanced cognition-affect model of IS continuance. Prior IS research has already considered emotional factors, commonly enjoyment, anxiety, affect, and satisfaction. The main difference in our study is that the factors we use are the primary dimensions of affect according to the Circumplex Model of Affect. The horizontal axis of the model represents the pleasure dimension and the vertical axis represents the arousal dimension. Other emotional factors such as enjoyment and anxiety can be viewed as combinations of these two dimensions and can be placed in the vector space they form. Affect has been defined as the enjoyment a person derives from using computers. Satisfaction has different conceptualizations; it has been conceptualized as a judgment based on expectation disconfirmation theory. Thus, while prior works considered the direct and indirect effects of "feeling-related constructs" (enjoyment and anxiety) on usage behavior, our study proposes effects of "feeling-based constructs" (pleasure and arousal). The balanced cognition-affect model is tested in a survey of M-internet service users. The results establish the validity of the model.

Selective Word Embedding for Sentence Classification by Considering Information Gain and Word Similarity (문장 분류를 위한 정보 이득 및 유사도에 따른 단어 제거와 선택적 단어 임베딩 방안)

  • Lee, Min Seok;Yang, Seok Woo;Lee, Hong Joo
    • Journal of Intelligence and Information Systems / v.25 no.4 / pp.105-122 / 2019
  • Dimensionality reduction is one of the methods for handling big data in text mining. For dimensionality reduction, we should consider the density of the data, which has a significant influence on the performance of sentence classification. Higher-dimensional data requires a great deal of computation, which can lead to high computational cost and overfitting in the model. Thus, a dimension reduction process is necessary to improve the performance of the model. Diverse methods have been proposed, ranging from merely reducing noise in the data, such as misspellings or informal text, to incorporating semantic and syntactic information. In addition, the representation and selection of text features affect the performance of the classifier in sentence classification, which is one of the fields of Natural Language Processing. The common goal of dimension reduction is to find a latent space that is representative of the raw data in the observation space. Existing methods use various algorithms for dimensionality reduction, such as feature extraction and feature selection. In addition to these algorithms, word embeddings, which learn low-dimensional vector space representations of words that capture semantic and syntactic information from data, are also used. To improve performance, recent studies have suggested methods in which the word dictionary is modified according to the positive and negative scores of pre-defined words. The basic idea of this study is that similar words have similar vector representations. Once the feature selection algorithm selects the words that are not important, we assumed that words similar to the selected words also have no impact on sentence classification. This study proposes two ways to achieve more accurate classification, both of which eliminate words selectively under specific rules and then construct word embeddings based on Word2Vec. To select words of low importance from the text, we use the information gain algorithm to measure importance and cosine similarity to search for similar words. First, we eliminate words that have comparatively low information gain values from the raw text and form the word embedding. Second, we additionally select words that are similar to the words with low information gain values and build the word embedding. Finally, the filtered text and word embeddings are fed to two deep learning models: a Convolutional Neural Network and an Attention-Based Bidirectional LSTM. This study uses customer reviews of Kindle on Amazon.com, IMDB, and Yelp as datasets and classifies each dataset using the deep learning models. Reviews that received more than five helpful votes and whose ratio of helpful votes exceeded 70% were classified as helpful reviews. Yelp shows only the number of helpful votes. From 750,000 reviews, we extracted by random sampling 100,000 reviews that received more than five helpful votes. Minimal preprocessing, such as removing numbers and special characters from the text data, was applied to each dataset. To evaluate the proposed methods, we compared their performance with that of Word2Vec and GloVe word embeddings that used all the words. We showed that one of the proposed methods performs better than the embeddings with all the words. Removing unimportant words improves performance; however, removing too many words lowers it. Future research should consider diverse ways of preprocessing and an in-depth analysis of word co-occurrence for measuring similarity among words. Also, we applied the proposed method only with Word2Vec. Other embedding methods such as GloVe, fastText, and ELMo can be applied with the proposed methods, making it possible to identify promising combinations of word embedding methods and elimination methods.
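
The first elimination step, scoring each word by information gain and dropping the low scorers, can be sketched as follows. This is a minimal illustration under assumed details: a word's information gain is computed from its presence or absence in each sentence, the cutoff is a fixed threshold, and the function names and toy sentences are hypothetical. The second step (also removing words whose embeddings are close to the low-gain words) is not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(word, sentences, labels):
    """Reduction in label entropy from knowing whether `word` is present."""
    with_word = [y for s, y in zip(sentences, labels) if word in s]
    without_word = [y for s, y in zip(sentences, labels) if word not in s]
    n = len(labels)
    conditional = (len(with_word) / n) * entropy(with_word) \
        + (len(without_word) / n) * entropy(without_word)
    return entropy(labels) - conditional


def low_gain_words(sentences, labels, threshold):
    """Return the words whose information gain falls below `threshold`."""
    vocabulary = {w for s in sentences for w in s}
    return {w for w in vocabulary
            if information_gain(w, sentences, labels) < threshold}


# Toy tokenized sentences with binary labels (invented for illustration).
sentences = [["great", "book"], ["great", "story"],
             ["bad", "book"], ["bad", "story"]]
labels = [1, 1, 0, 0]

removed = low_gain_words(sentences, labels, threshold=0.5)
filtered = [[w for w in s if w not in removed] for s in sentences]
print(removed, filtered)
```

In this toy data "book" and "story" occur equally often in both classes, so they carry no information about the label and are removed, while "great" and "bad" are kept.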

Thermal pointing error analysis of the observation satellites with interpolated temperature based on PAT method (PAT 기반 온도장 보간을 이용한 관측위성의 열지향오차해석)

  • Lim, Jae Hyuk;Kim, Sun-Won;Kim, Jeong-Hoon;Kim, Chang-Ho;Jun, Hyoung-Yoll;Oh, Hyeon Cheol;Shin, Chang Min;Lee, Byung Chai
    • Journal of the Korean Society for Aeronautical & Space Sciences / v.44 no.1 / pp.80-87 / 2016
  • In this work, we conduct a thermal pointing error analysis of observation satellites that accounts for seasonal and daily temperature variation, using temperatures interpolated with the prescribed average temperature (PAT) method. A temperature excursion of up to 200 degrees is applied to observation satellites during on-orbit operation, which causes the line of sight (LOS) to deviate from the designated pointing direction due to thermo-elastic deformation. To predict and correct this deviation, a thermo-elastic deformation analysis is performed on a fine structural finite element model, using interpolated thermal maps calculated from the results of an on-station thermal analysis with a coarse thermal model. After verifying the temperatures interpolated by PAT on two benchmark problems, we evaluate the thermal pointing error.

Multiple Cause Model-based Topic Extraction and Semantic Kernel Construction from Text Documents (다중요인모델에 기반한 텍스트 문서에서의 토픽 추출 및 의미 커널 구축)

  • 장정호;장병탁
    • Journal of KIISE:Software and Applications / v.31 no.5 / pp.595-604 / 2004
  • Automatic analysis of concepts or semantic relations in text documents enables not only efficient acquisition of relevant information but also comparison of documents at the concept level. We present a multiple cause model-based approach to text analysis, in which latent topics are automatically extracted from document sets and the similarity between documents is measured by semantic kernels constructed from the extracted topics. In our approach, a document is assumed to be generated by various combinations of underlying topics. A topic is defined by a set of words that are related to the same topic or co-occur frequently within a document. In a network representing a multiple-cause model, each topic is identified by a group of words having high connection weights from a latent node. To facilitate learning and inference in multiple-cause models, approximation methods are required, and we use an approximation by Helmholtz machines. In an experiment on the TDT-2 data set, we extract sets of meaningful words, where each set contains some theme-specific terms. Using semantic kernels constructed from latent topics extracted by multiple cause models, we also achieve significant improvements over the basic vector space model in terms of retrieval effectiveness.
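
How a topic-based kernel can relate documents that share no words is easy to show with a generic sketch. The abstract does not give the kernel's form, so the code below assumes the simplest construction (project each bag of words onto topic weights, then take an inner product in topic space); the topic weights are hand-written rather than learned by a multiple cause model, and all names are hypothetical.

```python
def project(doc_counts, topic_weights):
    """Map a bag-of-words dict to a list of topic activations."""
    return [
        sum(weights.get(term, 0.0) * count for term, count in doc_counts.items())
        for weights in topic_weights
    ]


def semantic_kernel(d1, d2, topic_weights):
    """Inner product of two documents in the topic space."""
    p1 = project(d1, topic_weights)
    p2 = project(d2, topic_weights)
    return sum(a * b for a, b in zip(p1, p2))


def word_overlap(d1, d2):
    """Plain vector space model inner product, for comparison."""
    return sum(count * d2.get(term, 0) for term, count in d1.items())


# Hand-written topic-to-word weights (assumed, not learned).
topic_weights = [
    {"election": 1.0, "vote": 1.0},
    {"storm": 1.0, "flood": 1.0},
]
doc_a = {"election": 2}
doc_b = {"vote": 1}
doc_c = {"flood": 3}

print(word_overlap(doc_a, doc_b), semantic_kernel(doc_a, doc_b, topic_weights))
print(word_overlap(doc_a, doc_c), semantic_kernel(doc_a, doc_c, topic_weights))
```

The first pair has no words in common, so the plain inner product is zero, yet the kernel is positive because both documents activate the same topic; the second pair stays at zero under both measures.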

A Study on the Construction of Indoor Spatial Information using a Terrestrial LiDAR (지상라이다를 이용한 지하철 역사의 3D 실내공간정보 구축방안 연구)

  • Go, Jong Sik;Jeong, In Hun;Shin, Han Sup;Choi, Yun Soo;Cho, Seong Kil
    • Spatial Information Research / v.21 no.3 / pp.89-101 / 2013
  • The importance of indoor space has recently been rising as advances in building technology produce larger and more complex buildings. Accordingly, the target area of spatial information services is rapidly expanding from outdoor space to indoor space. Various demands for indoor spatial information are expected to arise in the future through the development of advanced technologies such as mobile IT and their convergence with various fields. This research therefore reviews the available methods for building indoor spatial information and then builds high-accuracy three-dimensional indoor spatial information using a high-accuracy indoor laser survey and a 3D vector processing technique. The accuracy of the resulting 3D indoor model was evaluated by overlap analysis against a digital map, and the result showed a positional accuracy within 0.04 m on the x-axis and 0.06 m on the y-axis. This result could be used as fundamental data for building indoor spatial data and for the integrated use of indoor and outdoor spatial information.

Detail Focused Image Classifier Model for Traditional Images (전통문화 이미지를 위한 세부 자질 주목형 이미지 자동 분석기)

  • Kim, Kuekyeng;Hur, Yuna;Kim, Gyeongmin;Yu, Wonhee;Lim, Heuiseok
    • Journal of the Korea Convergence Society / v.8 no.12 / pp.85-92 / 2017
  • As accessibility to traditional cultural content falls behind its growing production, higher accessibility is needed for continued management and research. To this end, this paper introduces an image classifier model for traditional images based on artificial neural networks. The model converts the features of the input image into a vector space and uses an RNN-based model to recognize and compare the details of the input, which enables the classification of traditional images. By focusing on details, the classifier can classify similar-looking traditional images more precisely. To train the model, a wide range of images was collected and arranged according to the format of the Korean information culture field, which can contribute to other research in fields that use traditional cultural images. This research also contributes to stimulating demand, supply, and research related to traditional culture.