• Title/Summary/Keyword: New classification system

Search Results: 1,131

The Accuracy of Tuberculosis Notification Reports at a Private General Hospital after Enforcement of New Korean Tuberculosis Surveillance System (새로운 국가결핵감시체계 시행 후 한 민간종합병원에서 작성된 결핵정보관리보고서의 정확도 조사)

  • Kim, Cheol Hong;Koh, Won-Jung;Kwon, O Jung;Ahn, Young Mee;Lim, Seong Young;An, Chang Hyeok;Youn, Jong Wook;Hwang, Jung Hye;Suh, Gee Young;Chung, Man Pyo;Kim, Hojoong
    • Tuberculosis and Respiratory Diseases
    • /
    • v.54 no.2
    • /
    • pp.178-190
    • /
    • 2003
  • Background : The committee for tuberculosis (TB) survey planning for the year 2000 decided to construct the Korean Tuberculosis Surveillance System (KTBS), based on a doctor's routine reporting method. The success of the KTBS relies on the precision of the recorded TB notification forms. The purpose of this study was to determine the accuracy of the TB notification forms written at a private general hospital and given to the corresponding health center, and to improve the comprehensiveness of this reporting system. Materials and Methods : 291 adult TB patients who had been diagnosed from August 2000 to January 2001 were enrolled in this study. The TB notification forms were compared with the medical records and the various laboratory results for case characteristics, history of previous treatment, examinations for diagnosis, site of the TB by the international classification of disease, and treatment. Results : Among the examinations for diagnosis in 222 pulmonary TB patients, the concordance rate of the 'sputum smear exam' was 76%, but that of the 'sputum culture exam' was only 23%. Among the 198 cases of the sputum culture exam labeled 'not examined', 43 (21.7%) proved to be truly 'not examined', 70 (35.4%) were proven to be 'culture positive', and 85 (43.0%) were proven to be 'culture negative'. Among the examinations for diagnosis in 69 extrapulmonary TB patients, the concordance rate of the 'smear exam other than sputum' was 54%. For treatment, the overall concordance rate of the 'type of registration' in the TB notification form was 85%. Among the 246 'new' cases on the TB notification forms, 217 (88%) were true 'new' cases, while 13 were proven to be 'relapse', 2 'treatment after failure', one 'treatment after default', 12 'transferred-in', and one 'chronic'. Among the 204 patients prescribed the HREZ regimen, 172 (84.3%) were actually taking the HREZ regimen, and the others were prescribed other drug regimens. Conclusion : Correct recording of the TB notification form in the private sector is necessary to support an effective TB surveillance system in Korea.
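The accuracy check described above reduces to comparing each reported field against the medical record and counting agreements. A minimal sketch of that concordance computation, using hypothetical field names and record pairs rather than the study's actual data schema, might look like this:

```python
# Hypothetical sketch: concordance rate between notification forms and medical records.
# Field names and records are illustrative, not the study's actual data.

def concordance_rate(notifications, records, field):
    """Share of patients whose notification entry matches the medical record for `field`."""
    pairs = [(n[field], r[field]) for n, r in zip(notifications, records)]
    matches = sum(1 for reported, actual in pairs if reported == actual)
    return matches / len(pairs) if pairs else 0.0

notifications = [
    {"sputum_culture": "not examined"},
    {"sputum_culture": "positive"},
    {"sputum_culture": "not examined"},
]
records = [
    {"sputum_culture": "positive"},   # reported 'not examined' but actually culture positive
    {"sputum_culture": "positive"},
    {"sputum_culture": "negative"},
]

print(f"sputum culture concordance: {concordance_rate(notifications, records, 'sputum_culture'):.0%}")
```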

Development of validated Nursing Interventions for Home Health Care to Women who have had a Caesarian Delivery (조기퇴원 제왕절개 산욕부를 위한 가정간호 표준서 개발)

  • HwangBo, Su-Ja
    • Journal of Korean Academy of Nursing Administration
    • /
    • v.6 no.1
    • /
    • pp.135-146
    • /
    • 2000
  • The purpose of this study was to develop, based on the Nursing Intervention Classification (NIC) system, a set of standardized, validated nursing interventions and their associated activities for use with nursing diagnoses related to home health care for women who have had a caesarian delivery and for their newborn babies. This descriptive study for instrument development had three phases: first, selection of nursing diagnoses; second, validation of the preliminary home health care interventions; and third, application of the home care interventions. In the first phase, diagnoses from 30 nursing records of clients of the home health care agency at P. medical center who were seen between April 21 and July 30, 1998, and from 5 textbooks were examined. Ten nursing diagnoses were selected through a comparison with the NANDA (North American Nursing Diagnosis Association) classification. In the second phase, using the selected diagnoses, the nursing interventions were defined from the diagnosis-intervention linkage lists along with the associated activities for each intervention list in NIC. To develop the preliminary interventions, five rounds of expert testing were done. During the first four rounds, 5 experts in clinical nursing participated, and for the final content validity test of the preliminary interventions, 13 experts participated using Fehring's Delphi technique. The expert group evaluated and defined the set of preliminary nursing interventions. In the third phase, clinical tests were held in a home health care setting with two home health care nurses using the preliminary intervention list as a questionnaire. Thirty clients referred to the home health care agency at P. medical center between October 1998 and March 1999 were the subjects for this phase. Each of the activities was tested using a dichotomous question method. The results of the study are as follows: 1. For the ten nursing diagnoses, 63 appropriate interventions were selected from the 369 diagnosis-intervention links in NIC and from the 1,465 associated nursing activities. From the 63 interventions, the nurse expert group developed 18 interventions and 258 activities as the preliminary intervention list through the five-round validity test. 2. For the fifth content validity test using Fehring's model for determining ICV (Intervention Content Validity), a five-point Likert scale was used with values converted to weights as follows: 1 = 0.0, 2 = 0.25, 3 = 0.50, 4 = 0.75, 5 = 1.0. Activities scoring less than 0.50 were to be deleted. The ICV scores ranged from 0.66 to 0.95 for the nursing diagnoses, from 0.77 to 0.98 for the nursing interventions, and from 0.85 to 0.95 for the nursing activities; by Fehring's method, all of these were included in the preliminary intervention list. 3. Using a questionnaire format for the preliminary intervention list, clinical application tests were done. To define the nursing diagnoses, the home health care nurses applied each nursing diagnosis to every client, and 13 diagnoses were found to be used most frequently among the 400 diagnosis applications. Therefore, 13 nursing diagnoses were defined as validated nursing diagnoses: ten were the same as those from the nursing records and textbooks, and three were new from the clinical application. The final list included 'Anxiety', 'Aspiration, risk for', 'Infant behavior, potential for enhanced, organized', 'Infant feeding pattern, ineffective', 'Infection', 'Knowledge deficit', 'Nutrition, less than body requirements, altered', 'Pain', 'Parenting', 'Skin integrity, risk for, impaired', 'Risk for activity intolerance', 'Self-esteem disturbance', and 'Sleep pattern disturbance'. 4. In all, there were 19 interventions: the 18 preliminary nursing interventions and one more intervention added from the clinical setting, 'Body image enhancement'. For the 265 associated nursing activities, clinical application tests were also done. The implementation rate of the 19 interventions ranged from 81.6% to 100%, so all 19 interventions were included in the validated intervention set. Of the 265 nursing activities, 261 (98.5%) were accepted, and four activities, those with an implementation rate of less than 50%, were deleted. 5. In conclusion, 13 diagnoses, 19 interventions, and 261 activities were validated for the final validated nursing intervention set.
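Fehring's ICV computation described in result 2 is a simple weighted average of Likert ratings with a 0.50 cutoff. A minimal sketch, in which the weight mapping and cutoff come from the abstract but the expert ratings and activity names are hypothetical:

```python
# Hypothetical sketch of Fehring's Intervention Content Validity (ICV) scoring:
# 5-point Likert ratings are mapped to weights 0.0/0.25/0.50/0.75/1.0, averaged per
# activity, and activities scoring below 0.50 are dropped.

LIKERT_WEIGHTS = {1: 0.0, 2: 0.25, 3: 0.50, 4: 0.75, 5: 1.0}

def icv_score(ratings):
    """Mean weight of the experts' Likert ratings for one activity."""
    return sum(LIKERT_WEIGHTS[r] for r in ratings) / len(ratings)

# Illustrative ratings from 13 experts for two activities (not the study's data).
activities = {
    "Monitor lochia amount and character": [5, 5, 4, 5, 4, 5, 5, 4, 5, 5, 4, 5, 5],
    "Review unrelated documentation":      [2, 1, 2, 3, 2, 1, 2, 2, 3, 2, 1, 2, 2],
}

for name, ratings in activities.items():
    score = icv_score(ratings)
    verdict = "keep" if score >= 0.50 else "delete"
    print(f"{name}: ICV = {score:.2f} -> {verdict}")
```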


Energy expenditure of physical activity in Korean adults and assessment of accelerometer accuracy by gender (성인의 13가지 신체활동의 에너지 소비량 및 가속도계 정확성의 남녀비교)

  • Choi, Yeon-jung;Ju, Mun-jeong;Park, Jung-hye;Park, Jong-hoon;Kim, Eun-kyung
    • Journal of Nutrition and Health
    • /
    • v.50 no.6
    • /
    • pp.552-564
    • /
    • 2017
  • Purpose: The purpose of this study was to measure the energy expenditure (EE) and metabolic equivalents (METs) of 13 common physical activities using a portable telemetry gas exchange system (K4b²) and to assess the accuracy of an accelerometer (Actigraph GT3X+) by gender in Korean adults. Methods: A total of 109 adults (54 males, 55 females) with normal BMI (body mass index) participated in this study. EE and METs of the 13 selected activities were simultaneously measured by the K4b² portable indirect calorimeter and predicted by the GT3X+ Actigraph accelerometer. The accuracy of the accelerometer was assessed by comparing the predicted with the measured EE and METs. Results: EE (kcal/kg/hr) and METs of treadmill walking (3.2 km/h, 4.8 km/h, and 5.6 km/h) and running (6.4 km/h) were significantly higher in female than in male participants (p < 0.05). The accelerometer significantly underestimated EE and METs for all activities except descending stairs, moderate walking, and fast walking in males, and descending stairs in females. Low intensity activities had the highest rate of accurate classification (88.3% in males and 91.3% in females), whereas vigorous intensity activities had the lowest rate (43.6% in males and 27.7% in females). Across all activities, the rate of accurate classification was significantly higher in males than in females (75.2% vs. 58.3%, respectively; p < 0.01). The error between the accelerometer and the K4b² was smaller in males than in females, and EE and METs were more accurately estimated during treadmill activities than other activities in both males and females. Conclusion: The accelerometer underestimated EE and METs across various activities in Korean adults. In addition, there appears to be a gender difference in the rate of accurate accelerometer classification of activities according to intensity. Our results indicate the need to develop new accelerometer equations for this population, with gender differences taken into account.
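The accuracy assessment above boils down to comparing accelerometer-predicted METs with measured METs and checking whether both fall into the same intensity category. A minimal sketch with illustrative MET values and the conventional cut-points (light < 3, moderate 3-6, vigorous >= 6 METs), which are assumed here rather than taken from the paper:

```python
# Hypothetical sketch: agreement between measured (K4b2) and accelerometer-predicted METs.

def intensity(mets):
    if mets < 3.0:
        return "light"
    if mets < 6.0:
        return "moderate"
    return "vigorous"

measured  = [2.1, 3.4, 4.6, 6.8, 7.5]   # indirect calorimetry (illustrative values)
predicted = [2.0, 3.1, 3.8, 5.2, 6.9]   # accelerometer estimate (illustrative values)

bias = sum(p - m for p, m in zip(predicted, measured)) / len(measured)
agreement = sum(intensity(p) == intensity(m) for p, m in zip(predicted, measured)) / len(measured)

print(f"mean bias (predicted - measured): {bias:+.2f} METs")   # negative -> underestimation
print(f"correct intensity classification: {agreement:.0%}")
```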

Development of a Detection Model for the Companies Designated as Administrative Issue in KOSDAQ Market (KOSDAQ 시장의 관리종목 지정 탐지 모형 개발)

  • Shin, Dong-In;Kwahk, Kee-Young
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.3
    • /
    • pp.157-176
    • /
    • 2018
  • The purpose of this research is to develop a detection model for companies designated as administrative issue in the KOSDAQ market using financial data. Administrative issue designation flags companies at high risk of delisting and gives them time to resolve the grounds for delisting under certain restrictions of the Korean stock market. It acts as an alarm informing investors and market participants which companies are likely to be delisted and warns them to invest safely. Despite this importance, there are relatively few studies on administrative issue prediction models compared with the many studies on bankruptcy prediction models. Therefore, this study develops and verifies a detection model for companies designated as administrative issue using financial data of KOSDAQ companies. In this study, logistic regression and decision tree are proposed as the data mining models for detecting administrative issues. According to the results of the analysis, the logistic regression model predicted the companies designated as administrative issue using three variables - ROE (earnings before tax), cash flows/shareholders' equity, and asset turnover ratio - and its overall accuracy was 86% for the validation dataset. The decision tree (Classification and Regression Trees, CART) model applied classification rules using cash flows/total assets and ROA (net income), and its overall accuracy reached 87%. Implications of the financial indicators selected by our logistic regression and decision tree models are as follows. First, ROE (earnings before tax) in the logistic detection model reflects the profit and loss of the business segments that will continue, excluding the revenue and expenses of discontinued businesses. A weakening of this variable therefore means that the competitiveness of the core business is weakened. If a large part of the profits is generated from one-off items, it is very likely that the deterioration of business management will intensify further. As the ROE of a KOSDAQ company decreases significantly, it becomes highly likely that the company will be delisted. Second, cash flows to shareholders' equity represents the firm's ability to generate cash flow when the financial condition of its subsidiaries is excluded. In other words, a weakening of the management capacity of the parent company, excluding the subsidiaries' competence, can be a main reason for an increased possibility of administrative issue designation. Third, a low asset turnover ratio means that current and non-current assets are used ineffectively by the corporation, or that its asset investment is excessive. If the asset turnover ratio of a KOSDAQ-listed company decreases, it is necessary to examine corporate activities in detail from various perspectives, such as weakening sales or changes in inventories. Cash flows/total assets, the variable selected by the decision tree detection model, is a key indicator of the company's cash condition and its ability to generate cash from operating activities. Cash flow indicates whether a firm can perform its main activities (maintaining its operating ability, repaying debts, paying dividends, and making new investments) without relying on external financial resources. Therefore, if this variable is negative (-), it indicates that the company may have serious problems in its business activities. If the cash flow from operating activities of a specific company is smaller than its net profit, the net profit has not been converted into cash, indicating a serious problem in managing the company's trade receivables and inventory assets. Therefore, as cash flows/total assets decrease, the probability of administrative issue designation and the probability of delisting increase. In summary, the logistic regression-based detection model in this study was found to be affected by the company's financial activities, including ROE (earnings before tax), whereas the decision tree-based detection model predicts the designation based on the company's cash flows.
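A minimal sketch of the two detection models described above, using scikit-learn. The financial-ratio features follow the variables named in the abstract, but the data here is synthetic and the preprocessing, variable selection, and class imbalance handling of the actual study are not reproduced:

```python
# Hypothetical sketch: logistic regression and CART models for detecting
# administrative-issue designation from financial ratios (synthetic data only).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))  # ROE, cash flows / equity, asset turnover (standardized, synthetic)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.8, size=n) < -1.0).astype(int)  # 1 = designated

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

logit = LogisticRegression().fit(X_tr, y_tr)
cart = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

print("logistic regression accuracy:", accuracy_score(y_te, logit.predict(X_te)))
print("decision tree (CART) accuracy:", accuracy_score(y_te, cart.predict(X_te)))
```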

Anomaly Detection for User Action with Generative Adversarial Networks (적대적 생성 모델을 활용한 사용자 행위 이상 탐지 방법)

  • Choi, Nam woong;Kim, Wooju
    • Journal of Intelligence and Information Systems
    • /
    • v.25 no.3
    • /
    • pp.43-62
    • /
    • 2019
  • At one time, the anomaly detection field relied on determining whether there was an abnormality based on statistics derived from specific data. This methodology was feasible because the dimensionality of data was simple in the past, so classical statistical methods worked effectively. However, as the characteristics of data have become more complex in the era of big data, it has become more difficult to accurately analyze and predict, in the conventional way, the data that occurs throughout industry. Supervised learning algorithms such as SVM and decision trees were therefore adopted. However, supervised learning-based models can predict test data accurately only when the class distribution is balanced, whereas most data generated in industry has imbalanced classes, so the predicted results are not always valid when a supervised learning model is applied. To overcome these drawbacks, many studies now use unsupervised learning-based models that are not influenced by the class distribution, such as autoencoders or generative adversarial networks. In this paper, we propose a method to detect anomalies using generative adversarial networks. AnoGAN, introduced in the study of Thomas et al. (2017), is a model that performs anomaly detection on medical images; it is composed of convolutional neural networks and was used in the detection field. By contrast, anomaly detection for sequence data using generative adversarial networks has received little attention compared with image data. Li et al. (2018) proposed a model based on LSTM, a type of recurrent neural network, to classify anomalies in numerical sequence data, but it has not been applied to categorical sequence data, nor has the feature matching method applied by Salimans et al. (2016). This suggests that much remains to be studied on anomaly detection for sequence data with generative adversarial networks. To learn the sequence data, the generative adversarial network is built from LSTMs: the generator is a 2-layer stacked LSTM with 32-dimensional and 64-dimensional hidden unit layers, and the discriminator is an LSTM with a 64-dimensional hidden unit layer. In existing work on anomaly detection for sequence data, anomaly scores are derived from entropy values of the probability of the actual data; in this paper, as mentioned earlier, anomaly scores are derived using the feature matching technique. In addition, the process of optimizing the latent variables was designed with an LSTM to improve model performance. The modified generative adversarial model was more accurate than the autoencoder in all experiments in terms of precision and was approximately 7% higher in accuracy. In terms of robustness, the generative adversarial network also performed better than the autoencoder: because generative adversarial networks learn the data distribution from real categorical sequence data, they are not dominated by a single normal pattern, whereas the autoencoder is. The robustness test showed that the accuracy of the autoencoder was 92% and that of the generative adversarial network was 96%; in terms of sensitivity, the autoencoder reached 40% and the generative adversarial network 51%. Experiments were also conducted to show how much performance changes with differences in the optimization structure of the latent variables; as a result, sensitivity improved by about 1%. These results suggest a new perspective on optimizing latent variables, which had previously received relatively little attention.
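A hypothetical PyTorch sketch of the architecture described above: an LSTM-based GAN in which the anomaly score is the feature-matching distance between discriminator features of a real sequence and of a generated one. The hidden-layer sizes follow the abstract; the sequence length, feature dimension, and latent dimension are assumptions, and the paper's LSTM-based latent-variable optimization step is omitted (the latent code is simply sampled here):

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=16, out_dim=8):
        super().__init__()
        self.lstm1 = nn.LSTM(latent_dim, 32, batch_first=True)  # 32-dim hidden layer
        self.lstm2 = nn.LSTM(32, 64, batch_first=True)           # 64-dim hidden layer
        self.out = nn.Linear(64, out_dim)

    def forward(self, z):                      # z: (batch, seq_len, latent_dim)
        h, _ = self.lstm1(z)
        h, _ = self.lstm2(h)
        return self.out(h)                     # generated sequence

class Discriminator(nn.Module):
    def __init__(self, in_dim=8):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, 64, batch_first=True)        # 64-dim hidden layer
        self.head = nn.Linear(64, 1)

    def features(self, x):                     # intermediate features for feature matching
        h, _ = self.lstm(x)
        return h[:, -1, :]                     # last time step, 64-dim

    def forward(self, x):
        return torch.sigmoid(self.head(self.features(x)))

def anomaly_score(x, G, D, latent_dim=16):
    """Feature-matching distance between a real sequence and a generated counterpart."""
    z = torch.randn(x.size(0), x.size(1), latent_dim)   # no latent optimization in this sketch
    with torch.no_grad():
        f_real = D.features(x)
        f_fake = D.features(G(z))
    return torch.norm(f_real - f_fake, dim=1)  # higher score -> more anomalous

G, D = Generator(), Discriminator()
x = torch.randn(4, 20, 8)                      # batch of 4 sequences, length 20, 8 features
print(anomaly_score(x, G, D))
```

In practice the generator and discriminator would first be trained adversarially on normal sequences; the score above is only meaningful for a trained pair.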

Multi-Vector Document Embedding Using Semantic Decomposition of Complex Documents (복합 문서의 의미적 분해를 통한 다중 벡터 문서 임베딩 방법론)

  • Park, Jongin;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.25 no.3
    • /
    • pp.19-41
    • /
    • 2019
  • According to the rapidly increasing demand for text data analysis, research and investment in text mining are being actively conducted not only in academia but also in various industries. Text mining is generally conducted in two steps. In the first step, the text of the collected documents is tokenized and structured to convert the original documents into a computer-readable form. In the second step, tasks such as document classification, clustering, and topic modeling are conducted according to the purpose of the analysis. Until recently, text mining-related studies have focused on applications in the second step, such as document classification, clustering, and topic modeling. However, with the discovery that the text structuring process substantially influences the quality of the analysis results, various embedding methods have been actively studied to improve the quality of analysis results by preserving the meaning of words and documents when representing text data as vectors. Unlike structured data, to which a variety of operations and traditional analysis techniques can be applied directly, unstructured text must first go through a structuring task that transforms the original document into a form the computer can understand. Mapping arbitrary objects into a space of a specific dimension while maintaining their algebraic properties, in order to structure text data, is called "embedding". Recently, attempts have been made to embed not only words but also sentences, paragraphs, and entire documents in various ways. In particular, as the demand for document embedding increases rapidly, many algorithms have been developed to support it. Among them, doc2Vec, which extends word2Vec and embeds each document into one vector, is the most widely used. However, the traditional document embedding method represented by doc2Vec generates a vector for each document using the whole corpus included in the document. As a result, the document vector is affected not only by core words but also by miscellaneous words. Additionally, traditional document embedding schemes usually map each document into a single corresponding vector, so it is difficult to accurately represent a complex document with multiple subjects as a single vector using the traditional approach. In this paper, we propose a new multi-vector document embedding method to overcome these limitations of traditional document embedding methods. This study targets documents that explicitly separate body content and keywords. For a document without keywords, the method can be applied after extracting keywords through various analysis methods; however, since this is not the core subject of the proposed method, we describe the process of applying the proposed method to documents with predefined keywords. The proposed method consists of (1) Parsing, (2) Word Embedding, (3) Keyword Vector Extraction, (4) Keyword Clustering, and (5) Multiple-Vector Generation. The specific process is as follows: all text in a document is tokenized, and each token is represented as a vector of N-dimensional real values through word embedding. Then, to overcome the limitation of the traditional document embedding method of being affected not only by core words but also by miscellaneous words, the vectors corresponding to the keywords of each document are extracted to form a set of keyword vectors for each document. Next, clustering is conducted on the set of keywords for each document to identify the multiple subjects included in the document. Finally, a multi-vector is generated from the vectors of the keywords constituting each cluster. Experiments on 3,147 academic papers revealed that the single vector-based traditional approach cannot properly map complex documents because of interference among subjects within each vector. With the proposed multi-vector based method, we ascertained that complex documents can be vectorized more accurately by eliminating the interference among subjects.
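A minimal sketch of the five-step pipeline above, with gensim's Word2Vec for step (2) and scikit-learn's KMeans standing in for the (unspecified) clustering algorithm of step (4); the toy corpus, keyword lists, and number of subjects are illustrative assumptions:

```python
# Hypothetical sketch: word embedding, keyword vector extraction, keyword clustering,
# and multi-vector generation as cluster centroids.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

documents = [
    {"tokens": ["deep", "learning", "image", "network", "stock", "market", "price"],
     "keywords": ["deep", "learning", "image", "stock", "market"]},
]

# (2) Word embedding over the whole corpus.
w2v = Word2Vec([d["tokens"] for d in documents], vector_size=50, window=5, min_count=1, seed=0)

def multi_vectors(doc, n_subjects=2):
    # (3) Keyword vector extraction.
    kw_vecs = np.array([w2v.wv[k] for k in doc["keywords"] if k in w2v.wv])
    # (4) Keyword clustering to identify multiple subjects.
    n_clusters = min(n_subjects, len(kw_vecs))
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(kw_vecs)
    # (5) One vector per subject: the centroid of each keyword cluster.
    return [kw_vecs[labels == c].mean(axis=0) for c in range(n_clusters)]

vectors = multi_vectors(documents[0])
print(len(vectors), "subject vectors of dimension", vectors[0].shape[0])
```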

A Study on the Application of IUCN Global Ecosystem Typology Using Land Cover Map in Korea (토지피복지도를 활용한 IUCN 생태계유형분류 국내 적용)

  • Hee-Jung Sohn;Su-Yeon Won;Jeong-Eun Jeon;Eun-Hee Park;Do-Hee Kim;Sang-Hak Han;Young-Keun Song
    • Korean Journal of Environment and Ecology
    • /
    • v.37 no.3
    • /
    • pp.209-220
    • /
    • 2023
  • Over the past few centuries, widespread changes to natural ecosystems caused by human activities have severely threatened biodiversity worldwide. Understanding changes in ecosystems is essential to identifying and managing threats to biodiversity. In line with this need, the IUCN Council formed the IUCN Global Ecosystem Typology (GET) in 2019, taking into account the functions and types of ecosystems. The IUCN provides maps of 10 ecosystem groups and 108 ecological functional groups (EFGs) on a global scale. According to the IUCN GET classification, Korea's ecosystems fall into 8 types of Realm (level 1), 18 types of Biome (level 2), and 41 types of Group (level 3). The GET maps provided by the IUCN have low resolution and often do not match the actual land status because they were produced at a global scale. This study aimed to increase the accuracy of the Korean IUCN GET type classification by using land cover maps to produce maps that reflect the actual situation. To this end, we ① reviewed the Korean GET data system provided by the IUCN GET and ② compared and analyzed it against the current situation in Korea. Through this process we evaluated the limitations and usability of the GET, and then ③ classified Korea's GET types anew, reflecting the current situation in Korea and using national data as much as possible. This study classified Korean GETs into 25 types using land cover maps and existing national data (Terrestrial: 9, Freshwater: 9, Marine-terrestrial: 5, Terrestrial-freshwater: 1, and Marine-freshwater-terrestrial: 1). Compared to the existing map, "F3.2 Constructed lacustrine wetlands", "F3.3 Rice paddies", "F3.4 Freshwater aquafarms", and "T7.3 Plantations" showed the largest area reductions in the modified Korean GET. The area of "T2.2 Temperate Forests" showed the largest increase, and the "MFT1.3 Coastal saltmarshes and reedbeds" and "F2.2 Small permanent freshwater lakes" types also increased in area after modification. Through this process, the existing map, in which the sum of all EFGs accounted for 8.33 times the national area, was modified using the land cover map so that the total sum became 1.22 times the national area. This study confirmed that the existing EFGs, which had small differences by type and low accuracy, were improved and corrected. This study is significant in that it produced a GET map of Korea that meets the GET standard using data reflecting field conditions.
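The core correction step is a lookup from land-cover classes to GET ecological functional groups, followed by an area check against the national total (the 8.33x versus 1.22x ratio mentioned above). A minimal sketch in which the EFG codes are taken from the abstract but the land-cover classes, the mapping, and the areas are entirely hypothetical:

```python
# Hypothetical sketch: reassigning land-cover classes to GET ecological functional
# groups (EFGs) and checking the summed EFG area against the national area.

LANDCOVER_TO_EFG = {
    "broadleaf_forest": "T2.2 Temperate Forests",
    "rice_paddy": "F3.3 Rice paddies",
    "reservoir": "F3.2 Constructed lacustrine wetlands",
    "tidal_flat": "MFT1.3 Coastal saltmarshes and reedbeds",
}

# Area (km^2) of each land-cover class; illustrative values only.
landcover_area = {"broadleaf_forest": 45000.0, "rice_paddy": 11000.0,
                  "reservoir": 900.0, "tidal_flat": 2500.0}
NATIONAL_AREA = 100410.0  # km^2, approximate area of South Korea

efg_area = {}
for lc, area in landcover_area.items():
    efg = LANDCOVER_TO_EFG[lc]
    efg_area[efg] = efg_area.get(efg, 0.0) + area

ratio = sum(efg_area.values()) / NATIONAL_AREA
print(f"sum of EFG areas = {ratio:.2f} x national area")  # a non-overlapping mapping stays near 1
```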

A New Approach to Automatic Keyword Generation Using Inverse Vector Space Model (키워드 자동 생성에 대한 새로운 접근법: 역 벡터공간모델을 이용한 키워드 할당 방법)

  • Cho, Won-Chin;Rho, Sang-Kyu;Yun, Ji-Young Agnes;Park, Jin-Soo
    • Asia pacific journal of information systems
    • /
    • v.21 no.1
    • /
    • pp.103-122
    • /
    • 2011
  • Recently, numerous documents have been made available electronically. Internet search engines and digital libraries commonly return query results containing hundreds or even thousands of documents. In this situation, it is virtually impossible for users to examine complete documents to determine whether they might be useful. For this reason, some on-line documents are accompanied by a list of keywords specified by the authors in an effort to guide users by facilitating the filtering process. In this way, a set of keywords is often considered a condensed version of the whole document and therefore plays an important role in document retrieval, Web page retrieval, document clustering, summarization, text mining, and so on. Since many academic journals ask the authors to provide a list of five or six keywords on the first page of an article, keywords are most familiar in the context of journal articles. However, many other types of documents, including Web pages, email messages, news reports, magazine articles, and business papers, could also benefit from the use of keywords. Although the potential benefit is large, the implementation itself is the obstacle; manually assigning keywords to all documents is a daunting task, or even impractical, in that it is extremely tedious and time-consuming and requires a certain level of domain knowledge. Therefore, it is highly desirable to automate the keyword generation process. There are mainly two approaches to achieving this aim: the keyword assignment approach and the keyword extraction approach. Both approaches use machine learning methods and require, for training purposes, a set of documents with keywords already attached. In the former approach, there is a given set of vocabulary, and the aim is to match its terms to the texts. In other words, the keyword assignment approach seeks to select the words from a controlled vocabulary that best describe a document. Although this approach is domain dependent and is not easy to transfer and expand, it can generate implicit keywords that do not appear in a document. In the latter approach, the aim is to extract keywords with respect to their relevance in the text without a prior vocabulary. In this approach, automatic keyword generation is treated as a classification task, and keywords are commonly extracted based on supervised learning techniques. Thus, keyword extraction algorithms classify candidate keywords in a document into positive or negative examples. Several systems, such as Extractor and Kea, were developed using the keyword extraction approach. The most indicative words in a document are selected as its keywords, and as a result, keyword extraction is limited to terms that appear in the document; it cannot generate implicit keywords that are not included in the document. According to the experimental results of Turney, about 64% to 90% of the keywords assigned by authors can be found in the full text of an article. Conversely, this means that 10% to 36% of the keywords assigned by authors do not appear in the article and cannot be generated by keyword extraction algorithms. Our preliminary experiment also shows that 37% of the keywords assigned by authors are not included in the full text. This is why we have decided to adopt the keyword assignment approach. In this paper, we propose a new approach to automatic keyword assignment, namely IVSM (Inverse Vector Space Model). The model is based on the vector space model, which is a conventional information retrieval model that represents documents and queries by vectors in a multidimensional space. IVSM generates an appropriate keyword set for a specific document by measuring the distance between the document and the keyword sets. The keyword assignment process of IVSM is as follows: (1) calculating the vector length of each keyword set based on each keyword weight; (2) preprocessing and parsing a target document that does not have keywords; (3) calculating the vector length of the target document based on term frequency; (4) measuring the cosine similarity between each keyword set and the target document; and (5) generating keywords that have high similarity scores. Two keyword generation systems were implemented applying IVSM: an IVSM system for a Web-based community service and a stand-alone IVSM system. First, the IVSM system was implemented in a community service for sharing knowledge and opinions on current trends such as fashion, movies, social problems, and health information. The stand-alone IVSM system is dedicated to generating keywords for academic papers and has been tested on a number of academic papers, including those published by the Korean Association of Shipping and Logistics, the Korea Research Academy of Distribution Information, the Korea Logistics Society, the Korea Logistics Research Association, and the Korea Port Economic Association. We measured the performance of IVSM by the number of matches between the IVSM-generated keywords and the author-assigned keywords. According to our experiment, the precision of IVSM applied to the Web-based community service and to academic journals was 0.75 and 0.71, respectively. The performance of both systems is much better than that of baseline systems that generate keywords based on simple probability. IVSM also shows comparable performance to Extractor, a representative keyword extraction system developed by Turney. As electronic documents increase, we expect that the IVSM proposed in this paper can be applied to many electronic documents in Web-based communities and digital libraries.
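A minimal sketch of the five-step IVSM assignment loop above. The keyword sets and the target document are toy examples, and plain term-frequency vectors from scikit-learn's CountVectorizer stand in for the keyword-weight scheme of step (1):

```python
# Hypothetical sketch: build a vector for each keyword set, vectorize a target
# document by term frequency, and rank keywords by cosine similarity to the document.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each keyword is represented by the text previously associated with it (illustrative).
keyword_sets = {
    "logistics":    "port shipping container freight distribution warehouse",
    "text mining":  "document keyword corpus classification clustering retrieval",
    "supply chain": "supplier inventory distribution demand procurement",
}

target_document = ("this paper proposes a document retrieval and classification "
                   "method based on keyword clustering")

vectorizer = CountVectorizer()
keyword_matrix = vectorizer.fit_transform(keyword_sets.values())   # (1) keyword-set vectors
doc_vector = vectorizer.transform([target_document])               # (2)-(3) parse and vectorize document

similarities = cosine_similarity(doc_vector, keyword_matrix)[0]    # (4) cosine similarity
ranked = sorted(zip(keyword_sets, similarities), key=lambda t: -t[1])
print(ranked[:2])                                                  # (5) highest-similarity keywords
```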

A Study on Market Size Estimation Method by Product Group Using Word2Vec Algorithm (Word2Vec을 활용한 제품군별 시장규모 추정 방법에 관한 연구)

  • Jung, Ye Lim;Kim, Ji Hui;Yoo, Hyoung Sun
    • Journal of Intelligence and Information Systems
    • /
    • v.26 no.1
    • /
    • pp.1-21
    • /
    • 2020
  • With the rapid development of artificial intelligence technology, various techniques have been developed to extract meaningful information from unstructured text data, which constitutes a large portion of big data. Over the past decades, text mining technologies have been utilized in various industries for practical applications. In the field of business intelligence, text mining has been employed to discover new market and/or technology opportunities and to support rational decision making by business participants. Market information such as market size, market growth rate, and market share is essential for setting a company's business strategy. There has been continuous demand in various fields for product-level market information. However, such information has generally been provided at the industry level or for broad categories based on classification standards, making it difficult to obtain specific and proper information. In this regard, we propose a new methodology that can estimate the market sizes of product groups at more detailed levels than previously offered. We applied the Word2Vec algorithm, a neural network-based semantic word embedding model, to enable automatic market size estimation from individual companies' product information in a bottom-up manner. The overall process is as follows: First, data related to product information is collected, refined, and restructured into a form suitable for applying the Word2Vec model. Next, the preprocessed data is embedded into a vector space by Word2Vec, and the product groups are derived by extracting similar product names based on cosine similarity calculation. Finally, the sales data of the extracted products is summed to estimate the market size of the product groups. As experimental data, text data of product names from Statistics Korea's microdata (345,103 cases) were mapped into a multidimensional vector space by Word2Vec training. We optimized the training parameters and then applied a vector dimension of 300 and a window size of 15 for further experiments. We employed index words of the Korean Standard Industry Classification (KSIC) as a product name dataset to cluster product groups more efficiently. Product names similar to the KSIC index words were extracted based on cosine similarity, and the market size of the extracted products, as one product category, was calculated from individual companies' sales data. The market sizes of 11,654 specific product lines were automatically estimated by the proposed model. For performance verification, the results were compared with the actual market sizes of some items; Pearson's correlation coefficient was 0.513. Our approach has several advantages over previous studies. First, text mining and machine learning techniques were applied for the first time to market size estimation, overcoming the limitations of traditional methods that rely on sampling or multiple assumptions. In addition, the level of market category can be easily and efficiently adjusted according to the purpose of information use by changing the cosine similarity threshold. Furthermore, it has high potential for practical applications since it can resolve unmet needs for detailed market size information in the public and private sectors. Specifically, it can be utilized in technology evaluation and technology commercialization support programs conducted by governmental institutions, as well as in business strategy consulting and market analysis reports published by private firms. The limitation of our study is that the presented model needs to be improved in terms of accuracy and reliability. The semantics-based word embedding module could be advanced by imposing a proper ordering on the preprocessed dataset or by combining another measure, such as Jaccard similarity, with Word2Vec. Also, the product group clustering could be replaced with other types of unsupervised machine learning algorithms. Our group is currently working on subsequent studies, and we expect that they can further improve the performance of the basic model conceptually proposed in this study.
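A minimal sketch of the bottom-up estimation loop above, using gensim's Word2Vec. The product names, sales figures, index word, and similarity threshold are all illustrative (in the study, KSIC index words and Statistics Korea microdata were used, and with a corpus this small the learned similarities are essentially noise):

```python
# Hypothetical sketch: embed product names, pull products whose name tokens are
# similar to a category index word, and sum their sales as the market size.
from gensim.models import Word2Vec

products = [
    (["industrial", "robot", "arm"], 120.0),   # (tokenized product name, sales in billion KRW)
    (["collaborative", "robot"], 45.0),
    (["frozen", "dumpling"], 300.0),
]

w2v = Word2Vec([name for name, _ in products], vector_size=50, window=5, min_count=1, seed=0)

def market_size(index_word, threshold=0.3):
    """Sum sales of products whose name contains a token similar to the index word."""
    total = 0.0
    for name, sales in products:
        sim = max(w2v.wv.similarity(index_word, tok) for tok in name)
        if sim >= threshold:
            total += sales
    return total

print("estimated market size for 'robot':", market_size("robot"))
```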

Transfer Learning using Multiple ConvNet Layers Activation Features with Principal Component Analysis for Image Classification (전이학습 기반 다중 컨볼류션 신경망 레이어의 활성화 특징과 주성분 분석을 이용한 이미지 분류 방법)

  • Byambajav, Batkhuu;Alikhanov, Jumabek;Fang, Yang;Ko, Seunghyun;Jo, Geun Sik
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.1
    • /
    • pp.205-225
    • /
    • 2018
  • Convolutional Neural Network (ConvNet) is one class of powerful Deep Neural Networks that can analyze and learn hierarchies of visual features. The first such neural network (the Neocognitron) was introduced in the 1980s. At that time, neural networks were not broadly used in industry or academia because of the shortage of large-scale datasets and low computational power. A few decades later, in 2012, Krizhevsky made a breakthrough in the ILSVRC-12 visual recognition competition using a Convolutional Neural Network, which revived interest in neural networks. The success of Convolutional Neural Networks rests on two main factors. The first is the emergence of advanced hardware (GPUs) for sufficient parallel computation. The second is the availability of large-scale datasets such as the ImageNet (ILSVRC) dataset for training. Unfortunately, many new domains are bottlenecked by these factors. For most domains, it is difficult and requires a lot of effort to gather a large-scale dataset to train a ConvNet. Moreover, even with a large-scale dataset, training a ConvNet from scratch requires expensive resources and is time-consuming. These two obstacles can be addressed by transfer learning, a method for transferring knowledge from a source domain to a new domain. There are two major transfer learning cases: using the ConvNet as a fixed feature extractor, and fine-tuning the ConvNet on a new dataset. In the first case, a pre-trained ConvNet (for example, trained on ImageNet) is used to compute feed-forward activations of the image, and activation features are extracted from specific layers. In the second case, the ConvNet classifier is replaced and retrained on the new dataset, and the weights of the pre-trained network are then fine-tuned with backpropagation. In this paper, we focus on using multiple ConvNet layers as a fixed feature extractor only. However, applying high-dimensional features extracted directly from multiple ConvNet layers is still a challenging problem. We observe that features extracted from multiple ConvNet layers capture different characteristics of the image, which means that a better representation can be obtained by finding the optimal combination of multiple ConvNet layers. Based on that observation, we propose to employ multiple ConvNet layer representations for transfer learning instead of a single ConvNet layer representation. Overall, our primary pipeline has three steps. First, images from the target task are fed forward through the pre-trained AlexNet, and the activation features from the three fully connected layers are extracted. Second, the activation features of the three layers are concatenated to obtain a multiple ConvNet layer representation, because it carries more information about the image. When the three fully connected layer features are concatenated, the resulting image representation has 9,192 (4096+4096+1000) dimensions. However, features extracted from multiple ConvNet layers are redundant and noisy since they come from the same ConvNet. Thus, in a third step, we use Principal Component Analysis (PCA) to select salient features before the training phase. When salient features are obtained, the classifier can classify images more accurately, and the performance of transfer learning is improved. To evaluate the proposed method, experiments were conducted on three standard datasets (Caltech-256, VOC07, and SUN397) to compare multiple ConvNet layer representations against a single ConvNet layer representation, using PCA for feature selection and dimension reduction. Our experiments demonstrated the importance of feature selection for the multiple ConvNet layer representation. Moreover, our proposed approach achieved 75.6% accuracy compared to 73.9% achieved by the FC7 layer on the Caltech-256 dataset, 73.1% compared to 69.2% achieved by the FC8 layer on the VOC07 dataset, and 52.2% compared to 48.7% achieved by the FC7 layer on the SUN397 dataset. We also showed that our proposed approach achieved superior performance, with accuracy improvements of 2.8%, 2.1%, and 3.1% on Caltech-256, VOC07, and SUN397, respectively, compared to existing work.
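A hypothetical sketch of the three-step pipeline above using torchvision's pre-trained AlexNet: capture FC6/FC7/FC8 activations with forward hooks, concatenate them into a 9,192-dimensional representation, reduce with PCA, and train a simple classifier. The random images, labels, and the PCA component count are placeholders for an actual dataset such as Caltech-256, and the logistic-regression classifier is one reasonable stand-in:

```python
import torch
import torchvision.models as models
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Pre-trained AlexNet as a fixed feature extractor (downloads ImageNet weights on first use).
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

captured = {}
def hook(name):
    return lambda module, inp, out: captured.__setitem__(name, out.detach())

# FC6, FC7, FC8 in torchvision's AlexNet classifier.
alexnet.classifier[1].register_forward_hook(hook("fc6"))
alexnet.classifier[4].register_forward_hook(hook("fc7"))
alexnet.classifier[6].register_forward_hook(hook("fc8"))

def multi_layer_features(images):
    with torch.no_grad():
        alexnet(images)
    return torch.cat([captured["fc6"], captured["fc7"], captured["fc8"]], dim=1)  # (N, 9192)

images = torch.randn(40, 3, 224, 224)        # placeholder batch; use real images in practice
labels = torch.randint(0, 4, (40,)).numpy()  # placeholder class labels

feats = multi_layer_features(images).numpy()
feats = PCA(n_components=20).fit_transform(feats)    # keep salient components only
clf = LogisticRegression(max_iter=1000).fit(feats, labels)
print("training accuracy on placeholder data:", clf.score(feats, labels))
```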