• Title/Summary/Keyword: 단어군집화 (word clustering)


Domain Analysis of Research on Prediction and Analysis of Slope Failure by Co-Word Analysis (동시출현단어 분석을 활용한 비탈면 붕괴 예측 및 분석 연구에 관한 지적구조 분석)

  • Kim, Sun-Kyum;Kim, Seung-Hyun
    • The Journal of Engineering Geology / v.31 no.3 / pp.307-319 / 2021
  • Slope management and research already make use of digital technologies such as drones, big data, and artificial intelligence, but these efforts remain insufficient and slopes are still vulnerable to failure. To deal with slope failure effectively, a development direction for research on its prediction and analysis using these digital technologies must be presented, which first requires an understanding of the existing research. In this paper, we collected literature data from the Web of Science for the five years from January 1, 2016 to December 31, 2020 and applied co-word analysis to identify the domain structure of research on prediction and analysis of slope failure. Detailed subject areas were identified through network analysis, and the domain relationships between keywords were visualized to derive globally and regionally oriented keywords through relationship and centrality analysis. In addition, the clusters formed by cluster analysis were displayed on a multidimensional scaling map, and the domain structure according to the correlation between keywords was presented. The results reveal the domain structure of research on prediction and analysis of slope failure and are expected to help identify future research directions.
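
The co-word step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the keyword lists are invented, the function names `cooccurrence_counts` and `degree_centrality` are hypothetical, and plain neighbor counting stands in for whichever centrality measures the paper actually used.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(keyword_lists):
    """Count how often each unordered keyword pair appears in the same record."""
    pair_counts = Counter()
    for keywords in keyword_lists:
        # Deduplicate and sort so every pair has a single canonical ordering.
        unique = sorted(set(keywords))
        pair_counts.update(combinations(unique, 2))
    return pair_counts


def degree_centrality(pair_counts):
    """Number of distinct co-occurring partners per keyword (unnormalized degree)."""
    neighbors = {}
    for a, b in pair_counts:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return {keyword: len(partners) for keyword, partners in neighbors.items()}


# Hypothetical keyword lists, one per bibliographic record (not data from the paper).
records = [
    ["slope failure", "landslide", "rainfall"],
    ["slope failure", "machine learning", "landslide"],
    ["rainfall", "slope failure"],
]

pairs = cooccurrence_counts(records)
centrality = degree_centrality(pairs)
```

The pair counts form the co-occurrence matrix that network analysis, clustering, and multidimensional scaling would then operate on.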

Performance of Korean spontaneous speech recognizers based on an extended phone set derived from acoustic data (음향 데이터로부터 얻은 확장된 음소 단위를 이용한 한국어 자유발화 음성인식기의 성능)

  • Bang, Jeong-Uk;Kim, Sang-Hun;Kwon, Oh-Wook
    • Phonetics and Speech Sciences / v.11 no.3 / pp.39-47 / 2019
  • We propose a method to improve the performance of spontaneous speech recognizers by extending their phone set using speech data. In the proposed method, we first extract variable-length phoneme-level segments from broadcast speech signals and convert them to fixed-length latent vectors using a long short-term memory (LSTM) classifier. We then cluster acoustically similar latent vectors and build a new phone set by choosing the number of clusters with the lowest Davies-Bouldin index. We also update the lexicon of the speech recognizer by choosing, for each word, the pronunciation sequence with the highest conditional probability. To analyze the acoustic characteristics of the new phone set, we visualize its spectral patterns and segment durations. Through speech recognition experiments using a larger training data set than in our previous work, we confirm that the new phone set yields better performance than the conventional phoneme-based and grapheme-based units in both spontaneous and read speech recognition.
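
Choosing the number of clusters by the lowest Davies-Bouldin index can be illustrated with a small sketch. It makes several assumptions: the 2-D points stand in for the LSTM latent vectors, the candidate partitions are written by hand rather than produced by a clustering algorithm (the abstract does not name one), and `davies_bouldin` is a hypothetical helper implementing the standard definition with Euclidean distance.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def davies_bouldin(clusters):
    """Davies-Bouldin index for a partition with at least two clusters.

    Lower is better. Assumes no two clusters share the same centroid,
    since that would cause a division by zero.
    """
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance from each point to its own cluster centroid.
    scatter = [
        sum(math.dist(p, center) for p in cluster) / len(cluster)
        for cluster, center in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Hypothetical candidate partitions of six toy vectors, keyed by cluster count.
candidates = {
    2: [[(0, 0), (0, 1), (10, 0), (10, 1)], [(5, 10), (5, 11)]],
    3: [[(0, 0), (0, 1)], [(10, 0), (10, 1)], [(5, 10), (5, 11)]],
}

scores = {k: davies_bouldin(partition) for k, partition in candidates.items()}
best_k = min(scores, key=scores.get)
```

Here the three-cluster partition scores lower because merging two distant groups inflates within-cluster scatter.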

Headword Finding System Using Document Expansion (문서 확장을 이용한 표제어 검색시스템)

  • Kim, Jae-Hoon;Kim, Hyung-Chul
    • Journal of Information Management / v.42 no.4 / pp.137-154 / 2011
  • A headword finding system is defined as an information retrieval system that uses a word gloss as a query. To implement such a system, we treat each gloss as a document. Glosses are generally very short, which makes it difficult to find the most appropriate headword for a given query. To alleviate this problem, we expand the document by applying the concept of query expansion from information retrieval. In this paper, we use two document expansion methods: gloss expansion and similar word expansion. The former inserts the glosses of the words that appear in the document into a seed document; the latter inserts similar words into a seed document. We use a featureless clustering algorithm to obtain the similar words. The performance (r-inclusion rate) reaches almost 100% when the queries are word glosses and r is 16, and 66.9% when the queries are written by users themselves. Through several experiments, we have observed that document expansion is very useful for the headword finding system. In the future, new measures, including our proposed r-inclusion rate, are required for evaluating headword finding systems, and new evaluation sets are also needed for objective assessment.
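
The abstract does not define the r-inclusion rate. The sketch below assumes it means the fraction of queries whose correct headword appears among the top r retrieved candidates; that reading, the function name `r_inclusion_rate`, and the toy rankings are all assumptions made for illustration.

```python
def r_inclusion_rate(ranked_results, gold_headwords, r):
    """Fraction of queries whose gold headword is within the top-r ranked candidates.

    ranked_results: one ranked candidate list per query.
    gold_headwords: the correct headword for each query, in the same order.
    """
    hits = sum(
        1
        for ranked, gold in zip(ranked_results, gold_headwords)
        if gold in ranked[:r]
    )
    return hits / len(gold_headwords)


# Hypothetical retrieval output for three gloss queries (not data from the paper).
ranked = [
    ["apple", "pear", "plum"],
    ["dog", "cat", "fox"],
    ["oak", "elm", "ash"],
]
gold = ["pear", "fox", "yew"]

rate_at_2 = r_inclusion_rate(ranked, gold, 2)
rate_at_3 = r_inclusion_rate(ranked, gold, 3)
```

Under this reading the rate can only stay the same or rise as r grows, which is consistent with the abstract reporting its best figure at r = 16.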

Mention Detection and Coreference Resolution Pipeline Model for Dialogue Data (대화 데이터를 위한 멘션 탐지 및 상호참조해결 파이프라인 모델)

  • Kim, Damrin;Kim, Hongjin;Park, Seongsik;Kim, Harksoo
    • Annual Conference on Human and Language Technology / 2021.10a / pp.264-269 / 2021
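  • Coreference resolution is a natural language processing task that extracts the mentions in a given document that can be targets of coreference resolution and finds pairs or sets of mentions that refer to the same entity. Nested mentions, in which one mention contains other words that can themselves be mentions, cannot be handled by sequential labeling. To solve this problem, this paper proposes a mention detection model that tags the position of a mention's first word with an opening bracket ('(') and the position of its last word with a closing bracket (')') and predicts these brackets, together with a coreference resolution model that uses a pointer network to cluster the mentions predicted by the mention detection model that refer to the same entity. In experiments on four English dialogue datasets, the mention detection model achieves F1 scores of 94.17% (Light), 90.86% (AMI), 92.93% (Persuasion), and 91.04% (Switchboard), and the coreference resolution model achieves CoNLL F1 scores of 69.1% (Light), 57.6% (AMI), 71.0% (Persuasion), and 65.7% (Switchboard).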
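
The bracket tagging scheme for nested mentions can be sketched as below. This shows only how mention spans might be encoded as per-token bracket counts and recovered again. The paper's model predicts the brackets with a neural network, and its actual decoding rule is not given in the abstract, so the stack-based pairing, the function names, and the example sentence are assumptions.

```python
def encode_brackets(num_tokens, mentions):
    """Turn (start, end) mention spans (inclusive token indices) into bracket counts.

    Returns two lists: opening brackets per token and closing brackets per token.
    """
    opens = [0] * num_tokens
    closes = [0] * num_tokens
    for start, end in mentions:
        opens[start] += 1
        closes[end] += 1
    return opens, closes


def decode_brackets(opens, closes):
    """Recover spans by pairing each closing bracket with the latest unmatched opening.

    This recovers properly nested spans; crossing spans cannot be represented.
    """
    stack, mentions = [], []
    for i, (n_open, n_close) in enumerate(zip(opens, closes)):
        stack.extend([i] * n_open)
        for _ in range(n_close):
            if stack:  # skip a closing bracket that has no matching opening
                mentions.append((stack.pop(), i))
    return sorted(mentions)


# Hypothetical sentence: "the shop" is nested inside "the owner of the shop".
tokens = ["the", "owner", "of", "the", "shop", "left"]
gold_mentions = [(0, 4), (3, 4)]

opens, closes = encode_brackets(len(tokens), gold_mentions)
recovered = decode_brackets(opens, closes)
```

Because a token can carry several brackets at once, two mentions ending on the same word remain distinguishable, which a single label per token could not express.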


Professional Baseball Viewing Culture Survey According to Corona 19 using Social Network Big Data (소셜네트워크 빅데이터를 활용한 코로나 19에 따른 프로야구 관람문화조사)

  • Kim, Gi-Tak
    • Journal of Korea Entertainment Industry Association / v.14 no.6 / pp.139-150 / 2020
  • This study processes Textom and social media words in three areas: 'Corona 19 and professional baseball', 'Corona 19 and professional baseball', and 'Corona 19 and professional sports'. The data were collected and refined in a web environment, processed in batch, and visualized with the Ucinet6 program. Specifically, the data were collected from the Naver, Daum, and Google channels, and the extracted words were narrowed down to 30 words through expert meetings for use in the final study. The 30 words were visualized as a matrix, and a CONCOR analysis was performed to identify clusters based on the similarity and commonality of the words. The analysis showed that the clusters related to Corona 19 and professional baseball consisted of one central cluster and five peripheral clusters, and that content on the opening of the professional baseball season amid the Corona 19 wave was mainly searched. The clusters related to Corona 19 and unrelated to professional baseball also consisted of one central cluster and five peripheral clusters, and keywords on the position of professional baseball regarding games under Corona 19 were mainly searched. The clusters related to Corona 19 and professional sports likewise consisted of one central cluster and five peripheral clusters, and keywords on the start of professional sports in the aftermath of Corona 19 were mainly searched.

Analyzing the discriminative characteristic of cover letters using text mining focused on Air Force applicants (텍스트 마이닝을 이용한 공군 부사관 지원자 자기소개서의 차별적 특성 분석)

  • Kwon, Hyeok;Kim, Wooju
    • Journal of Intelligence and Information Systems / v.27 no.3 / pp.75-94 / 2021
  • The low birth rate and the shortened military service period are raising concerns about selecting excellent military officers. The Republic of Korea became a low-birth-rate society in 1984 and an aged society in 2018, and is expected to become a super-aged society in 2025. In addition, the military is shifting from a troop-oriented force to one centered on state-of-the-art weapons, and the service period was reduced in 2018 to ease the burden of military service on young people and let them enter society earlier. Some observe that the application rate for military officers is falling because of the shrinking manpower pool and a preference for shortened mandatory service over serving as an officer. This calls for further consideration of policies for securing excellent military officers. Most related studies have used social science methodologies, but this study applies text mining, which is suited to analyzing large volumes of documents. It extracts words with discriminative characteristics from the cover letters of Republic of Korea Air Force non-commissioned officer applicants and analyzes their pass and fail polarity. The method consists of three steps. First, the applications are divided into general and technical fields, and the words in the cover letters are ordered by the difference in their frequency ratios between the fields. The greater the difference in a word's proportion between the application fields, the 'more discriminative' the word is defined to be. On this basis, we extract the top 50 words representing discriminative characteristics in the general field and the top 50 in the technical field. Second, the appropriate number of topics for the full set of cover letters is determined through LDA, using the perplexity score and the coherence score. Based on that number of topics, we then use LDA to generate topics and probabilities and estimate which topic each discriminative word belongs to. Subsequently, the keyword indicators of the questions are used as candidate labels, and the most appropriate indicator, given the topic-specific word distribution, is set as the label for each topic. Third, using L-LDA with pass and fail as the labels of the cover letters, we generate topics and probabilities for the pass and fail labels in each field. We then keep only the discriminative words that belong to labeled topics among the topics and probabilities generated for the pass and fail labels. Next, for each such word, we compute the difference between its probability under the pass label and its probability under the fail label. A positive value can be read as pass polarity and a negative value as fail polarity. This study is the first to reflect the characteristics of cover letters from Republic of Korea Air Force non-commissioned officer applicants rather than from the private sector. Moreover, the methodology applies text mining to many documents, rather than relying on survey or interview methods, which reduces analysis time and increases reliability across the entire population. For this reason, the proposed methodology is also applicable to other collections of documents in the field of military personnel. The study shows that L-LDA is more suitable than LDA for extracting discriminative characteristics from these cover letters, and it proposes a methodology that combines LDA and L-LDA. Through the analysis of the results of Republic of Korea Air Force non-commissioned officer acquisition, we aim to provide information useful for acquisition and promotional policies and to propose a methodology for research on military manpower acquisition.
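
Two of the numeric steps in this abstract, ranking words by the difference in frequency ratio between fields and reading polarity from the pass/fail probability difference, can be sketched as follows. The tokenized documents and probabilities are invented, the function names are hypothetical, and the LDA and L-LDA stages that would supply real probabilities are not shown.

```python
from collections import Counter


def frequency_ratios(documents):
    """Relative frequency of each word across a list of tokenized documents."""
    counts = Counter(word for doc in documents for word in doc)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def discriminative_words(field_docs, other_docs, top_n):
    """Words ranked by (ratio in field) minus (ratio in the other field), highest first."""
    ratio_field = frequency_ratios(field_docs)
    ratio_other = frequency_ratios(other_docs)
    vocabulary = set(ratio_field) | set(ratio_other)
    diff = {w: ratio_field.get(w, 0.0) - ratio_other.get(w, 0.0) for w in vocabulary}
    return sorted(diff, key=lambda w: diff[w], reverse=True)[:top_n]


def polarity(p_pass, p_fail):
    """Per-word probability difference: positive leans pass, negative leans fail."""
    vocabulary = set(p_pass) | set(p_fail)
    return {w: p_pass.get(w, 0.0) - p_fail.get(w, 0.0) for w in vocabulary}


# Hypothetical tokenized cover letters for the two application fields.
general = [["leadership", "service", "team"], ["service", "duty"]]
technical = [["maintenance", "aircraft", "team"], ["aircraft", "repair"]]

top_general = discriminative_words(general, technical, 1)
top_technical = discriminative_words(technical, general, 1)

# Hypothetical word probabilities under the pass and fail labels.
polarity_scores = polarity(
    {"responsibility": 0.05, "hobby": 0.01},
    {"responsibility": 0.02, "hobby": 0.03},
)
```

A word such as "team" that is equally frequent in both fields gets a difference of zero and therefore carries no discriminative weight.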

News Topic Extraction based on Word Similarity (단어 유사도를 이용한 뉴스 토픽 추출)

  • Jin, Dongxu;Lee, Soowon
    • Journal of KIISE / v.44 no.11 / pp.1138-1148 / 2017
  • Topic extraction is a technology that automatically extracts a set of topics from a set of documents, and it has been a major research topic in natural language processing. Representative topic extraction methods include Latent Dirichlet Allocation (LDA) and word clustering-based methods. However, these methods suffer from repeated topics and mixed topics. The repeated-topic problem is one in which a specific topic is extracted as several topics, while the mixed-topic problem is one in which several topics are mixed in a single extracted topic. To solve these problems, this study proposes a method that extracts topics using LDA, which is robust against the repeated-topic problem, and then corrects the extracted topics by separating and merging them using the similarity between words. In experiments, the proposed method performed better than the conventional LDA method.

The Relationship Between Character and Costume in literary Work using Semantic networks -The novel 「Norwegian Wood」- (시맨틱 네트워크를 통한 문학작품 속 인물과 의상의 관계 -소설 「노르웨이의 숲」-)

  • Choi, Yeong-Hyeon;Kim, Seong Eun;Lee, Kyu-Hye
    • Journal of Digital Convergence / v.19 no.1 / pp.307-314 / 2021
  • This study applied the principle of the semantic network to a long novel in an attempt to understand the structure of the entire document and the relationships manifested between words. The costume expressions in Murakami's novel Norwegian Wood were analyzed based on the characters' symbols, relationships, and personality characteristics. The study identified the symbols of the characters in the novel and the properties of the relationships between them through the Clauset-Newman-Moore clustering algorithm. The descriptions and symbols of the relationships between the characters were identified within the worldview that the author had intended. Further, it was confirmed that the expression of each costume according to the character's personality was also connected to the clues that explained that character. This fusion study is academically significant in that it presents a new methodology for analyzing literary works.

Towards Automatic Evaluation of Category Fluency Test Performance : Distinguishing Groups using Word Clustering (자동 범주유창성검사 평가를 향하여: 단어 군집화를 활용한 그룹간 구별)

  • Lee, Yong-Jae;Wolters, Maria;Lee, Hee-Jin;Park, Jong-C.
    • Proceedings of the Korean Information Science Society Conference / 2012.06b / pp.471-473 / 2012
  • The Category Fluency Test (CFT) is a widely used verbal fluency test. The standard measure for scoring the test is the number of distinct words that a subject generates during the test. Recently, other measures, such as clustering and switching, have also been proposed to evaluate performance. In this study, we examine whether clusters and switches can be assessed using word similarity measures. Based on these measures, we can distinguish between subject groups.
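
One way clusters and switches might be derived from a word similarity measure is sketched below. The abstract does not say how the authors operationalized them, so the rule used here (a switch occurs whenever the similarity between consecutive words falls below a threshold), the toy similarity function, and the threshold value are all assumptions.

```python
def clusters_and_switches(words, similarity, threshold):
    """Split a response sequence into clusters and count the switches between them.

    A switch is counted whenever consecutive words have similarity below threshold.
    Returns (number_of_switches, list_of_cluster_sizes).
    """
    if not words:
        return 0, []
    switches = 0
    cluster_sizes = [1]
    for previous, current in zip(words, words[1:]):
        if similarity(previous, current) < threshold:
            switches += 1
            cluster_sizes.append(1)  # start a new cluster
        else:
            cluster_sizes[-1] += 1  # extend the current cluster
    return switches, cluster_sizes


# Hypothetical subcategory table standing in for a real word similarity measure.
SUBCATEGORY = {"cat": "pet", "dog": "pet", "lion": "wild", "tiger": "wild", "cow": "farm"}


def toy_similarity(a, b):
    """1.0 if both words share a subcategory, otherwise 0.0."""
    return 1.0 if SUBCATEGORY[a] == SUBCATEGORY[b] else 0.0


response = ["cat", "dog", "lion", "tiger", "cow"]
n_switches, sizes = clusters_and_switches(response, toy_similarity, 0.5)
```

With a graded similarity measure in place of the toy table, the threshold becomes the parameter that controls how readily a new cluster is started.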

sent2dl : Augmenting Distributional Semantics to Symbolic Sentence Meaning Representation based on Description Logic SROIQ (sent2dl : 기술논리 SROIQ 기반 기호적 문장 의미 표상에 분산 표상 더하기)

  • Schin, Seung-Woo;Oh, Ju-Min;Noh, Hyung-Jong;Lee, Yeon-Soo
    • Annual Conference on Human and Language Technology / 2020.10a / pp.199-204 / 2020
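  • Existing methods for representing natural language meaning fall broadly into two groups. The first is the traditional symbol-based approach to meaning representation. These approaches have the advantage of being logical and interpretable, but they take a long time to build and make it hard to capture the meaning of the symbols themselves at a finer grain. In contrast, the recently emerged distributed representations capture the meaning of individual words relatively well, but they are relatively weak at representing the meaning of complex structures such as sentences and are not interpretable. This paper proposes a new meaning representation that combines the strengths of the two so that each compensates for the other's weaknesses, and shows indirectly, through an unsupervised sentence clustering task, that the representation meaningfully captures sentence meaning.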
