• Title/Summary/Keyword: Text features

Search Result 580, Processing Time 0.029 seconds

Topic modeling for automatic classification of learner question and answer in teaching-learning support system (교수-학습지원시스템에서 학습자 질의응답 자동분류를 위한 토픽 모델링)

  • Kim, Kyungrog;Song, Hye jin;Moon, Nammee
    • Journal of Digital Contents Society
    • /
    • v.18 no.2
    • /
    • pp.339-346
    • /
    • 2017
  • There is increasing interest in text analysis based on unstructured data such as articles and comments, questions and answers. This is because they can be used to identify, evaluate, predict, and recommend features from unstructured text data, which is the opinion of people. The same holds true for TEL, where the MOOC service has evolved to automate debating, questioning and answering services based on the teaching-learning support system in order to generate question topics and to automatically classify the topics relevant to new questions based on question and answer data accumulated in the system. Therefore, in this study, we propose topic modeling using LDA to automatically classify new query topics. The proposed method enables the generation of a dictionary of question topics and the automatic classification of topics relevant to new questions. Experimentation showed high automatic classification of over 0.7 in some queries. The more new queries were included in the various topics, the better the automatic classification results.

Improving the Performance of Korean Text Chunking by Machine learning Approaches based on Feature Set Selection (자질집합선택 기반의 기계학습을 통한 한국어 기본구 인식의 성능향상)

  • Hwang, Young-Sook;Chung, Hoo-jung;Park, So-Young;Kwak, Young-Jae;Rim, Hae-Chang
    • Journal of KIISE:Software and Applications
    • /
    • v.29 no.9
    • /
    • pp.654-668
    • /
    • 2002
  • In this paper, we present an empirical study for improving the Korean text chunking based on machine learning and feature set selection approaches. We focus on two issues: the problem of selecting feature set for Korean chunking, and the problem of alleviating the data sparseness. To select a proper feature set, we use a heuristic method of searching through the space of feature sets using the estimated performance from a machine learning algorithm as a measure of "incremental usefulness" of a particular feature set. Besides, for smoothing the data sparseness, we suggest a method of using a general part-of-speech tag set and selective lexical information under the consideration of Korean language characteristics. Experimental results showed that chunk tags and lexical information within a given context window are important features and spacing unit information is less important than others, which are independent on the machine teaming techniques. Furthermore, using the selective lexical information gives not only a smoothing effect but also the reduction of the feature space than using all of lexical information. Korean text chunking based on the memory-based learning and the decision tree learning with the selected feature space showed the performance of precision/recall of 90.99%/92.52%, and 93.39%/93.41% respectively.

Developing of Text Plagiarism Detection Model using Korean Corpus Data (한글 말뭉치를 이용한 한글 표절 탐색 모델 개발)

  • Ryu, Chang-Keon;Kim, Hyong-Jun;Cho, Hwan-Gue
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.14 no.2
    • /
    • pp.231-235
    • /
    • 2008
  • Recently we witnessed a few scandals on plagiarism among academic paper and novels. Plagiarism on documents is getting worse more frequently. Although plagiarism on English had been studied so long time, we hardly find the systematic and complete studies on plagiarisms in Korean documents. Since the linguistic features of Korean are quite different from those of English, we cannot apply the English-based method to Korean documents directly. In this paper, we propose a new plagiarism detecting method for Korean, and we throughly tested our algorithm with one benchmark Korean text corpus. The proposed method is based on "k-mer" and "local alignment" which locates the region of plagiarized document pairs fast and accurately. Using a Korean corpus which contains more than 10 million words, we establish a probability model (or local alignment score (random similarity by chance). The experiment has shown that our system was quite successful to detect the plagiarized documents.

Features of the Rural Revitalization Projects in Jang-su County Using LDA Topic Analysis of News Data - Focused on Keyword of Tourism and Livelihood - (뉴스데이터의 LDA 토픽 분석을 통한 장수군 농촌지역 활성화 사업의 특징 - 관광·생활 키워드를 중심으로 -)

  • Kim, Young-Jin;Son, Yong-hoon
    • Journal of Korean Society of Rural Planning
    • /
    • v.24 no.4
    • /
    • pp.69-80
    • /
    • 2018
  • In this study, we typified the project for revitalizing the rural area through text analysis using news data, and analyzed the main direction and characteristics of the project. In order to examine the factors emphasized among the issues related to the revitalization of rural areas, we used news data related to 'tourism' and 'livelihood', which are the main keyword of the project to promote rural areas. In the analysis, text mining techniques were used. Topic modeling was conducted on LDA techniques for major projects in 'tourism' and 'livelihood' keyword. Based on this, this study typified the projects that are carried out for the activation of rural areas by topic. As a result of the analysis, it was fount that the topics included in the project were distributed in 11 sub-types(Tourism Promotion, Regional Specialization, Local Festival, Development of Regional Scale, Urban and Rural Exchange, Agricultural Support, Community Forest Management, Improve the Settlement Environment, General Welfare Service, Low Class Support, Others). The characteristics of the rural revitalization projects were examined, and it was confirmed that domestic projects were carried out by tourism-oriented projects. To summarize, the government is making projects to revitalize rural areas through related ministries. Within the structure where the project is spreading to the region, a lot of projects are being carried out. It is understood that the tourism and welfare oriented projects are being carried out in the revitalization project of the domestic rural area. Therefore, in order to achieve the goal of rural revitalization, it is believed that it will be effective to carry out a balanced project to improve the settlement environment of the residents.

A study on research trends for gestational diabetes mellitus and breastfeeding: Focusing on text network analysis and topic modeling (임신성 당뇨와 모유수유에 대한 연구 동향 분석: 텍스트네트워크 분석과 토픽모델링 중심)

  • Lee, Junglim;Kim, Youngji;Kwak, Eunju;Park, Seungmi
    • The Journal of Korean Academic Society of Nursing Education
    • /
    • v.27 no.2
    • /
    • pp.175-185
    • /
    • 2021
  • Purpose: The aim of this study was to identify core keywords and topic groups in the 'Gestational diabetes mellitus (GDM) and Breastfeeding' field of research for better understanding research trends in the past 20 years. Methods: This was a text-mining and topic modeling study composed of four steps: 1) collecting abstracts, 2) extracting and cleaning semantic morphemes, 3) building a co-occurrence matrix, and 4) analyzing network features and clustering topic groups. Results: A total of 635 papers published between 2001 and 2020 were found in databases (Web of Science, CINAHL, RISS, DBPIA, RISS, KISS). Among them, 3,639 words extracted from 366 articles selected according to the conditions were analyzed by text network analysis and topic modeling. The most important keywords were 'exposure', 'fetus', 'hypoglycemia', 'prevention' and 'program'. Six topic groups were identified through topic modeling. The main topics of the study were 'cardiovascular disease' and 'obesity'. Through the topic modeling analysis, six themes were derived: 'cardiovascular disease', 'obesity', 'complication prevention strategy', 'support of breastfeeding', 'educational program' and 'management of GDM'. Conclusion: This study showed that over the past 20 years many studies have been conducted on complications such as cardiovascular diseases and obesity related to gestational diabetes and breastfeeding. In order to prevent complications of gestational diabetes and promote breastfeeding, various nursing interventions, including gestational diabetes management and educational programs for GDM pregnancies, should be developed in nursing fields.

Question Similarity Measurement of Chinese Crop Diseases and Insect Pests Based on Mixed Information Extraction

  • Zhou, Han;Guo, Xuchao;Liu, Chengqi;Tang, Zhan;Lu, Shuhan;Li, Lin
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.15 no.11
    • /
    • pp.3991-4010
    • /
    • 2021
  • The Question Similarity Measurement of Chinese Crop Diseases and Insect Pests (QSM-CCD&IP) aims to judge the user's tendency to ask questions regarding input problems. The measurement is the basis of the Agricultural Knowledge Question and Answering (Q & A) system, information retrieval, and other tasks. However, the corpus and measurement methods available in this field have some deficiencies. In addition, error propagation may occur when the word boundary features and local context information are ignored when the general method embeds sentences. Hence, these factors make the task challenging. To solve the above problems and tackle the Question Similarity Measurement task in this work, a corpus on Chinese crop diseases and insect pests(CCDIP), which contains 13 categories, was established. Then, taking the CCDIP as the research object, this study proposes a Chinese agricultural text similarity matching model, namely, the AgrCQS. This model is based on mixed information extraction. Specifically, the hybrid embedding layer can enrich character information and improve the recognition ability of the model on the word boundary. The multi-scale local information can be extracted by multi-core convolutional neural network based on multi-weight (MM-CNN). The self-attention mechanism can enhance the fusion ability of the model on global information. In this research, the performance of the AgrCQS on the CCDIP is verified, and three benchmark datasets, namely, AFQMC, LCQMC, and BQ, are used. The accuracy rates are 93.92%, 74.42%, 86.35%, and 83.05%, respectively, which are higher than that of baseline systems without using any external knowledge. Additionally, the proposed method module can be extracted separately and applied to other models, thus providing reference for related research.

Feature selection for text data via topic modeling (토픽 모형을 이용한 텍스트 데이터의 단어 선택)

  • Woosol, Jang;Ye Eun, Kim;Won, Son
    • The Korean Journal of Applied Statistics
    • /
    • v.35 no.6
    • /
    • pp.739-754
    • /
    • 2022
  • Usually, text data consists of many variables, and some of them are closely correlated. Such multi-collinearity often results in inefficient or inaccurate statistical analysis. For supervised learning, one can select features by examining the relationship between target variables and explanatory variables. On the other hand, for unsupervised learning, since target variables are absent, one cannot use such a feature selection procedure as in supervised learning. In this study, we propose a word selection procedure that employs topic models to find latent topics. We substitute topics for the target variables and select terms which show high relevance for each topic. Applying the procedure to real data, we found that the proposed word selection procedure can give clear topic interpretation by removing high-frequency words prevalent in various topics. In addition, we observed that, by applying the selected variables to the classifiers such as naïve Bayes classifiers and support vector machines, the proposed feature selection procedure gives results comparable to those obtained by using class label information.

Analysis of Mission, Vision and Core values in Korean Tertiary General Hospitals Through Text Mining (텍스트 마이닝을 통한 상급종합병원의 미션, 비전, 핵심가치 분석 연구)

  • Ji-Hoon Lee
    • Korea Journal of Hospital Management
    • /
    • v.28 no.2
    • /
    • pp.32-43
    • /
    • 2023
  • Purposes: This research is conducted to identify main features and trends of mission, vision and core values in Korean tertiary general hospitals by using text-mining. Methodology: For the study, 45 mission, 112 vision and 190 core values are collected from 45 tertiary general hospitals' homepages in 2022 and use word frequency analysis and Leyword co-occurrence analysis. Findings: In the tertiary general hospitals' mission, there are high frequency words such as 'health', 'humanity', 'medical treatment', 'education', 'research', 'happiness', 'love', 'best', 'spirit', and mission mainly includes the content of contributing humanity's health and happiness with these words. In case of vision, high frequency words are 'hospital', 'medical treatment', 'research', 'lead', 'trust', 'centered', 'patient', 'best', 'future'. By using these words in vision, it represents the definition and characteristics of vision such as ideal organizations in the future, goals and targets. As a result of the Leyword co-occurrence analysis, vision includes the content of 'high-tech medical treatment', 'special care for patients', 'leading education and research', 'the highest trust with customer', 'creative talents training'. -astly, the high frequency word-pairs in core values are 'social distribution', 'innovation pursuit', 'cooperation and harmony', and it defines standards of behavior for organizations. Practical Implication: To correct the problems of vision, mission and core values from findings, firstly, it needs for Korean tertiary general hospitals to use the words that can explain organization's identity and differentiate others in their mission. Secondly, considering strengthening the role of hospitals in their community and the importance of members in organizations, it is necessary to establish vision with considering community and members to activate vision effectively. Thirdly, because there are no specific guidelines of establishing mission, vision and core values for healthcare organizations, this research concepts and results could be utilized when other organizations establish mission, vision and core values.

  • PDF

A Study on the Design and Implementation of an AI Mock Interview System for Computer Science Interview Preparation Using LLM-based ChatGPT (LLM 기반 ChatGPT를 활용한 컴퓨터 분야 면접 준비용 AI 모의 면접 시스템의 설계 및 구현에 대한 연구)

  • Jae-Sung Chun;Hee-Kwon Jang;Ji-Hye Kim;Chang-Min Bae;Dong-Gyu Lee;Il-Young Moon
    • Journal of Practical Engineering Education
    • /
    • v.16 no.5_spc
    • /
    • pp.643-651
    • /
    • 2024
  • This study aims to design and implement an AI mock interview system for Computer Science (CS) interview preparation using LLM (Large Language Model) based ChatGPT. The system utilizes AI's natural language processing and speech recognition capabilities to analyze and provide real-time feedback on interview responses, helping users improve their weaknesses during the preparation process. According to a survey, 90% of users reported that the real-time feedback function provided substantial assistance in their interview preparation. Key features include GPT prompt generation and Speech-to-Text functionality, which converts voice data into text. The system received positive evaluations for its response time and feedback accuracy. Future research will explore expanding the range of question types and applying the system to various industries.

Analyzing Tasks in the Geometry Area of 7th Grade of Korean and US Textbooks from the Perspective of Mathematical Modeling (수학적 모델링 관점에 따른 한국과 미국의 중학교 1학년 교과서 기하 영역에 제시된 과제 분석)

  • Jung, Hye-Yun;Jung, Jin-Ho;Lee, Kyeong-Hwa
    • Journal of the Korean School Mathematics Society
    • /
    • v.23 no.2
    • /
    • pp.179-201
    • /
    • 2020
  • The purpose of this study is to analyze tasks reflected in Korean and US textbooks according to the mathematical modeling perspectives, and then to compare the diversity of learning opportunities given to students from both countries. For this, we analyzed mathematical modeling tasks of textbooks based on three aspects: mathematical modeling process, data, and expression. Results are as follows. First, with respect to modeling process, Korean textbook provides a high percentage of the task at all stages of modeling than US textbook. Second, with respect to data, both countries' textbooks have the highest percentage of matching task. Korean textbooks have a large gap in data characteristics by textbook. Third, with respect to expression, both countries' textbooks have the highest percentage of text and picture. Korean textbooks have a large gap in the type of expression than US textbooks, and some textbooks have no other expression except for text and picture. Fourth, tasks were analyzed by integrating the three features. The three features were not combined in various ways. It is necessary to diversify the integration of the three features.