
Automatic Detection of Off-topic Documents using ConceptNet and Essay Prompt in Automated English Essay Scoring (영어 작문 자동채점에서 ConceptNet과 작문 프롬프트를 이용한 주제-이탈 문서의 자동 검출)

  • Lee, Kong Joo;Lee, Gyoung Ho
    • Journal of KIISE
    • /
    • v.42 no.12
    • /
    • pp.1522-1534
    • /
    • 2015
  • This work presents a new method that can predict, without the use of training data, whether an input essay is written on a given topic. ConceptNet is a common-sense knowledge base generated automatically from sentences extracted from a variety of document types. An essay prompt is the topic that an essay should be written about. The method proposed in this paper uses ConceptNet and an essay prompt to decide whether or not an input essay is off-topic. We introduce a way to find the shortest path between two nodes in ConceptNet, as well as a way to calculate the semantic similarity between two nodes. Both an essay prompt and a student's essay can be represented by concept nodes in ConceptNet. The semantic similarity between the concepts representing an essay prompt and those representing a student's essay is used to rank "on-topicness"; an essay with a low ranking is regarded as off-topic. We used eight different essay prompts and a student-essay collection for the performance evaluation, in which our proposed method outperforms those of previous studies. As ConceptNet enables simple text inference, our new method looks very promising for designing essay prompts that require simple inference.
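The shortest-path step above can be sketched with plain breadth-first search; the toy graph, node names, and the 1/(1 + distance) scoring below are illustrative assumptions, not the paper's exact formulation or the real ConceptNet data.

```python
from collections import deque

# Toy concept graph standing in for ConceptNet edges (illustrative only).
EDGES = {
    "essay":   ["writing", "school"],
    "writing": ["essay", "pen", "topic"],
    "topic":   ["writing", "prompt"],
    "prompt":  ["topic"],
    "pen":     ["writing"],
    "school":  ["essay", "student"],
    "student": ["school"],
}

def shortest_path_length(graph, start, goal):
    """Breadth-first search for the shortest path length between two concepts."""
    if start == goal:
        return 0
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no path: the concepts are unrelated

def similarity(graph, a, b):
    """Map path length to a (0, 1] similarity score; unreachable pairs score 0."""
    d = shortest_path_length(graph, a, b)
    return 0.0 if d is None else 1.0 / (1.0 + d)
```

Ranking "on-topicness" would then average `similarity` between each essay concept and the prompt concepts, flagging the lowest-ranked essays as off-topic.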

A Semiotic Approach on the Political UCC Contents: Focused on Video UCC (정치적 UCC 콘텐츠에 대한 기호학적 연구: 동영상 UCC를 중심으로)

  • Mha, Joung-Mee;Kang, Ki-Ho
    • Korean journal of communication and information
    • /
    • v.46
    • /
    • pp.245-279
    • /
    • 2009
  • UCC, an abbreviation of User Created Contents, is both a symbol of desire and a creative product in which the producer's subjective disposition is embedded. UCC has increased significantly in the Web 2.0 environment, yet research on the contents as creative products has rarely been conducted. This may result from researchers' indifference toward specialized fields such as political contents, since UCC is usually produced by amateurs. Producers' various desires are rarely revealed through traditional channels, which leads users to flow into open media such as the Internet. UCC can also represent the properties of plural visual-language signs in a single field. Moreover, UCC has the attribute of re-mediation in effective communication, so the differences among the semiotic properties of Internet contents can be significant material for research and can contribute to establishing a theoretical system for visual communication. Therefore, this study analyzes the signification of a political video UCC. For the analysis, I apply Greimas' generative trajectory of signification to the text, i.e., the UCC, which is classified into three structures: deep structure, surface structure, and discourse structure. As a result, the text shows meaningful contents delivering core political messages. In addition, this approach suggests that the 'Obama Syndrome' in the recent American presidential campaign was driven by Web 2.0-based Internet campaigns, including video UCC.


Detecting Errors in POS-Tagged Corpus on XGBoost and Cross Validation (XGBoost와 교차검증을 이용한 품사부착말뭉치에서의 오류 탐지)

  • Choi, Min-Seok;Kim, Chang-Hyun;Park, Ho-Min;Cheon, Min-Ah;Yoon, Ho;Namgoong, Young;Kim, Jae-Kyun;Kim, Jae-Hoon
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.9 no.7
    • /
    • pp.221-228
    • /
    • 2020
  • A Part-of-Speech (POS) tagged corpus is a collection of electronic text in which each word is annotated with its corresponding POS tag, and it is widely used as training data for natural language processing. Such training data is generally assumed to be error-free, but in reality it includes various types of errors, which degrade the performance of systems trained on it. To alleviate this problem, we propose a novel method for detecting errors in an existing POS-tagged corpus using an XGBoost classifier and cross-validation. We first train a POS-tagging classifier on the POS-tagged corpus, errors included, and then look for errors in the corpus using cross-validation; since there is no training data labeled with POS-tagging errors, the classifier cannot detect them directly. We therefore detect errors by comparing the classifier's outputs (POS probabilities) against the annotated tags, adjusting hyperparameters. The hyperparameters are estimated on a small error-annotated corpus, sampled from the POS-tagged corpus and marked up with POS errors by experts. We use recall and precision, widely used in information retrieval, as evaluation metrics. Because not all detected errors can be checked manually, we validate the proposed method by comparing the distributions of the sample (the error-annotated corpus) and the population (the POS-tagged corpus). In the near future, we will apply the proposed method to a dependency-tree-tagged corpus and a semantic-role-tagged corpus.
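The cross-validation idea above can be sketched without any ML dependencies: train on k-1 folds, predict tags on the held-out fold, and flag tokens whose annotated tag disagrees with a confident prediction. A unigram frequency model stands in here for the paper's XGBoost classifier, and the corpus and threshold are illustrative assumptions.

```python
from collections import Counter, defaultdict

def train(pairs):
    """Count tag frequencies per word; a stand-in for the XGBoost classifier."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word][tag] += 1
    return counts

def flag_errors(corpus, k=3, threshold=0.8):
    """Return (word, annotated_tag, predicted_tag) for suspicious tokens."""
    flagged = []
    folds = [corpus[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        model = train([p for j, f in enumerate(folds) if j != i for p in f])
        for word, tag in held_out:
            dist = model.get(word)
            if not dist:
                continue  # unseen word: no evidence either way
            best, n = dist.most_common(1)[0]
            confidence = n / sum(dist.values())
            if best != tag and confidence >= threshold:
                flagged.append((word, tag, best))
    return flagged
```

In the paper's setting, `threshold` would be one of the hyperparameters tuned on the small expert-annotated error corpus.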

A BERGPT-chatbot for mitigating negative emotions

  • Song, Yun-Gyeong;Jung, Kyung-Min;Lee, Hyun
    • Journal of the Korea Society of Computer and Information
    • /
    • v.26 no.12
    • /
    • pp.53-59
    • /
    • 2021
  • In this paper, we propose BERGPT-chatbot, a Korean AI chatbot that, like 'Replika', can alleviate negative emotions based on text input. We built BERGPT-chatbot by pipelining two models, KR-BERT and KoGPT2-chatbot: KR-BERT assigns emotion labels to unrefined everyday datasets, and KoGPT2-chatbot is then trained on these additional datasets. The development background of BERGPT-chatbot is as follows. The number of people with depression is increasing all over the world, a problem made more serious by COVID-19, which forces people into long-term indoor living and limits interpersonal relationships. Overseas, the use of AI chatbots aimed at relieving negative emotions or providing mental health care has increased due to the pandemic. Similar psychological-diagnosis chatbots are operated in Korea; however, because they output button-based rather than text-input-based answers, they remain at a lower level of diagnosing human psychology than their overseas counterparts. We therefore propose a chatbot that helps mitigate negative emotions. Finally, we compared BERGPT-chatbot and KoGPT2-chatbot using 'perplexity', an intrinsic evaluation metric for language models, and showed the superiority of BERGPT-chatbot.
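Perplexity, the metric used for the comparison above, is the exponential of the average negative log-probability a model assigns to each token; lower is better. A minimal sketch, with made-up probabilities standing in for a chatbot language model's outputs:

```python
import math

def perplexity(token_probs):
    """Perplexity from per-token probabilities: exp(mean negative log-prob)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns each token probability 0.25 has perplexity 4:
# it is, on average, as uncertain as a uniform choice among 4 tokens.
```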

Study on the Use of Objectification Strategy in Academic Writing (학술적 글쓰기에서의 객관화 전략 사용 양상 연구 - 한국어 학습자와 한국어 모어 화자 간의 비교를 중심으로 -)

  • Kim, Han-saem;Bae, Mi-yeon
    • Cross-Cultural Studies
    • /
    • v.49
    • /
    • pp.95-126
    • /
    • 2017
  • The purpose of this paper is to compare learners' academic texts with those of native speakers and to examine learners' use of objectification strategies in detail. To achieve objectivity as a discourse mechanism for describing the results of academic inquiry scientifically, with universality and validity, we analyzed concepts and markers related to intentionality, accuracy, and mitigation among the linguistic markers of objectification strategies. The comparison shows that some markers revealing objectivity overlap with markers of related mechanisms, while another set is differentiated. Objectivity markers can be broadly classified into those emphasizing the stativity of research results, those separating the research subject from the research results, and those generalizing the research contents. Stative expressions and noun phrases emphasize stativity; impersonal expressions, passive expressions, and self-quotation maintain distance between the claim and the writer; and pluralization through first-person pronouns and suffixes contributes to generalization. Learners used impersonal quotation expressions far less than native speakers, which may be due to a lack of familiarity with the citation conventions of Korean academic texts. Likewise, in generalizing research contents, learners used the expression 'we' far less than native speakers.

Analysis of the Types of News Stories on the Online Broadcast -Focusing upon the Broadcasting Websites of NAVER Newsstand- (온라인 방송의 뉴스기사 유형에 대한 분석 -네이버 뉴스스탠드의 방송사 홈페이지를 중심으로-)

  • Park, Kwang Soon
    • Journal of Digital Convergence
    • /
    • v.19 no.3
    • /
    • pp.177-185
    • /
    • 2021
  • This paper aimed to identify the distribution of news story types in online broadcasting by analyzing the news stories of nine broadcasting websites on the NAVER newsstand. For the analysis, a total of 270 days' worth of samples was selected, 30 days for each of the nine websites. One-way ANOVA was used to examine the differences among the broadcasting websites. The analysis focused on the type of news story by language composition, the genre type of the story, and so on. The results show that while all programs in offline broadcasting are produced and transmitted as video stories, half of those in online broadcasting consist of stories composed of photos and text. Online newspapers have been producing new types of news stories using video and computer graphics, while online broadcasters have actively been using photo-and-text stories, a traditional newspaper format. These results suggest that the boundaries among media are becoming increasingly indistinct in the online media environment, with broadcast story types becoming old-fashioned.
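The one-way ANOVA used above reduces to an F statistic: between-group variance over within-group variance. A dependency-free sketch, with illustrative numbers rather than the paper's data:

```python
def f_statistic(groups):
    """One-way ANOVA F statistic for a list of numeric samples (one per group)."""
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total observations
    grand = sum(sum(g) for g in groups) / n  # grand mean
    # Between-group sum of squares: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F (relative to the F distribution with k-1 and n-k degrees of freedom) indicates that story-type proportions differ significantly across websites.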

KOMUChat: Korean Online Community Dialogue Dataset for AI Learning (KOMUChat : 인공지능 학습을 위한 온라인 커뮤니티 대화 데이터셋 연구)

  • YongSang Yoo;MinHwa Jung;SeungMin Lee;Min Song
    • Journal of Intelligence and Information Systems
    • /
    • v.29 no.2
    • /
    • pp.219-240
    • /
    • 2023
  • Conversational AI that users can interact with satisfyingly is a long-standing research topic. Developing conversational AI requires training data that reflects real conversations between people, but existing Korean datasets are either not in question-answer format or use honorifics, making it difficult for users to feel closeness. In this paper, we propose a conversation dataset (KOMUChat) consisting of 30,767 question-answer sentence pairs collected from online communities. The pairs were collected from the post titles and first comments of love and relationship counseling boards used by men and women. In addition, we removed abusive records through automatic and manual cleansing to build a high-quality dataset. To verify the validity of KOMUChat, we compared the outputs of generative language models trained on KOMUChat and on a benchmark dataset. The results showed that our dataset outperformed the benchmark in answer appropriateness, user satisfaction, and fulfillment of conversational AI goals. KOMUChat is the largest open-source single-turn Korean text dataset presented so far, and it is significant as a more friendly Korean dataset reflecting the text style of online communities.
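The dataset-building step described above can be sketched as pairing each post title (question) with its first comment (answer) and dropping pairs that trip an abuse filter. The abuse lexicon and sample posts below are illustrative assumptions, not the paper's actual cleansing rules.

```python
import re

# Tiny stand-in abuse lexicon; the real pipeline would use a larger list
# plus manual review.
ABUSE = re.compile(r"\b(idiot|stupid)\b", re.IGNORECASE)

def build_pairs(posts):
    """posts: list of dicts with 'title' and ordered 'comments'.
    Returns (question, answer) pairs from title + first comment."""
    pairs = []
    for post in posts:
        if not post["comments"]:
            continue  # no answer available
        q, a = post["title"].strip(), post["comments"][0].strip()
        if ABUSE.search(q) or ABUSE.search(a):
            continue  # automatic cleansing; manual review would follow
        pairs.append((q, a))
    return pairs
```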

Quantification of Schedule Delay Risk of Rain via Text Mining of a Construction Log (공사일지의 텍스트 마이닝을 통한 우천 공기지연 리스크 정량화)

  • Park, Jongho;Cho, Mingeon;Eom, Sae Ho;Park, Sun-Kyu
    • KSCE Journal of Civil and Environmental Engineering Research
    • /
    • v.43 no.1
    • /
    • pp.109-117
    • /
    • 2023
  • Schedule delays are a major risk factor: they can increase construction costs, invite claims from the client, and reduce construction quality when stages are trimmed to catch up on lost time. Risk management has been conducted according to the importance and priority of schedule-delay risks, but quantifying the depth of each delay has tended to be inadequate due to limitations in data collection. This research therefore used the BERT (Bidirectional Encoder Representations from Transformers) language model to convert the contents of construction logs, which are unstructured data, into WBS (Work Breakdown Structure)-based structured data, and built a model for classifying and quantifying risk. The process was applied to eight highway construction sites, and 75 cases of rain-related schedule-delay risk were obtained from 8 of 39 detailed work kinds. Through a K-S test, a significant probability distribution was derived for four kinds of work, and their risk impacts were compared. The process presented in this study can be used to derive various schedule-delay risks in construction projects and to quantify their depth.
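The classification-and-counting step above can be sketched with simple keyword rules mapping free-text log lines to WBS work kinds; the paper fine-tunes BERT for this, so the rules, work kinds, and log lines here are illustrative stand-ins only.

```python
# Hypothetical WBS work kinds and trigger keywords (not the paper's taxonomy).
WBS_KEYWORDS = {
    "earthwork": ["excavation", "embankment"],
    "paving":    ["asphalt", "concrete paving"],
}

def classify(line):
    """Assign a log line to a WBS work kind by keyword match."""
    text = line.lower()
    for work, keys in WBS_KEYWORDS.items():
        if any(k in text for k in keys):
            return work
    return "unclassified"

def rain_delay_days(log_lines):
    """Count lines reporting rain stoppage, grouped by WBS work kind."""
    counts = {}
    for line in log_lines:
        if "rain" in line.lower():
            work = classify(line)
            counts[work] = counts.get(work, 0) + 1
    return counts
```

The per-work-kind counts would then be fit to probability distributions and screened with a K-S test, as in the paper.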

AI-based stuttering automatic classification method: Using a convolutional neural network (인공지능 기반의 말더듬 자동분류 방법: 합성곱신경망(CNN) 활용)

  • Jin Park;Chang Gyun Lee
    • Phonetics and Speech Sciences
    • /
    • v.15 no.4
    • /
    • pp.71-80
    • /
    • 2023
  • This study aimed to develop an automated stuttering identification and classification method using artificial intelligence, in particular a deep-learning identification model based on the convolutional neural network (CNN) algorithm for Korean speakers who stutter. Speech data were collected from 9 adults who stutter and 9 normally fluent speakers. The data were automatically segmented at the phrasal level using Google Cloud speech-to-text (STT), and labels such as 'fluent', 'blockage', 'prolongation', and 'repetition' were assigned to them. Mel-frequency cepstral coefficients (MFCCs) and a CNN-based classifier were used to detect and classify each type of stuttered disfluency. However, only five instances of prolongation were found, so this type was excluded from the classifier model. The accuracy of the CNN classifier was 0.96, and the per-class F1-scores were as follows: 'fluent' 1.00, 'blockage' 0.67, and 'repetition' 0.74. Although the automatic classifier was validated using CNNs to detect stuttered disfluencies, its performance was inadequate, especially for the blockage and prolongation types. Consequently, establishing a large speech database that collects data by type of stuttered disfluency was identified as a necessary foundation for improving classification performance.
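The per-class F1 evaluation reported above can be sketched directly from prediction/label pairs; the label sequences in the test are illustrative, not the study's data.

```python
def f1_per_class(y_true, y_pred):
    """Per-class F1 from parallel lists of true and predicted labels.
    F1 = 2*TP / (2*TP + FP + FN) for each class."""
    scores = {}
    for label in set(y_true) | set(y_pred):
        tp = sum(t == label == p for t, p in zip(y_true, y_pred))
        fp = sum(p == label != t for t, p in zip(y_true, y_pred))
        fn = sum(t == label != p for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores
```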

A Systematic Review on the Present Condition of the Internal Robot Therapy (국내 로봇치료 연구 현황에 대한 체계적 고찰)

  • Song, Ji-Hyeon;Sim, Eun-Ji;Yom, Ji-Yun;Oh, Min-Kyeong;Yi, Hu-Shin;Yoo, Doo-Han
    • The Journal of Korean society of community based occupational therapy
    • /
    • v.6 no.1
    • /
    • pp.49-60
    • /
    • 2016
  • Objective : By systematically organizing, according to PICO (Patient, Intervention, Comparison, Outcome), the studies that use robot therapy as an intervention tool, this study aims to investigate the present condition of robot therapy in Korea. Methods : We searched 710 domestic journal articles and master's theses from the past nine years in the 'Research Information Sharing Service' and 'National Digital Science Library' databases using the keyword 'robot therapy'. We finally chose 15 journal articles and master's theses from among the domestic studies for which the full text was available and which used a robot as a therapeutic intervention tool. The chosen studies were laid out according to PICO to organize the resources systematically. Results : Study quality was assessed with a five-level evidence-based classification; 13 studies were at level three or higher. Divided by intervention field, the studies using robot therapy covered five areas: language, lower extremity (gait), cognition, development, and upper extremity. Conclusion : Domestically, robot therapy has been used in various areas, including upper- and lower-extremity intervention as well as language, cognition, and development. We hope this study will be utilized as baseline data in the various areas of domestic robot therapy.