• Title/Summary/Keyword: Korean text classification

A New Approach to Automatic Keyword Generation Using Inverse Vector Space Model (키워드 자동 생성에 대한 새로운 접근법: 역 벡터공간모델을 이용한 키워드 할당 방법)

  • Cho, Won-Chin;Rho, Sang-Kyu;Yun, Ji-Young Agnes;Park, Jin-Soo
    • Asia pacific journal of information systems / v.21 no.1 / pp.103-122 / 2011
  • Recently, numerous documents have been made available electronically. Internet search engines and digital libraries commonly return query results containing hundreds or even thousands of documents. In this situation, it is virtually impossible for users to examine complete documents to determine whether they might be useful. For this reason, some on-line documents are accompanied by a list of keywords specified by the authors in an effort to guide users by facilitating the filtering process. In this way, a set of keywords is often considered a condensed version of the whole document and therefore plays an important role in document retrieval, Web page retrieval, document clustering, summarization, text mining, and so on. Since many academic journals ask authors to provide a list of five or six keywords on the first page of an article, keywords are most familiar in the context of journal articles. However, many other types of documents that could benefit from keywords, including Web pages, email messages, news reports, magazine articles, and business papers, do not have them. Although the potential benefit is large, the implementation itself is the obstacle: manually assigning keywords to all documents is a daunting, even impractical, task, since it is extremely tedious and time-consuming and requires a certain level of domain knowledge. Therefore, it is highly desirable to automate the keyword generation process. There are mainly two approaches to achieving this aim: the keyword assignment approach and the keyword extraction approach. Both approaches use machine learning methods and require, for training purposes, a set of documents with keywords already attached. In the former approach, there is a given vocabulary, and the aim is to match its entries to the texts; in other words, the keyword assignment approach seeks to select the words from a controlled vocabulary that best describe a document. Although this approach is domain dependent and not easy to transfer and expand, it can generate implicit keywords that do not appear in a document. In the latter approach, on the other hand, the aim is to extract keywords with respect to their relevance in the text without a prior vocabulary. In this approach, automatic keyword generation is treated as a classification task, and keywords are commonly extracted based on supervised learning techniques; keyword extraction algorithms classify candidate keywords in a document into positive or negative examples. Several systems such as Extractor and Kea were developed using the keyword extraction approach. The most indicative words in a document are selected as its keywords, so keyword extraction is limited to terms that appear in the document and cannot generate implicit keywords that are not included in it. According to the experimental results of Turney, about 64% to 90% of keywords assigned by authors can be found in the full text of an article. Conversely, this means that 10% to 36% of author-assigned keywords do not appear in the article and cannot be generated through keyword extraction algorithms. Our preliminary experiment also shows that 37% of author-assigned keywords are not included in the full text. This is the reason why we have decided to adopt the keyword assignment approach. In this paper, we propose a new approach for automatic keyword assignment, namely IVSM (Inverse Vector Space Model). The model is based on the vector space model,
which is a conventional information retrieval model that represents documents and queries as vectors in a multidimensional space. IVSM generates an appropriate keyword set for a specific document by measuring the distance between the document and the keyword sets. The keyword assignment process of IVSM is as follows: (1) calculating the vector length of each keyword set based on each keyword weight; (2) preprocessing and parsing a target document that does not have keywords; (3) calculating the vector length of the target document based on term frequency; (4) measuring the cosine similarity between each keyword set and the target document; and (5) generating keywords that have high similarity scores. Two keyword generation systems were implemented using IVSM: an IVSM system for a Web-based community service and a stand-alone IVSM system. The first was implemented in a community service for sharing knowledge and opinions on current trends such as fashion, movies, social problems, and health information. The stand-alone IVSM system is dedicated to generating keywords for academic papers and has been tested on a number of papers, including those published by the Korean Association of Shipping and Logistics, the Korea Research Academy of Distribution Information, the Korea Logistics Society, the Korea Logistics Research Association, and the Korea Port Economic Association. We measured the performance of IVSM by the number of matches between the IVSM-generated keywords and the author-assigned keywords. According to our experiments, the precision of IVSM applied to the Web-based community service and to academic journals was 0.75 and 0.71, respectively. The performance of both systems is much better than that of baseline systems that generate keywords based on simple probability. IVSM also shows performance comparable to Extractor, a representative keyword extraction system developed by Turney. As the number of electronic documents increases, we expect that the IVSM proposed in this paper can be applied to many electronic documents in Web-based communities and digital libraries.
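The five-step assignment process above amounts to scoring every candidate keyword set against the target document with cosine similarity over term-weight vectors. The following is a minimal sketch of that idea only, not the authors' IVSM implementation; the keyword sets, weights, and top-n cutoff are illustrative assumptions.

```python
# Minimal sketch of the cosine-similarity scoring behind steps (1)-(5) above.
# Not the authors' IVSM code; keyword sets, weights, and data are placeholders.
import math
from collections import Counter

keyword_sets = {
    "logistics": {"port": 0.9, "shipping": 0.8, "cargo": 0.6},
    "retail":    {"distribution": 0.9, "store": 0.7, "consumer": 0.6},
}

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))   # steps (1)/(3): vector lengths
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def assign_keywords(document_tokens: list[str], top_n: int = 1) -> list[str]:
    """Build a term-frequency vector for the document and rank keyword sets."""
    doc_vector = dict(Counter(document_tokens))       # step (3): term frequencies
    ranked = sorted(keyword_sets,
                    key=lambda k: cosine(doc_vector, keyword_sets[k]),  # step (4)
                    reverse=True)
    return ranked[:top_n]                             # step (5): highest-scoring sets

print(assign_keywords("port cargo shipping delay port".split()))  # -> ['logistics']
```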

Comparative Analysis of 4-gram Word Clusters in South vs. North Korean High School English Textbooks (남북한 고등학교 영어교과서 4-gram 연어 비교 분석)

  • Kim, Jeong-ryeol
    • The Journal of the Korea Contents Association / v.20 no.7 / pp.274-281 / 2020
  • N-gram analysis casts a new look at word clusters in use, distinct from previously known idioms: it mechanically analyzes a corpus of English textbooks for frequently occurring sequences of n consecutive words using concordance software. The current paper aims at extracting and comparing 4-gram word clusters between South Korean high school English textbooks and their North Korean counterparts. The classification criteria include the number of tokens and types between the two across oral and written language in the textbooks. The criteria also use grammatical and functional categories to classify and compare the 4-gram word clusters. The grammatical categories include noun phrases, verb phrases, prepositional phrases, partial clauses, and others. The functional categories include deictic function, text organizers, stance, and others. The findings are as follows: the South Korean high school English textbooks contain more tokens and types in both oral and written language; verb phrase and partial clause 4-grams are the most frequently encountered grammatical categories across both South and North Korean high school English textbooks; and stance is the most dominant functional category in both.
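The token and type counts described above come from sliding a four-word window over a tokenized corpus and counting the distinct clusters. A minimal sketch of that extraction step, using placeholder sentences rather than the textbook corpus, might look like this:

```python
# Minimal sketch of 4-gram word-cluster extraction (not the authors' concordance setup).
from collections import Counter

def four_grams(tokens):
    """Yield every run of 4 consecutive tokens as a tuple."""
    return zip(tokens, tokens[1:], tokens[2:], tokens[3:])

# Placeholder sentences standing in for the tokenized textbook corpus.
corpus = [
    "would you like to go to the library".split(),
    "would you like to join us".split(),
]

counts = Counter(g for sentence in corpus for g in four_grams(sentence))

# tokens = total 4-gram occurrences; types = number of distinct 4-grams
print(sum(counts.values()), len(counts))
print(counts.most_common(3))   # ('would', 'you', 'like', 'to') appears twice here
```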

Study of Rhetorical Puns in Korean Comic Strips in Daily Newspaper (한국 신문만화의 언어유희적 기법 연구)

  • Kim, Eul-Ho
    • Cartoon and Animation Studies / s.10 / pp.1-16 / 2006
  • This thesis aims to recall the importance of language in comics by studying comic strips in Korean daily newspapers: the comic strips are analyzed for rhetorical puns in their language text, as they representatively show the value and role of language in comics. Moreover, as Korean comic strips developed into current-affairs comics, they acquired a stronger media character of communicating information compared to other genres of cartoons. As a result, comic strips have become a genre where language plays an important role and where words need to convey meaning quickly and implicitly. Due to tight control by the national authorities, the linguistic technique developed toward indirect expression rather than stronger, more direct imagery. The political oppression of the comic strip paradoxically brought about the rhetorical development of its creative techniques. Based on this analysis, the writer studied the rhetorical puns in the texts of Korean comic strips by applying classification schemes for rhetorical expressions. As a result, through quotations and analysis of actual comic strips, the writer confirmed that Korean comic strips do indeed show a tremendously wide range of rhetorical puns in their use of language. The writer was also able to conclude that the rhetorical puns in comics are the force that entertains and impresses readers, and that they act as a creative principle. Concluding this study, the writer emphasizes that language is an important factor not only in comic strips, which combine words and images, but in all cartoons in general. Thus the thesis proposes that training in humanistic thought and linguistic sensitivity is as important as learning to draw in the creation of cartoons.

Korean Word Sense Disambiguation using Dictionary and Corpus (사전과 말뭉치를 이용한 한국어 단어 중의성 해소)

  • Jeong, Hanjo;Park, Byeonghwa
    • Journal of Intelligence and Information Systems / v.21 no.1 / pp.1-13 / 2015
  • As opinion mining in big data applications has been highlighted, a lot of research on unstructured data has been conducted. Social media on the Internet generate unstructured or semi-structured data every second, and these data are often written in the natural human languages we use in daily life. Many words in human languages have multiple meanings or senses. As a result, it is very difficult for computers to extract useful information from these datasets. Traditional web search engines are usually based on keyword search, resulting in incorrect search results that are far from users' intentions. Even though a lot of progress has been made over recent years in enhancing the performance of search engines to provide users with appropriate results, there is still much room for improvement. Word sense disambiguation can play a very important role in natural language processing and is considered one of the most difficult problems in this area. Major approaches to word sense disambiguation can be classified as knowledge-based, supervised corpus-based, and unsupervised corpus-based approaches. This paper presents a method that automatically generates a corpus for word sense disambiguation by taking advantage of the examples in existing dictionaries, avoiding expensive sense-tagging processes. It evaluates the effectiveness of the method based on the Naïve Bayes model, one of the supervised learning algorithms, using the Korean standard unabridged dictionary and the Sejong Corpus. The Korean standard unabridged dictionary has approximately 57,000 sentences; the Sejong Corpus has about 790,000 sentences tagged with both part-of-speech and sense information. For the experiments in this study, the Korean standard unabridged dictionary and the Sejong Corpus were evaluated both as a combination and as separate entities using cross-validation. Only nouns, the target subjects in word sense disambiguation, were selected. 93,522 word senses among 265,655 nouns and 56,914 sentences from related proverbs and examples were additionally combined into the corpus. The Sejong Corpus was easily merged with the Korean standard unabridged dictionary because the Sejong Corpus was tagged based on the sense indices defined by the dictionary. Sense vectors were formed after the merged corpus was created. Terms used in creating the sense vectors were added to the named-entity dictionary of the Korean morphological analyzer. Using the extended named-entity dictionary, term vectors were extracted from the input sentences. Given an extracted term vector and the sense vector model built during the pre-processing stage, the sense-tagged terms were determined by vector space model-based word sense disambiguation. In addition, this study shows the effectiveness of the corpus merged from the examples in the Korean standard unabridged dictionary and the Sejong Corpus: the experiments show that better precision and recall are achieved with the merged corpus. This study suggests that the approach can practically enhance the performance of Internet search engines and help capture the meaning of a sentence more accurately in natural language processing tasks pertinent to search engines, opinion mining, and text mining. The Naïve Bayes classifier used in this study is a supervised learning algorithm based on Bayes' theorem; it assumes that all attributes are independent given the sense.
Even though this assumption is not realistic and ignores correlations between attributes, the Naïve Bayes classifier is widely used because of its simplicity, and in practice it is known to be very effective in many applications such as text classification and medical diagnosis. However, further research needs to be carried out to consider all possible combinations and/or partial combinations of the senses in a sentence. Also, the effectiveness of word sense disambiguation may be improved if rhetorical structures or morphological dependencies between words are analyzed through syntactic analysis.
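As a concrete illustration of the supervised setup described above, here is a minimal Naïve Bayes sense classifier over bag-of-words contexts. It is only a sketch with toy data, not the authors' system, which builds sense vectors from dictionary examples and the Sejong Corpus.

```python
# Minimal sketch of Naïve Bayes word sense disambiguation (not the authors' code).
# Each sense has a bag-of-words model built from example sentences; a new context
# is assigned the sense with the highest posterior. Data and smoothing are toy choices.
import math
from collections import Counter

# sense id -> context words gathered from example sentences (illustrative only)
training = {
    "bank_1": ["river", "water", "shore", "fish"],
    "bank_2": ["money", "loan", "deposit", "interest"],
}

priors = {s: 1 / len(training) for s in training}          # uniform sense priors
vocab = {w for words in training.values() for w in words}
counts = {s: Counter(words) for s, words in training.items()}

def disambiguate(context_words):
    """Return argmax_s [ log P(s) + sum_w log P(w | s) ] with add-one smoothing."""
    def score(sense):
        total = sum(counts[sense].values())
        return math.log(priors[sense]) + sum(
            math.log((counts[sense][w] + 1) / (total + len(vocab)))
            for w in context_words if w in vocab)
    return max(training, key=score)

print(disambiguate(["deposit", "interest", "rate"]))   # -> "bank_2"
```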

Community Structure and Floristic Composition of Cymbidium goeringii Group in Korean Islets (한반도 도서지역 춘란집단의 종조성과 군락구조)

  • Song, Hong-Seon;Park, Yong-Jin
    • FLOWER RESEARCH JOURNAL / v.18 no.2 / pp.110-116 / 2010
  • This study analyzed and investigated the vegetation and floristic composition by ordination and classification using the phytosociological method, in order to evaluate the species composition and community structure of Cymbidium goeringii groups on Korean islets. Across the 33 habitat plots, the mean altitude was 65.9 m, the predominant aspect was a southeast-facing slope, and the mean slope was 7.9%. The coverage of Cymbidium goeringii was 4.5%. A total of 102 taxa appeared together with Cymbidium goeringii: trees 68 taxa (66.7%), herbs 34 taxa (33.3%), evergreen plants 36 taxa (35.3%), and deciduous plants 66 taxa (64.7%). The appearance frequency was highest for Eurya japonica (48.5%), followed by Pinus thunbergii (45.5%), Smilax china (36.4%), Carex lanceolata (33.3%), Hedera rhombea (33.3%), Machilus thunbergii (30.0%), Styrax japonicus (30.3%), and Pinus densiflora (27.3%). The tree-layer vegetation of the Cymbidium goeringii group was classified into Pinus thunbergii, Pinus densiflora, Castanopsis sieboldii, and Quercus variabilis communities. The Pinus densiflora community showed a strong association with the Cymbidium goeringii group on Korean islets. Among the communities, the Pinus thunbergii community was associated with the Castanopsis sieboldii community, and the Pinus densiflora and Quercus variabilis communities were associated with each other.

A School-tailored High School Integrated Science Q&A Chatbot with Sentence-BERT: Development and One-Year Usage Analysis (인공지능 문장 분류 모델 Sentence-BERT 기반 학교 맞춤형 고등학교 통합과학 질문-답변 챗봇 -개발 및 1년간 사용 분석-)

  • Gyeongmo Min;Junehee Yoo
    • Journal of The Korean Association For Science Education / v.44 no.3 / pp.231-248 / 2024
  • This study developed a chatbot for first-year high school students, employing open-source software and the Korean Sentence-BERT model for AI-powered document classification. The chatbot uses the Sentence-BERT model to find the six Q&A pairs most similar to a student's query and presents them in a carousel format. The initial dataset, built from online resources, was refined and expanded based on student feedback and usability throughout the operational period. By the end of the 2023 academic year, the chatbot had integrated a total of 30,819 data entries and recorded 3,457 student interactions. Analysis revealed students' inclination to use the chatbot when prompted by teachers during classes and primarily during self-study sessions after school, with an average of 2.1 to 2.2 inquiries per session, mostly via mobile phones. Text mining identified student input terms encompassing not only science-related queries but also aspects of school life such as assessment scope. Topic modeling using BERTopic, based on Sentence-BERT, categorized 88% of student questions into 35 topics, shedding light on common student interests. A year-end survey confirmed the efficacy of the carousel format and the chatbot's role in addressing curiosities beyond the integrated science learning objectives. This study underscores the importance of developing chatbots tailored for student use in public education and highlights their educational potential through long-term usage analysis.
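The retrieval step described above (embed the stored questions, embed the incoming query, return the six nearest pairs by cosine similarity) can be sketched with the sentence-transformers library. This is not the authors' code; the checkpoint name jhgan/ko-sroberta-multitask and the toy Q&A data are placeholder assumptions.

```python
# Minimal sketch of Sentence-BERT based Q&A retrieval (not the authors' system).
# Assumes the sentence-transformers library; the model name and data are placeholders.
from sentence_transformers import SentenceTransformer, util

qa_pairs = [
    ("What is the difference between speed and velocity?",
     "Speed has only magnitude; velocity has both magnitude and direction."),
    # ... the deployed system holds roughly 30,000 curated Q&A pairs
]

model = SentenceTransformer("jhgan/ko-sroberta-multitask")  # placeholder Korean SBERT
question_embeddings = model.encode([q for q, _ in qa_pairs], convert_to_tensor=True)

def top_k_answers(student_query: str, k: int = 6):
    """Return the k stored Q&A pairs most similar to the student's query."""
    query_embedding = model.encode(student_query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, question_embeddings)[0]
    best = scores.topk(k=min(k, len(qa_pairs)))
    return [(float(s), qa_pairs[int(i)]) for s, i in zip(best.values, best.indices)]
```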

A Systematic Review on the Present Condition of the Internal Robot Therapy (국내 로봇치료 연구 현황에 대한 체계적 고찰)

  • Song, Ji-Hyeon;Sim, Eun-Ji;Yom, Ji-Yun;Oh, Min-Kyeong;Yi, Hu-Shin;Yoo, Doo-Han
    • The Journal of Korean society of community based occupational therapy / v.6 no.1 / pp.49-60 / 2016
  • Objective: By systematically organizing studies that used robot therapy as an intervention tool according to PICO (Patient, Intervention, Comparison, Outcome), this study aims to investigate the present condition of domestic robot therapy. Methods: We searched 710 domestic journal articles and master's theses from the past nine years in the 'Research Information Sharing Service' and 'National Digital Science Library' databases using the keyword 'robot therapy'. We finally selected 15 journal articles and master's theses among the domestic studies for which the full text was available and which used a robot as a therapeutic intervention tool. The chosen studies were laid out according to PICO so that the material could be organized systematically. Results: Study quality was assessed with a five-level evidence-based classification; 13 studies were at level three or higher. When the robot therapy studies were divided by intervention field, research is advancing in five areas: language, the lower extremity (gait), cognition, development, and the upper extremity. Conclusion: Domestically, robot therapy has been used in various areas, including interventions for the upper and lower extremities, language, cognition, development, and others. We hope that this study will be utilized as baseline data in the various areas involved in domestic robot therapy.

Topic Modeling based Interdisciplinarity Measurement in the Informatics Related Journals (토픽 모델링 기반 정보학 분야 학술지의 학제성 측정 연구)

  • Jin, Seol A;Song, Min
    • Journal of the Korean Society for information Management / v.33 no.1 / pp.7-32 / 2016
  • This study measured interdisciplinarity using topic modeling, which automatically extracts sub-topics from the term information appearing in a group of documents, unlike the traditional top-down approach that relies on references and classification systems. We used the titles and abstracts of articles published over the past five years in the top 20 journals by 5-year impact factor under the 'Information & Library Science' category of JCR 2013. We applied 'discipline diversity' and 'network coherence' as factors in measuring interdisciplinarity: the Shannon entropy index and the Stirling diversity index were used to gauge the diversity of fields, while the topic network's average path length was employed as an index of network cohesion. After classifying the types of interdisciplinarity with the resulting diversity and cohesion indices, we compared the topic networks of journals representing each type. As a result, we found that the text-based diversity index produced a different ranking than the reference-based diversity index, which signifies that the two indices can be used complementarily. It was also confirmed that the characteristics and interconnectedness of the sub-topics dealt with in each journal can be intuitively understood through topic networks classified by considering both diversity and cohesion. In conclusion, the topic modeling-based measurement of interdisciplinarity proposed in this study was confirmed to be applicable and to serve multiple roles in showing the interdisciplinarity of journals.
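For reference, the two diversity measures named above can be computed from a journal's topic-proportion vector roughly as follows. This is a generic sketch, not the authors' code; it assumes the Stirling index is the usual Rao-Stirling form and that a topic disparity matrix (e.g. one minus topic-to-topic similarity) is available.

```python
# Minimal sketch of the Shannon entropy and (Rao-)Stirling diversity indices
# over a journal's topic proportions. Values and the disparity matrix are toy data.
import math

def shannon_entropy(p):
    """H = -sum_i p_i * ln(p_i) over a probability vector; higher = more even spread."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def rao_stirling_diversity(p, d):
    """sum over i != j of d[i][j] * p[i] * p[j]; also weighs how distant the topics are."""
    n = len(p)
    return sum(d[i][j] * p[i] * p[j] for i in range(n) for j in range(n) if i != j)

# Example: a journal whose articles fall into three sub-topics.
proportions = [0.5, 0.3, 0.2]
disparity = [[0.0, 0.4, 0.9],
             [0.4, 0.0, 0.7],
             [0.9, 0.7, 0.0]]

print(shannon_entropy(proportions))
print(rao_stirling_diversity(proportions, disparity))
```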

Web Site Keyword Selection Method by Considering Semantic Similarity Based on Word2Vec (Word2Vec 기반의 의미적 유사도를 고려한 웹사이트 키워드 선택 기법)

  • Lee, Donghun;Kim, Kwanho
    • The Journal of Society for e-Business Studies / v.23 no.2 / pp.83-96 / 2018
  • Extracting keywords that represent documents is very important because keywords can be used for automated services such as document search, classification, and recommendation systems, as well as for quickly conveying document information. However, when keywords are extracted based on the frequency of words appearing in web site documents, or with graph algorithms based on word co-occurrence, the web page structure potentially introduces various words that are not related to the topic, and it is difficult to extract semantic keywords because of the limited performance of Korean tokenizers. In this paper, we propose a method that selects candidate keywords based on semantic similarity, addressing the problems that semantic keywords cannot otherwise be extracted and that Korean tokenizer analysis is inaccurate. Finally, final semantic keywords are extracted through a filtering process that removes inconsistent keywords. Experimental results on real web pages of small businesses show that the performance of the proposed method improves by 34.52% over a statistical similarity-based keyword selection technique. Therefore, it is confirmed that the performance of extracting keywords from documents is improved by considering semantic similarity between words and removing inconsistent keywords.
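A Word2Vec-based candidate filter of the kind described above can be sketched with gensim: words whose embeddings are semantically distant from the page's dominant terms are treated as off-topic and removed. This is only an illustration, not the paper's method; the toy documents, the scoring heuristic, and the 0.3 threshold are assumptions.

```python
# Minimal sketch of Word2Vec-based keyword candidate filtering (not the paper's method).
# Candidates with low average similarity to the site's most frequent terms are dropped.
from collections import Counter
from gensim.models import Word2Vec

# Placeholder tokenized documents standing in for a small-business web site.
tokenized_docs = [
    ["handmade", "leather", "wallet", "craft", "shop"],
    ["leather", "bag", "custom", "order", "craft"],
    ["wallet", "repair", "service", "leather"],
]

model = Word2Vec(tokenized_docs, vector_size=50, window=3, min_count=1, epochs=50)

freq = Counter(tok for doc in tokenized_docs for tok in doc)
top_terms = [w for w, _ in freq.most_common(5)]

def semantic_score(candidate: str) -> float:
    """Average cosine similarity between a candidate and the site's top terms."""
    others = [t for t in top_terms if t != candidate]
    return sum(model.wv.similarity(candidate, t) for t in others) / len(others)

# Illustrative threshold: drop semantically isolated (likely off-topic) words.
keywords = [w for w in freq if semantic_score(w) > 0.3]
print(keywords)
```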

The Analysis of Inquiry Scopes in High School General Science Textbook Based on the 6th Curriculum - Emphasizing the Analysis of Inquiry Experiment - (제 6차 교육과정에 따른 고등학교 공통과학 교과서의 탐구영역 분석 - 탐구 실험을 중심으로 -)

  • Park, Won-Hyuck;Kim, Eun-A
    • Journal of The Korean Association For Science Education / v.19 no.4 / pp.528-541 / 1999
  • In order to obtain data for developing an ideal science curriculum, four kinds of General Science textbooks based on the 6th curriculum were analyzed. In particular, the inquiry activities were analyzed with the Scientific Inquiry Evaluation Inventory (SIEI). The results are as follows: 1) The average number of inquiry activities in the four textbooks is 115.5, and the number in each textbook varies widely: textbook A contains 94 inquiry activities, textbook B 147, textbook C 100, and textbook D 121. 2) As for the number of inquiry activity scopes in the four textbooks, observation comes to 22, experiment 117, interpreting data 196, investigation 64, discussion 51, classification 4, and prediction 8; conceptual inquiry activities are about 2.3 times as numerous as inquiry experiments. 3) According to the analysis of each inquiry task by SIEI, textbook A has 268, textbook B 328, textbook C 207, and textbook D 304. 4) In the analysis of the structure of the inquiry activities, the competition and cooperation scale shows more emphasis on common tasks with no pooled results (87.1%). The discussion scale mostly consists of activities with no discussion required among students (83.5%). The openness scale shows more emphasis on activities in which problems, procedures, and answers are presented (58.3%). On the inquiry scope scale, the activities mostly demonstrate or verify the contents of the text (66.9%). 5) In the analysis of the inquiry activities as a whole, the inquiry pyramid of the four General Science textbooks shows type I, which emphasizes low-level inquiry activities such as gathering and organizing data. The inquiry index of the four textbooks averages 47.8, a very high level (above 35).
