Search | Korea Science

A System for Automatic Classification of Traditional Culture Texts (전통문화 콘텐츠 표준체계를 활용한 자동 텍스트 분류 시스템)

Hur, YunA;Lee, DongYub;Kim, Kuekyeng;Yu, Wonhee;Lim, HeuiSeok
- Journal of the Korea Convergence Society
- /
- v.8 no.12
- /
- pp.39-47
- /
- 2017
The Internet have increased the number of digital web documents related to the history and traditions of Korean Culture. However, users who search for creators or materials related to traditional cultures are not able to get the information they want and the results are not enough. Document classification is required to access this effective information. In the past, document classification has been difficult to manually and manually classify documents, but it has recently been difficult to spend a lot of time and money. Therefore, this paper develops an automatic text classification model of traditional cultural contents based on the data of the Korean information culture field composed of systematic classifications of traditional cultural contents. This study applied TF-IDF model, Bag-of-Words model, and TF-IDF/Bag-of-Words combined model to extract word frequencies for 'Korea Traditional Culture' data. And we developed the automatic text classification model of traditional cultural contents using Support Vector Machine classification algorithm.
https://doi.org/10.15207/JKCS.2017.8.12.039 인용 PDF KSCI

Development of Block-based Code Generation and Recommendation Model Using Natural Language Processing Model (자연어 처리 모델을 활용한 블록 코드 생성 및 추천 모델 개발)

Jeon, In-seong;Song, Ki-Sang
- Journal of The Korean Association of Information Education
- /
- v.26 no.3
- /
- pp.197-207
- /
- 2022
In this paper, we develop a machine learning based block code generation and recommendation model for the purpose of reducing cognitive load of learners during coding education that learns the learner's block that has been made in the block programming environment using natural processing model and fine-tuning and then generates and recommends the selectable blocks for the next step. To develop the model, the training dataset was produced by pre-processing 50 block codes that were on the popular block programming language web site 'Entry'. Also, after dividing the pre-processed blocks into training dataset, verification dataset and test dataset, we developed a model that generates block codes based on LSTM, Seq2Seq, and GPT-2 model. In the results of the performance evaluation of the developed model, GPT-2 showed a higher performance than the LSTM and Seq2Seq model in the BLEU and ROUGE scores which measure sentence similarity. The data results generated through the GPT-2 model, show that the performance was relatively similar in the BLEU and ROUGE scores except for the case where the number of blocks was 1 or 17.
https://doi.org/10.14352/jkaie.2022.26.3.197 인용 PDF KSCI

Development of Overhead Projector Films, CD-ROM, and Bio-Cosmos Home Page as Teaching Resources for High School Biology (고교 생물의 오버헤드 프로젝터용 필름 제작 및 전달 매체로서의 CD-ROM과 홈페이지의 설계)

Song, Bang-Ho;Sin, Youn-Uk;Choi, Mie-Sook;Park, Chang-Bo;Ahn, Na-Young;Kang, Jae-Seuk;Kim, Jeung-Hyun;Seo, Hae-Ae;Kwon, Duck-Kee;Sohn, Jong-Kyung;Chung, Hwa-Sook;Yang, Hong-Jun;Park, Sung-Ho
- Journal of The Korean Association For Science Education
- /
- v.19 no.3
- /
- pp.428-440
- /
- 1999
The colorful overhead projector films, named as Bio-cosmos II, including photographs, pictures, concept maps, and diagrams, were developed and manufactured as audio-visual teaching aids and teaching resources for students' biology learning in high school, and the CD-ROM and web sites for their application to the school were also constructed. The content of the films was organized based upon the analysis of seven different biology textbooks approved by the Ministry of Education. The films were designated based on various instructional strategies and manufactured using multimedia with various educational softwares. The CD-ROM was composed of the scenes as logo, initial main, chapters list, contents, and quit. Initial main scene indicated various chapters according to the texts of biology areas in General Science, Biology I, and II. Each chapters linked with the scenes for detailed concept maps, the downstream real subjects, and contents. The subject screens were composed of various types of summarized diagrams including lesson contents, figures, pictures, photographs, and their explanation, experimental procedures and results, tables for summarized contents, and additional animation with video captures, explanations, glossary, etc. Most files were manufactured in software Adobe Photoshop by scanning the pictures, figures and photographs, and then the explanation, modification, storing with PICT or PSD files, and transformation with JPG files, were processed in the aspect of high quality in terms of instructional strategies and graphic skills on gracefulness, clearness, colorfulness, brightness, and distinctness. A 14 films for biology areas in General Science, 80 for Biology I, and 142 for Biology II were manufactured and loaded to the CD-ROM and web site, and the files had been attempted to opened with an internet home-page of http://gic.kyungpook.ac.kr/biocosmos.
PDF

Optimal supervised LSA method using selective feature dimension reduction (선택적 자질 차원 축소를 이용한 최적의 지도적 LSA 방법)

Kim, Jung-Ho;Kim, Myung-Kyu;Cha, Myung-Hoon;In, Joo-Ho;Chae, Soo-Hoan
- Science of Emotion and Sensibility
- /
- v.13 no.1
- /
- pp.47-60
- /
- 2010
Most of the researches about classification usually have used kNN(k-Nearest Neighbor), SVM(Support Vector Machine), which are known as learn-based model, and Bayesian classifier, NNA(Neural Network Algorithm), which are known as statistics-based methods. However, there are some limitations of space and time when classifying so many web pages in recent internet. Moreover, most studies of classification are using uni-gram feature representation which is not good to represent real meaning of words. In case of Korean web page classification, there are some problems because of korean words property that the words have multiple meanings(polysemy). For these reasons, LSA(Latent Semantic Analysis) is proposed to classify well in these environment(large data set and words' polysemy). LSA uses SVD(Singular Value Decomposition) which decomposes the original term-document matrix to three different matrices and reduces their dimension. From this SVD's work, it is possible to create new low-level semantic space for representing vectors, which can make classification efficient and analyze latent meaning of words or document(or web pages). Although LSA is good at classification, it has some drawbacks in classification. As SVD reduces dimensions of matrix and creates new semantic space, it doesn't consider which dimensions discriminate vectors well but it does consider which dimensions represent vectors well. It is a reason why LSA doesn't improve performance of classification as expectation. In this paper, we propose new LSA which selects optimal dimensions to discriminate and represent vectors well as minimizing drawbacks and improving performance. This method that we propose shows better and more stable performance than other LSAs' in low-dimension space. In addition, we derive more improvement in classification as creating and selecting features by reducing stopwords and weighting specific values to them statistically.
PDF

Korean Word Sense Disambiguation using Dictionary and Corpus (사전과 말뭉치를 이용한 한국어 단어 중의성 해소)

Jeong, Hanjo;Park, Byeonghwa
- Journal of Intelligence and Information Systems
- /
- v.21 no.1
- /
- pp.1-13
- /
- 2015
As opinion mining in big data applications has been highlighted, a lot of research on unstructured data has made. Lots of social media on the Internet generate unstructured or semi-structured data every second and they are often made by natural or human languages we use in daily life. Many words in human languages have multiple meanings or senses. In this result, it is very difficult for computers to extract useful information from these datasets. Traditional web search engines are usually based on keyword search, resulting in incorrect search results which are far from users' intentions. Even though a lot of progress in enhancing the performance of search engines has made over the last years in order to provide users with appropriate results, there is still so much to improve it. Word sense disambiguation can play a very important role in dealing with natural language processing and is considered as one of the most difficult problems in this area. Major approaches to word sense disambiguation can be classified as knowledge-base, supervised corpus-based, and unsupervised corpus-based approaches. This paper presents a method which automatically generates a corpus for word sense disambiguation by taking advantage of examples in existing dictionaries and avoids expensive sense tagging processes. It experiments the effectiveness of the method based on Naïve Bayes Model, which is one of supervised learning algorithms, by using Korean standard unabridged dictionary and Sejong Corpus. Korean standard unabridged dictionary has approximately 57,000 sentences. Sejong Corpus has about 790,000 sentences tagged with part-of-speech and senses all together. For the experiment of this study, Korean standard unabridged dictionary and Sejong Corpus were experimented as a combination and separate entities using cross validation. Only nouns, target subjects in word sense disambiguation, were selected. 93,522 word senses among 265,655 nouns and 56,914 sentences from related proverbs and examples were additionally combined in the corpus. Sejong Corpus was easily merged with Korean standard unabridged dictionary because Sejong Corpus was tagged based on sense indices defined by Korean standard unabridged dictionary. Sense vectors were formed after the merged corpus was created. Terms used in creating sense vectors were added in the named entity dictionary of Korean morphological analyzer. By using the extended named entity dictionary, term vectors were extracted from the input sentences and then term vectors for the sentences were created. Given the extracted term vector and the sense vector model made during the pre-processing stage, the sense-tagged terms were determined by the vector space model based word sense disambiguation. In addition, this study shows the effectiveness of merged corpus from examples in Korean standard unabridged dictionary and Sejong Corpus. The experiment shows the better results in precision and recall are found with the merged corpus. This study suggests it can practically enhance the performance of internet search engines and help us to understand more accurate meaning of a sentence in natural language processing pertinent to search engines, opinion mining, and text mining. Naïve Bayes classifier used in this study represents a supervised learning algorithm and uses Bayes theorem. Naïve Bayes classifier has an assumption that all senses are independent. Even though the assumption of Naïve Bayes classifier is not realistic and ignores the correlation between attributes, Naïve Bayes classifier is widely used because of its simplicity and in practice it is known to be very effective in many applications such as text classification and medical diagnosis. However, further research need to be carried out to consider all possible combinations and/or partial combinations of all senses in a sentence. Also, the effectiveness of word sense disambiguation may be improved if rhetorical structures or morphological dependencies between words are analyzed through syntactic analysis.
https://doi.org/10.13088/jiis.2015.21.1.01 인용 PDF KSCI

Design and Implementation of a Question Management System based on a Concept Lattice (개념 망 구조를 기반으로 한 문항 관리 시스템의 설계 및 구현)

Kim, Mi-Hye
- The Journal of the Korea Contents Association
- /
- v.8 no.11
- /
- pp.412-425
- /
- 2008
One of the important elements for improving academic achievement of learners in education through e-learning is to support learners to study by finding questions they want with providing various evaluation questions. However, most of question retrieval systems usually depend on keyword search based on only a syntactical analysis and/or a hierarchical browsing system classified by the topics of subjects. In such a system it is not easy to find integrative questions associated with each other. In order to improve this problem, in this paper we proposed a question management and retrieval system which allows users to easily manage questions and also to effectively find questions for study on the Web. Then, we implemented a system that gives to access questions for the domain of C language programming. The system makes it possible to easily search questions related to not only a single theme but also questions integrated by interrelationship between topics and questions. This is done by supporting to be able to retrieve questions according to conceptual interrelationships between questions from user query. Consequently, it is expected that the proposed system will provide learners to understand the basic theories and the concepts of the subjects as well as to improve the ability of comprehensive knowledge utilization and problem-solving.
https://doi.org/10.5392/JKCA.2008.8.11.412 인용 PDF

Investigating Opinion Mining Performance by Combining Feature Selection Methods with Word Embedding and BOW (Bag-of-Words) (속성선택방법과 워드임베딩 및 BOW (Bag-of-Words)를 결합한 오피니언 마이닝 성과에 관한 연구)

Eo, Kyun Sun;Lee, Kun Chang
- Journal of Digital Convergence
- /
- v.17 no.2
- /
- pp.163-170
- /
- 2019
Over the past decade, the development of the Web explosively increased the data. Feature selection step is an important step in extracting valuable data from a large amount of data. This study proposes a novel opinion mining model based on combining feature selection (FS) methods with Word embedding to vector (Word2vec) and BOW (Bag-of-words). FS methods adopted for this study are CFS (Correlation based FS) and IG (Information Gain). To select an optimal FS method, a number of classifiers ranging from LR (logistic regression), NN (neural network), NBN (naive Bayesian network) to RF (random forest), RS (random subspace), ST (stacking). Empirical results with electronics and kitchen datasets showed that LR and ST classifiers combined with IG applied to BOW features yield best performance in opinion mining. Results with laptop and restaurant datasets revealed that the RF classifier using IG applied to Word2vec features represents best performance in opinion mining.
https://doi.org/10.14400/JDC.2019.17.2.163 인용 PDF KSCI HTML

Financial Fraud Detection using Text Mining Analysis against Municipal Cybercriminality (지자체 사이버 공간 안전을 위한 금융사기 탐지 텍스트 마이닝 방법)

Choi, Sukjae;Lee, Jungwon;Kwon, Ohbyung
- Journal of Intelligence and Information Systems
- /
- v.23 no.3
- /
- pp.119-138
- /
- 2017
Recently, SNS has become an important channel for marketing as well as personal communication. However, cybercrime has also evolved with the development of information and communication technology, and illegal advertising is distributed to SNS in large quantity. As a result, personal information is lost and even monetary damages occur more frequently. In this study, we propose a method to analyze which sentences and documents, which have been sent to the SNS, are related to financial fraud. First of all, as a conceptual framework, we developed a matrix of conceptual characteristics of cybercriminality on SNS and emergency management. We also suggested emergency management process which consists of Pre-Cybercriminality (e.g. risk identification) and Post-Cybercriminality steps. Among those we focused on risk identification in this paper. The main process consists of data collection, preprocessing and analysis. First, we selected two words 'daechul(loan)' and 'sachae(private loan)' as seed words and collected data with this word from SNS such as twitter. The collected data are given to the two researchers to decide whether they are related to the cybercriminality, particularly financial fraud, or not. Then we selected some of them as keywords if the vocabularies are related to the nominals and symbols. With the selected keywords, we searched and collected data from web materials such as twitter, news, blog, and more than 820,000 articles collected. The collected articles were refined through preprocessing and made into learning data. The preprocessing process is divided into performing morphological analysis step, removing stop words step, and selecting valid part-of-speech step. In the morphological analysis step, a complex sentence is transformed into some morpheme units to enable mechanical analysis. In the removing stop words step, non-lexical elements such as numbers, punctuation marks, and double spaces are removed from the text. In the step of selecting valid part-of-speech, only two kinds of nouns and symbols are considered. Since nouns could refer to things, the intent of message is expressed better than the other part-of-speech. Moreover, the more illegal the text is, the more frequently symbols are used. The selected data is given 'legal' or 'illegal'. To make the selected data as learning data through the preprocessing process, it is necessary to classify whether each data is legitimate or not. The processed data is then converted into Corpus type and Document-Term Matrix. Finally, the two types of 'legal' and 'illegal' files were mixed and randomly divided into learning data set and test data set. In this study, we set the learning data as 70% and the test data as 30%. SVM was used as the discrimination algorithm. Since SVM requires gamma and cost values as the main parameters, we set gamma as 0.5 and cost as 10, based on the optimal value function. The cost is set higher than general cases. To show the feasibility of the idea proposed in this paper, we compared the proposed method with MLE (Maximum Likelihood Estimation), Term Frequency, and Collective Intelligence method. Overall accuracy and was used as the metric. As a result, the overall accuracy of the proposed method was 92.41% of illegal loan advertisement and 77.75% of illegal visit sales, which is apparently superior to that of the Term Frequency, MLE, etc. Hence, the result suggests that the proposed method is valid and usable practically. In this paper, we propose a framework for crisis management caused by abnormalities of unstructured data sources such as SNS. We hope this study will contribute to the academia by identifying what to consider when applying the SVM-like discrimination algorithm to text analysis. Moreover, the study will also contribute to the practitioners in the field of brand management and opinion mining.
https://doi.org/10.13088/jiis.2017.23.3.119 인용 PDF KSCI

Prediction of Key Variables Affecting NBA Playoffs Advancement: Focusing on 3 Points and Turnover Features (미국 프로농구(NBA)의 플레이오프 진출에 영향을 미치는 주요 변수 예측: 3점과 턴오버 속성을 중심으로)

An, Sehwan;Kim, Youngmin
- Journal of Intelligence and Information Systems
- /
- v.28 no.1
- /
- pp.263-286
- /
- 2022
This study acquires NBA statistical information for a total of 32 years from 1990 to 2022 using web crawling, observes variables of interest through exploratory data analysis, and generates related derived variables. Unused variables were removed through a purification process on the input data, and correlation analysis, t-test, and ANOVA were performed on the remaining variables. For the variable of interest, the difference in the mean between the groups that advanced to the playoffs and did not advance to the playoffs was tested, and then to compensate for this, the average difference between the three groups (higher/middle/lower) based on ranking was reconfirmed. Of the input data, only this year's season data was used as a test set, and 5-fold cross-validation was performed by dividing the training set and the validation set for model training. The overfitting problem was solved by comparing the cross-validation result and the final analysis result using the test set to confirm that there was no difference in the performance matrix. Because the quality level of the raw data is high and the statistical assumptions are satisfied, most of the models showed good results despite the small data set. This study not only predicts NBA game results or classifies whether or not to advance to the playoffs using machine learning, but also examines whether the variables of interest are included in the major variables with high importance by understanding the importance of input attribute. Through the visualization of SHAP value, it was possible to overcome the limitation that could not be interpreted only with the result of feature importance, and to compensate for the lack of consistency in the importance calculation in the process of entering/removing variables. It was found that a number of variables related to three points and errors classified as subjects of interest in this study were included in the major variables affecting advancing to the playoffs in the NBA. Although this study is similar in that it includes topics such as match results, playoffs, and championship predictions, which have been dealt with in the existing sports data analysis field, and comparatively analyzed several machine learning models for analysis, there is a difference in that the interest features are set in advance and statistically verified, so that it is compared with the machine learning analysis result. Also, it was differentiated from existing studies by presenting explanatory visualization results using SHAP, one of the XAI models.
https://doi.org/10.13088/jiis.2022.28.1.263 인용 PDF KSCI

Automatic Target Recognition Study using Knowledge Graph and Deep Learning Models for Text and Image data (지식 그래프와 딥러닝 모델 기반 텍스트와 이미지 데이터를 활용한 자동 표적 인식 방법 연구)

Kim, Jongmo;Lee, Jeongbin;Jeon, Hocheol;Sohn, Mye
- Journal of Internet Computing and Services
- /
- v.23 no.5
- /
- pp.145-154
- /
- 2022
Automatic Target Recognition (ATR) technology is emerging as a core technology of Future Combat Systems (FCS). Conventional ATR is performed based on IMINT (image information) collected from the SAR sensor, and various image-based deep learning models are used. However, with the development of IT and sensing technology, even though data/information related to ATR is expanding to HUMINT (human information) and SIGINT (signal information), ATR still contains image oriented IMINT data only is being used. In complex and diversified battlefield situations, it is difficult to guarantee high-level ATR accuracy and generalization performance with image data alone. Therefore, we propose a knowledge graph-based ATR method that can utilize image and text data simultaneously in this paper. The main idea of the knowledge graph and deep model-based ATR method is to convert the ATR image and text into graphs according to the characteristics of each data, align it to the knowledge graph, and connect the heterogeneous ATR data through the knowledge graph. In order to convert the ATR image into a graph, an object-tag graph consisting of object tags as nodes is generated from the image by using the pre-trained image object recognition model and the vocabulary of the knowledge graph. On the other hand, the ATR text uses the pre-trained language model, TF-IDF, co-occurrence word graph, and the vocabulary of knowledge graph to generate a word graph composed of nodes with key vocabulary for the ATR. The generated two types of graphs are connected to the knowledge graph using the entity alignment model for improvement of the ATR performance from images and texts. To prove the superiority of the proposed method, 227 documents from web documents and 61,714 RDF triples from dbpedia were collected, and comparison experiments were performed on precision, recall, and f1-score in a perspective of the entity alignment..
https://doi.org/10.7472/jksii.2022.23.5.145 인용 PDF KSCI HTML

Search Result 1,308, Processing Time 0.026 seconds

이메일무단수집거부

이용약관

제 1 장 총칙

제 2 장 이용계약의 체결

제 3 장 계약 당사자의 의무

제 4 장 서비스의 이용

제 5 장 계약 해지 및 이용 제한

제 6 장 손해배상 및 기타사항

Detail Search

Image Search (β)