• Title/Summary/Keyword: large language model


Experiment and Simulation for Evaluation of Jena Storage Plug-in Considering Hierarchical Structure (계층 구조를 고려한 Jena Plug-in 저장소의 평가를 위한 실험 및 시뮬레이션)

  • Shin, Hee-Young;Jeong, Dong-Won;Baik, Doo-Kwon
    • Journal of the Korea Society for Simulation / v.17 no.2 / pp.31-47 / 2008
  • Since the W3C selected OWL (Web Ontology Language) as a standard ontology description language, many ontologies have been built and developed in OWL. Jena, developed by HP, provides various APIs (Application Programming Interfaces) for developing inference engines as well as storage, and it is widely used for system development. However, the Jena2 storage model stores most of the OWL document content in a single table, which is unsuitable, and it shows low processing performance for large ontology data sets. Above all, the Jena2 storage model does not consider the hierarchical structures of classes and properties. In addition, queries that use the hierarchical structure perform poorly because they require many join operations. To solve these issues, this paper proposes a relational database model for OWL ontologies. The proposed model semantically classifies and stores information such as classes, properties, and instances. It improves query processing performance by managing hierarchical information in a separate table. This paper also describes the implementation and presents the experiment and evaluation results, together with a comparative analysis of both. The experiment and evaluation show that the proposed model performs markedly better than Jena2.
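
The idea of keeping hierarchy information in its own table can be illustrated with a minimal sketch. The abstract does not give the paper's schema, so everything here is an assumption: the table names `class_hierarchy` and `instance`, the sample data, and the choice of a closure table (one row per ancestor/descendant pair) are hypothetical and are not the authors' design.

```python
import sqlite3

# Hypothetical schema: a closure table for the class hierarchy plus an instance table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE class_hierarchy (ancestor TEXT, descendant TEXT);
CREATE TABLE instance (name TEXT, class TEXT);
""")

# One row per (ancestor, descendant) pair, including each class paired with itself.
conn.executemany("INSERT INTO class_hierarchy VALUES (?, ?)", [
    ("Animal", "Animal"), ("Animal", "Mammal"), ("Animal", "Dog"),
    ("Mammal", "Mammal"), ("Mammal", "Dog"),
    ("Dog", "Dog"),
])
conn.executemany("INSERT INTO instance VALUES (?, ?)", [
    ("rex", "Dog"), ("whale1", "Mammal"), ("tree1", "Plant"),
])

def instances_of(conn, cls):
    """Return instances of cls and of all its subclasses using a single join."""
    rows = conn.execute(
        "SELECT i.name FROM instance AS i "
        "JOIN class_hierarchy AS h ON i.class = h.descendant "
        "WHERE h.ancestor = ? ORDER BY i.name",
        (cls,),
    )
    return [name for (name,) in rows]

mammals = instances_of(conn, "Mammal")
print(mammals)
```

With the hierarchy precomputed in a separate table, a subclass-aware lookup needs one join regardless of hierarchy depth, which is the kind of saving the abstract attributes to its model.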


Automatic Classification of Academic Articles Using BERT Model Based on Deep Learning (딥러닝 기반의 BERT 모델을 활용한 학술 문헌 자동분류)

  • Kim, In hu;Kim, Seong hee
    • Journal of the Korean Society for Information Management / v.39 no.3 / pp.293-310 / 2022
  • In this study, we analyzed the performance of a BERT-based document classification model by automatically classifying documents in the field of library and information science with KoBERT. For this purpose, the abstracts of 5,357 papers from 7 journals in the field of library and information science were analyzed, and we evaluated whether automatic classification performance differed according to the size of the training data. Precision, recall, and F-measure were used as performance measures. The evaluation showed that subject areas with large amounts of high-quality data achieved a high level of performance, with an F-measure of 90% or more. On the other hand, when the data quality was low, the similarity with other subject areas was high, and there were few thematically distinctive features, meaningfully high performance could not be obtained. This study is expected to serve as basic data suggesting the feasibility of using a pre-trained model to automatically classify academic documents.

A Study on Automatic Classification of Class Diagram Images (클래스 다이어그램 이미지의 자동 분류에 관한 연구)

  • Kim, Dong Kwan
    • Journal of the Korea Convergence Society / v.13 no.3 / pp.1-9 / 2022
  • UML class diagrams are used to visualize the static aspects of a software system and are involved from analysis and design to documentation and testing. Software modeling using class diagrams is essential for software development, but it may not be an easy activity for inexperienced modelers. Modeling productivity could be improved with a dataset of class diagrams classified by domain category. To this end, this paper provides a classification method for a dataset of class diagram images. First, real class diagrams are selected from the collected images. Then, class names are extracted from the real class diagram images, and the images are classified according to domain categories. The proposed classification model achieved 100.00%, 95.59%, 97.74%, and 97.77% in precision, recall, F1-score, and accuracy, respectively. The accuracy scores for the domain categorization range from 81.1% to 95.2%. Although the number of class diagram images in the experiment is not large enough, the experimental results indicate that the proposed approach to class diagram image classification is worth considering.
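
As a quick check on how the reported metrics relate, the F1-score is the harmonic mean of precision and recall. The sketch below only applies that standard formula to the two percentages quoted in the abstract; the function name `f1_score` is illustrative and nothing here comes from the paper's code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values quoted in the abstract, converted from percentages to fractions.
reported_precision = 1.0000
reported_recall = 0.9559

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.4f}")
```

The result is roughly 0.977, in line with the reported 97.74%. A small difference in the last digit is expected because the quoted precision and recall are themselves rounded.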

Network Traffic Analysis System Based on Data Engineering Methodology (데이터 엔지니어링 방법론을 기반으로한 네트워크 트래픽 분석 시스템)

  • Han, Young-Shin;Kim, Tae-Kyu;Jung, Jason J.;Jung, Chan-Ki;Lee, Chil-Gee
    • Journal of the Korea Society for Simulation / v.18 no.1 / pp.27-34 / 2009
  • Currently, the number of network users, especially internet users, is increasing rapidly. In addition, high quality of service is required, and this requirement results in a sudden increase in network traffic. As a result, an efficient management system for huge volumes of network traffic has become an important issue. Ontology/data engineering based context awareness using System Entity Structure (SES) concepts enables network administrators to access traffic data easily and efficiently. The network traffic analysis system studied in this paper is designed and implemented based on modeling and simulation using a data engineering methodology, so that it can be used to evaluate large network traffic data. Extensible Markup Language (XML) is used as the metadata language in this system. The information extracted from the network traffic analysis system could be modeled and simulated with the Discrete Event System Specification (DEVS) methodology for further work such as post-simulation evaluation and web services.

Empirical Study for Automatic Evaluation of Abstractive Summarization by Error-Types (오류 유형에 따른 생성요약 모델의 본문-요약문 간 요약 성능평가 비교)

  • Seungsoo Lee;Sangwoo Kang
    • Korean Journal of Cognitive Science / v.34 no.3 / pp.197-226 / 2023
  • Generative text summarization is a natural language processing task that generates a short summary while preserving the content of a long text. ROUGE is a widely used lexical-overlap metric for text summarization models in generative summarization benchmarks. Although summarization models show very high performance, studies report that 30% of generated summaries are still inconsistent with the source text. This paper proposes a methodology for evaluating the performance of a summarization model without using reference summaries. AggreFACT is a human-annotated dataset that classifies the types of errors made by neural text summarization models. Among all the test candidates, two cases, generated summaries and errors occurring throughout the summary, showed the highest correlation. We observed that the proposed evaluation score showed a high correlation for models fine-tuned from BART and PEGASUS, which are pretrained with a large-scale Transformer architecture.
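
To make the "lexical overlap" point concrete, here is a minimal sketch of a ROUGE-1 style unigram overlap score. It is a simplification under stated assumptions (lowercasing, whitespace tokenization, no stemming) and is not the official ROUGE implementation or the paper's proposed metric; the function name `rouge1` and the example sentences are made up for illustration.

```python
from collections import Counter

def rouge1(candidate: str, reference: str) -> dict:
    """Simplified ROUGE-1: clipped unigram overlap between candidate and reference."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical pair: the candidate contradicts the reference on one key word.
reference = "the company reported a profit in 2022"
candidate = "the company reported a loss in 2022"
scores = rouge1(candidate, reference)
print(scores)
```

Six of the seven tokens match, so the overlap score is high even though the candidate states the opposite fact. This is the kind of inconsistency that a purely lexical metric cannot detect.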

A Word Embedding used Word Sense and Feature Mirror Model (단어 의미와 자질 거울 모델을 이용한 단어 임베딩)

  • Lee, JuSang;Shin, JoonChoul;Ock, CheolYoung
    • KIISE Transactions on Computing Practices / v.23 no.4 / pp.226-231 / 2017
  • Word representation, an important area of natural language processing (NLP) using machine learning, is a method of representing a word not as text but as a distinguishable symbol. Existing word embedding methods employ a large number of corpora so that words appearing near each other in text are positioned close together. However, corpus-based word embedding needs several corpora because of word occurrence frequency and the growing number of words. In this paper, word embedding is performed using dictionary definitions and semantic relationship information (hypernyms and antonyms). Words are trained using the feature mirror model (FMM), a modified Skip-Gram (Word2Vec) model. Words with similar senses have similar vectors. Furthermore, it was possible to distinguish the vectors of antonyms.

WV-BTM: A Technique on Improving Accuracy of Topic Model for Short Texts in SNS (WV-BTM: SNS 단문의 주제 분석을 위한 토픽 모델 정확도 개선 기법)

  • Song, Ae-Rin;Park, Young-Ho
    • Journal of Digital Contents Society / v.19 no.1 / pp.51-58 / 2018
  • As the number of SNS users and the amount of SNS data have increased explosively, research based on SNS big data has become active. In social mining, Latent Dirichlet Allocation (LDA), a typical topic modeling technique, is used to identify the similarity of texts in large volumes of unclassified SNS text and to extract trends from them. However, LDA has difficulty deducing high-level topics because of the semantic sparsity caused by infrequent word occurrences in short-sentence data. BTM addressed this limitation of LDA by using combinations of two words. However, BTM also has a limitation: because it is influenced more by the high-frequency word among the combined words, it cannot calculate weights that consider the relation with each topic. In this paper, we propose a technique to improve the accuracy of the existing BTM by reflecting semantic relations between words.
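
The "combination of two words" that BTM relies on is usually called a biterm: an unordered pair of distinct words that co-occur in the same short text. The sketch below shows only that extraction step on made-up token lists. The abstract does not describe how WV-BTM weights biterms by semantic relation, so that part is not attempted, and the helper name `extract_biterms` is hypothetical.

```python
from collections import Counter
from itertools import combinations

def extract_biterms(tokens):
    """Return all unordered pairs of distinct words that co-occur in one short text."""
    return [
        tuple(sorted(pair))
        for pair in combinations(tokens, 2)
        if pair[0] != pair[1]
    ]

# Hypothetical tokenized short posts.
docs = [
    ["apple", "phone", "launch"],
    ["phone", "battery"],
    ["apple", "phone"],
]

# Corpus-level biterm counts, the input a biterm topic model is fitted on.
biterm_counts = Counter(b for doc in docs for b in extract_biterms(doc))
print(biterm_counts.most_common())
```

Pooling biterms over the whole corpus gives the model co-occurrence evidence that a single short post cannot provide, which is how BTM works around the sparsity problem described for LDA.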

Coreference Resolution for Korean Using Random Forests (랜덤 포레스트를 이용한 한국어 상호참조 해결)

  • Jeong, Seok-Won;Choi, MaengSik;Kim, HarkSoo
    • KIPS Transactions on Software and Data Engineering / v.5 no.11 / pp.535-540 / 2016
  • Coreference resolution identifies mentions in documents and groups co-referring mentions. It is an essential step for natural language processing applications such as information extraction, event tracking, and question answering. Recently, various coreference resolution models based on machine learning (ML) have been proposed. As is well known, these ML-based models need large training data sets that are manually annotated with co-referring mention tags. Unfortunately, we cannot find usable open data for training ML-based models in Korean. Therefore, we propose an efficient coreference resolution model that needs less training data than other ML-based models. The proposed model identifies co-referring mentions using random forests based on sieve-guided features. In experiments with baseball news articles, the proposed model achieved a CoNLL F1-score of 0.6678, better than other ML-based models.

Filter-mBART Based Neural Machine Translation Using Parallel Corpus Filtering (병렬 말뭉치 필터링을 적용한 Filter-mBART기반 기계번역 연구)

  • Moon, Hyeonseok;Park, Chanjun;Eo, Sugyeong;Park, JeongBae;Lim, Heuiseok
    • Journal of the Korea Convergence Society / v.12 no.5 / pp.1-7 / 2021
  • In the latest trend of machine translation research, a model is pretrained on a large monolingual corpus and then fine-tuned on a parallel corpus. Although many studies tend to increase the amount of data used in the pretraining stage, it is hard to say that the amount of data must be increased to improve machine translation performance. In this study, through an experiment based on the mBART model using parallel corpus filtering, we suggest that high-quality data can yield better machine translation performance even with a smaller amount of data. We propose that it is important to consider the quality of data rather than the amount, and that this can be used as a guideline for building a training corpus.

Evaluation of Sentimental Texts Automatically Generated by a Generative Adversarial Network (생성적 적대 네트워크로 자동 생성한 감성 텍스트의 성능 평가)

  • Park, Cheon-Young;Choi, Yong-Seok;Lee, Kong Joo
    • KIPS Transactions on Software and Data Engineering / v.8 no.6 / pp.257-264 / 2019
  • Recently, deep neural network based approaches have shown good performance in various fields of natural language processing. A huge amount of training data is essential for building a deep neural network model. However, collecting a large amount of training data is costly and time-consuming. Data augmentation is one solution to this problem. Data augmentation of text is more difficult than that of images because texts consist of tokens with discrete values. Generative adversarial networks (GANs) are widely used for image generation. In this work, we generate sentimental texts using the CS-GAN model, a GAN that has a discriminator as well as a classifier. We evaluate the usefulness of the generated sentimental texts according to various measures. The CS-GAN model not only generates texts with more diversity but also improves the performance of its classifier.