• Title/Summary/Keyword: Language Models

Manchu Script Letters Dataset Creation and Labeling

  • Aaron Daniel Snowberger;Choong Ho Lee
    • Journal of information and communication convergence engineering
    • /
    • v.22 no.1
    • /
    • pp.80-87
    • /
    • 2024
  • The Manchu language holds historical significance, but a complete dataset of Manchu script letters for training optical character recognition machine-learning models is currently unavailable. This paper therefore describes the process of creating a robust dataset of extracted Manchu script letters. Rather than performing automatic letter segmentation based on whitespace or the thickness of the central word stem, an image of the Manchu script was manually inspected, and one copy of the desired letter was selected as a region of interest. This region of interest was then used as a template to match all other occurrences of the same letter within the Manchu script image. Although the dataset in this study contained only 4,000 images of five Manchu script letters, the letters were collected from twenty-eight writing styles, and a full dataset of Manchu letters is expected to be obtained through the same process. The collected dataset was normalized and used to train a simple convolutional neural network to verify its effectiveness.
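
The template-matching step described above can be sketched in a few lines with OpenCV; the file names, the similarity threshold, and the 32x32 normalization below are illustrative assumptions, not the authors' actual pipeline:

```python
# Sketch of template matching a manually selected letter ROI against a script
# image (OpenCV). File names and the 0.8 threshold are assumptions.
import os
import cv2
import numpy as np

script = cv2.imread("manchu_page.png", cv2.IMREAD_GRAYSCALE)   # full script image
template = cv2.imread("letter_roi.png", cv2.IMREAD_GRAYSCALE)  # hand-picked letter ROI
h, w = template.shape

# Normalized cross-correlation scores every placement of the template.
result = cv2.matchTemplate(script, template, cv2.TM_CCOEFF_NORMED)
ys, xs = np.where(result >= 0.8)  # assumed similarity threshold

os.makedirs("dataset", exist_ok=True)
for i, (x, y) in enumerate(zip(xs, ys)):
    # Crop each matched occurrence and normalize its size for training.
    # (Overlapping hits would need non-maximum suppression in practice.)
    crop = cv2.resize(script[y:y + h, x:x + w], (32, 32))
    cv2.imwrite(f"dataset/letter_{i:04d}.png", crop)
```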

Aspect-Based Sentiment Analysis Using BERT: Developing Aspect Category Sentiment Classification Models (BERT를 활용한 속성기반 감성분석: 속성카테고리 감성분류 모델 개발)

  • Park, Hyun-jung;Shin, Kyung-shik
    • Journal of Intelligence and Information Systems
    • /
    • v.26 no.4
    • /
    • pp.1-25
    • /
    • 2020
  • Sentiment Analysis (SA) is a Natural Language Processing (NLP) task that analyzes, from written texts, the sentiments consumers or the public feel about an object. Aspect-Based Sentiment Analysis (ABSA) is a fine-grained analysis of the sentiment towards each aspect of an object. Because it has more practical value for business, ABSA is drawing attention from both academic and industrial organizations. Given a review that says "The restaurant is expensive but the food is really fantastic", for example, general SA evaluates the overall sentiment towards the restaurant as positive, while ABSA identifies the restaurant's 'price' aspect as negative and its 'food' aspect as positive. ABSA thus enables more specific and effective marketing strategies. To perform ABSA, it is necessary to identify the aspect terms or aspect categories included in the text and to judge the sentiments towards them. Accordingly, there are four main areas in ABSA: aspect term extraction, aspect category detection, Aspect Term Sentiment Classification (ATSC), and Aspect Category Sentiment Classification (ACSC). ABSA is usually conducted by extracting aspect terms and then performing ATSC on them, or by extracting aspect categories and then performing ACSC on them. An aspect category is expressed by one or more aspect terms, or is indirectly inferred from other words. In the preceding example, 'price' and 'food' are both aspect categories, and the category 'food' is expressed by the aspect term 'food' in the review. If the review mentioned 'pasta', 'steak', or 'grilled chicken special', these would all be aspect terms for the category 'food'. An aspect category referred to by one or more specific aspect terms is called an explicit aspect, whereas a category like 'price', which has no specific aspect term but can be inferred from an emotional word such as 'expensive', is called an implicit aspect. So far, 'aspect category' has been used to avoid confusion with 'aspect term'; from here on, we treat 'aspect category' and 'aspect' as the same concept and use 'aspect' for convenience. Note that ATSC analyzes the sentiment towards given aspect terms, so it handles only explicit aspects, while ACSC handles both explicit and implicit aspects. This study seeks answers to the following issues, ignored in previous studies that applied the BERT pre-trained language model to ACSC, and derives superior ACSC models. First, is it more effective to reflect the output vectors of the aspect category tokens than to use only the final output vector of the [CLS] token as the classification vector? Second, is there any performance difference between QA (Question Answering) and NLI (Natural Language Inference) types in the sentence-pair configuration of the input data? Third, does performance depend on the order of the sentence containing the aspect category in the QA- or NLI-type sentence pair? To answer these questions, we implemented 12 ACSC models and conducted experiments on 4 English benchmark datasets. As a result, we derived ACSC models that outperform existing studies without expanding the training dataset. We found that reflecting the output vectors of the aspect category tokens is more effective than using only the [CLS] output vector as the classification vector. We also found that QA-type input generally performs better than NLI, and that in the QA type the order of the sentence containing the aspect category is irrelevant to performance. Although there may be differences depending on the dataset, when using NLI-type sentence-pair input, placing the sentence containing the aspect category second seems to give better performance. The methodology for designing ACSC models used in this study could be applied similarly to other tasks such as ATSC.
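
A minimal sketch of the QA-type sentence-pair input and the aspect-token pooling compared in the paper, using Hugging Face Transformers; the auxiliary question wording, the mean pooling, and the final concatenation are illustrative assumptions, not the paper's exact design:

```python
# Sketch: sentence-pair (QA-type) BERT input, plus pooling the aspect-category
# token outputs as an alternative to the [CLS]-only classification vector.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

review = "The restaurant is expensive but the food is really fantastic."
aspect = "price"
question = f"What do you think of the {aspect}?"  # assumed QA-type auxiliary sentence

# Sentence pair: BERT builds [CLS] review [SEP] question [SEP].
inputs = tokenizer(review, question, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)

cls_vec = hidden[:, 0]  # [CLS]-only classification vector (the baseline choice)

# Alternative studied in the paper: also reflect the aspect-category tokens.
aspect_ids = tokenizer(aspect, add_special_tokens=False)["input_ids"]
positions = [i for i, t in enumerate(inputs["input_ids"][0].tolist()) if t in aspect_ids]
aspect_vec = hidden[0, positions].mean(dim=0)  # mean-pool aspect token outputs

# One illustrative way to combine both before a sentiment classifier head.
features = torch.cat([cls_vec[0], aspect_vec])
```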

CRNN-Based Korean Phoneme Recognition Model with CTC Algorithm (CTC를 적용한 CRNN 기반 한국어 음소인식 모델 연구)

  • Hong, Yoonseok;Ki, Kyungseo;Gweon, Gahgene
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.8 no.3
    • /
    • pp.115-122
    • /
    • 2019
  • For Korean phoneme recognition, Hidden Markov Model-Gaussian Mixture Model (HMM-GMM) systems or hybrid models that combine artificial neural networks with HMMs have mainly been used. However, these approaches are limited in that they require force-aligned training corpora that are manually annotated by experts. Recently, researchers have used neural phoneme recognition models that combine a recurrent neural network (RNN)-based structure with the connectionist temporal classification (CTC) algorithm to avoid the need for manually annotated training data. Yet these RNN-based models have another practical difficulty: the amount of data required grows as the structure becomes more sophisticated. This is particularly problematic for Korean, which lacks refined corpora. In this study, we apply the CTC algorithm, which does not require forced alignment, to build a Korean phoneme recognition model. Specifically, the model is based on a convolutional neural network (CNN), which requires relatively little data and can be trained faster than RNN-based models. We present results from two experiments and a best-performing model that distinguishes 49 Korean phonemes. The best-performing model combines a CNN with a 3-hop bidirectional LSTM, achieving a final phoneme error rate (PER) of 3.26. This PER is a considerable improvement over existing Korean phoneme recognition models, which report PERs ranging from 10 to 12.
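
A minimal PyTorch sketch of a CNN-plus-bidirectional-LSTM acoustic model trained with CTC, as described above; the layer sizes, feature dimensions, and phoneme indexing are illustrative assumptions rather than the paper's exact configuration:

```python
# Sketch: CNN front-end + bidirectional LSTM + CTC loss; CTC removes the need
# for force-aligned training data. All sizes below are assumptions.
import torch
import torch.nn as nn

class CRNNPhoneme(nn.Module):
    def __init__(self, n_mels=40, n_phonemes=49):
        super().__init__()
        # CNN front-end over (batch, 1, time, mel) spectrograms.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.rnn = nn.LSTM(32 * n_mels, 256, num_layers=3,
                           bidirectional=True, batch_first=True)
        self.out = nn.Linear(512, n_phonemes + 1)  # +1 for the CTC blank label

    def forward(self, x):               # x: (batch, 1, time, n_mels)
        f = self.conv(x)                # (batch, 32, time, n_mels)
        b, c, t, m = f.shape
        f = f.permute(0, 2, 1, 3).reshape(b, t, c * m)
        h, _ = self.rnn(f)              # (batch, time, 512)
        return self.out(h).log_softmax(-1)

model = CRNNPhoneme()
ctc = nn.CTCLoss(blank=49)
x = torch.randn(2, 1, 100, 40)                      # dummy spectrogram batch
logp = model(x).permute(1, 0, 2)                    # CTCLoss wants (T, N, C)
targets = torch.randint(0, 49, (2, 30))             # dummy phoneme label sequences
loss = ctc(logp, targets,
           torch.full((2,), 100, dtype=torch.long),
           torch.full((2,), 30, dtype=torch.long))
```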

A Comparative Research on End-to-End Clinical Entity and Relation Extraction using Deep Neural Networks: Pipeline vs. Joint Models (심층 신경망을 활용한 진료 기록 문헌에서의 종단형 개체명 및 관계 추출 비교 연구 - 파이프라인 모델과 결합 모델을 중심으로 -)

  • Sung-Pil Choi
    • Journal of the Korean Society for Library and Information Science
    • /
    • v.57 no.1
    • /
    • pp.93-114
    • /
    • 2023
  • Information extraction can facilitate the intensive analysis of documents by providing semantic triples that consist of the named entities recognized in the texts and the relations between them. However, most research so far has treated named entity recognition and relation extraction separately, as individual studies, and as a result the performance of complete information extraction systems has not been evaluated properly. This paper introduces two end-to-end information extraction models that extract the various entity names in clinical records and their relationships in the form of semantic triples, namely a pipeline model and a joint model, and compares their performance in depth. The pipeline model consists of an entity recognition subsystem based on bidirectional GRU-CRFs and a relation extraction module using a multiple-encoding scheme, whereas the joint model was implemented as a single bidirectional GRU-CRF equipped with a multi-head labeling method. In experiments on i2b2/VA 2010, the pipeline model performed 5.5% (F-measure) better. In addition, a comparative experiment against existing state-of-the-art systems that use large-scale neural language models and manually constructed features allowed the objective performance level of the end-to-end models implemented in this paper to be established properly.
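
A minimal PyTorch sketch contrasting the two designs above via the joint model: one shared bidirectional GRU encoder with two labeling heads. The CRF layer and the paper's exact multi-head labeling scheme are simplified away, and all sizes are assumptions:

```python
# Sketch of a joint entity/relation model: a shared BiGRU encoder with one head
# for entity tags and one for per-token relation labels (a simplified stand-in
# for multi-head labeling; a CRF would normally sit on top of the entity head).
import torch
import torch.nn as nn

class JointBiGRU(nn.Module):
    def __init__(self, vocab=10000, emb=100, hidden=200,
                 n_entity_tags=9, n_relation_labels=11):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.gru = nn.GRU(emb, hidden, bidirectional=True, batch_first=True)
        self.entity_head = nn.Linear(2 * hidden, n_entity_tags)      # BIO tags
        self.relation_head = nn.Linear(2 * hidden, n_relation_labels)

    def forward(self, tokens):             # tokens: (batch, seq_len)
        h, _ = self.gru(self.emb(tokens))  # (batch, seq_len, 2*hidden)
        return self.entity_head(h), self.relation_head(h)

model = JointBiGRU()
ents, rels = model(torch.randint(0, 10000, (4, 25)))
# A pipeline model would instead train two separate networks, feeding the
# first network's predicted entities into the relation extractor.
```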

Design of Translator for generating Secure Java Bytecode from Thread code of Multithreaded Models (다중스레드 모델의 스레드 코드를 안전한 자바 바이트코드로 변환하기 위한 번역기 설계)

  • 김기태;유원희
    • Proceedings of the Korea Society for Industrial Systems Conference
    • /
    • 2002.06a
    • /
    • pp.148-155
    • /
    • 2002
  • Multithreaded models improve the efficiency of parallel systems by combining internal parallelism, asynchronous data availability, and the locality of the von Neumann model. Such a model executes thread code that is generated by a compiler, and the quality of that code depends on the generation method. However, multithreaded models have the drawback that the execution model is restricted to a specific platform. Java, by contrast, is platform independent, so if thread code can be translated into Java bytecode, the advantages of multithreaded models can be exploited on many platforms. Java executes Java bytecode, the intermediate language format of the Java virtual machine; in our translator, Java bytecode serves as the intermediate language and the Java virtual machine acts as the back end. However, Java bytecode translated from multithreaded models has the drawback that it is not secure. In this paper, we make platform-independent multithreaded code executable on the Java virtual machine: we design and implement a translator that converts the thread code of multithreaded models into Java bytecode and checks the resulting bytecode for security problems.

Preparation of Soil Input Files to a Crop Model Using the Korean Soil Information System (흙토람 데이터베이스를 활용한 작물 모델의 토양입력자료 생성)

  • Yoo, Byoung Hyun;Kim, Kwang Soo
    • Korean Journal of Agricultural and Forest Meteorology
    • /
    • v.19 no.3
    • /
    • pp.174-179
    • /
    • 2017
  • Soil parameters are required inputs to crop models, which estimate crop yield under given environmental conditions. The Korean Soil Information System (KSIS), which provides detailed soil profile records for 390 soil series in HTML (HyperText Markup Language) format, is a useful source for preparing soil input files. The Korean Soil Information System Processing Tool (KSISPT) was developed to aid the generation of soil input data from the KSIS database. The tool was implemented in Java and consists of a set of modules for parsing the HTML documents of the KSIS, storing the data required for soil input files, calculating additional soil parameters, and writing soil input files to a local disk. Using this automated soil data preparation tool, about 940 soil input files were created for each of the DSSAT and ORYZA 2000 models. In combination with a soil series distribution map at 30 m resolution, spatially explicit analysis of crop yield under climate change becomes possible, which would help the development of adaptation strategies.
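
The KSISPT tool itself was written in Java; the Python sketch below only illustrates the same parse-compute-write flow. The table markup, field names, and output layout here are hypothetical, not the actual KSIS format or a real DSSAT .SOL file:

```python
# Sketch of the KSISPT flow: parse a soil-profile HTML table, collect per-horizon
# properties, and write a crop-model soil input file. Markup and fields assumed.
from bs4 import BeautifulSoup

def parse_soil_profile(html: str) -> list:
    """Extract per-horizon soil properties from a KSIS-style HTML table."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.select("table.profile tr")[1:]  # skip header row (assumed markup)
    horizons = []
    for row in rows:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        horizons.append({
            "depth_cm": cells[0],
            "sand_pct": float(cells[1]),
            "clay_pct": float(cells[2]),
            "organic_c": float(cells[3]),
        })
    return horizons

def write_soil_input(series_name: str, horizons: list) -> None:
    """Write one soil input file (generic whitespace layout, not real DSSAT)."""
    with open(f"{series_name}.SOL", "w") as f:
        f.write(f"*{series_name}\n")
        for h in horizons:
            f.write(f"{h['depth_cm']:>8}{h['sand_pct']:>8.1f}"
                    f"{h['clay_pct']:>8.1f}{h['organic_c']:>8.2f}\n")
```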

Treemapping Work-Sharing Relationships among Business Process Performers (트리맵을 이용한 비즈니스 프로세스 수행자간 업무공유 관계 시각화)

  • Ahn, Hyun;Kim, Kwanghoon Pio
    • Journal of Internet Computing and Services
    • /
    • v.17 no.4
    • /
    • pp.69-77
    • /
    • 2016
  • Recently, the importance of visual analytics has been recognized in the field of business intelligence. From the viewpoint of business intelligence, visual analytics aims to acquire valuable insights for decision making by interactively visualizing a variety of business information. In this paper, we propose a treemap-based method for visualizing work-sharing relationships among business process performers. A work-sharing relationship is established between two performers who jointly participate in a specific activity of a business process, and it is an important factor for understanding organizational structures and behaviors in a process-centric organization. To this end, we design and implement a treemap-based visualization tool that represents work-sharing relationships as well as the basic hierarchical information in business processes. Finally, we evaluate the usefulness of the proposed visualization tool through an operational example using XPDL (XML Process Definition Language) process models.
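
A minimal sketch of treemapping work-sharing strength, using the squarify package with matplotlib; the performer pairs and joint-activity counts are made-up illustrative data, not values from the paper's XPDL models:

```python
# Sketch: treemap where each rectangle is a performer pair and its area is
# proportional to how many activities the pair jointly participates in.
import matplotlib.pyplot as plt
import squarify  # pip install squarify (assumed available)

# (performer pair) -> number of jointly performed activities (illustrative data)
work_sharing = {
    "Ahn-Kim": 12,
    "Ahn-Lee": 7,
    "Kim-Park": 5,
    "Lee-Park": 3,
}

sizes = list(work_sharing.values())
labels = [f"{pair}\n{n}" for pair, n in work_sharing.items()]

squarify.plot(sizes=sizes, label=labels, alpha=0.8)
plt.axis("off")
plt.title("Work-sharing relationships among performers")
plt.show()
```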

A Korean Homonym Disambiguation System Based on a Statistical Model Using Weights

  • Kim, Jun-Su;Lee, Wang-Woo;Kim, Chang-Hwan;Ock, Cheol-young
    • Proceedings of the Korean Society for Language and Information Conference
    • /
    • 2002.02a
    • /
    • pp.166-176
    • /
    • 2002
  • A homonym can be disambiguated by other words in its context, such as the nouns and predicates used with it. This paper uses semantic information (co-occurrence data) obtained from the definitions in the part-of-speech (POS) tagged UMRD-S. We analyzed the results of an experiment on a homonym disambiguation system based on a statistical model to which Bayes' theorem is applied, and we propose a model that adds a sense-rate weight and a weight on the distance to adjacent words to improve accuracy. Applying the homonym disambiguation system with semantic information to the homonyms appearing in dictionary definition sentences showed an average accuracy of 98.32% for the 200 most frequent homonyms. We then selected 49 of these 200 homonyms (31 substantives and 18 predicates) and experimented on 50,703 sentences, each containing one of the 49 homonyms, extracted from the 3.5-million-word Sejong Project tagged corpus (a corpus of morphologically analyzed words). Assigning the sense-rate weight (prior probability) and the distance weight over the five words before and after the homonym to be disambiguated improved accuracy over existing statistical models by 2.93%.
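
A minimal sketch of the weighted Bayesian scoring described above: a sense-rate (prior) term plus distance-weighted co-occurrence likelihoods over a five-word window on each side. The probabilities, the smoothing constant, and the 1/distance weighting are illustrative assumptions, and the toy example uses English words rather than Korean:

```python
# Sketch: pick the sense maximizing log P(sense) + sum over windowed context
# words of log P(word | sense) scaled by a distance weight (nearer words count more).
import math

def disambiguate(senses, prior, cooccur, context, target_idx, window=5):
    best_sense, best_score = None, -math.inf
    for sense in senses:
        score = math.log(prior[sense])            # sense-rate (prior) weight
        for i, word in enumerate(context):
            dist = abs(i - target_idx)
            if i == target_idx or dist > window:
                continue
            p = cooccur.get((sense, word), 1e-6)  # smoothed co-occurrence prob.
            score += math.log(p) / dist           # assumed distance weighting
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Toy usage (all numbers invented):
senses = ["bank_river", "bank_money"]
prior = {"bank_river": 0.3, "bank_money": 0.7}
cooccur = {("bank_money", "loan"): 0.05, ("bank_river", "water"): 0.04}
context = ["she", "took", "a", "loan", "from", "the", "bank"]
print(disambiguate(senses, prior, cooccur, context, target_idx=6))  # -> bank_money
```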

A Methodology for the Development of an NCO Effectiveness Analysis Model Based on the Reference Model (참조모델기반 NCO 효과분석모델 개발방법)

  • Lim, Nam-Kyu;Lee, Tae-Gong;Son, Hyun-Sik;Park, Ji-Hyeon;Kim, Jae-Won
    • Journal of the military operations research society of Korea
    • /
    • v.36 no.3
    • /
    • pp.95-111
    • /
    • 2010
  • The elements involved in NCO (Network Centric Operations) effectiveness analysis, and the relationships among them, are more complicated than in platform-centric warfare (PCW). However, existing effectiveness analysis models have so far offered analysis methods centered on a single effectiveness element. A model that provides a unified view of, and a common language for, NCO effectiveness is therefore required. Enterprise Architecture (EA) uses reference models as a common language to control complexity and change. The objective of this study is to present a methodology for developing an NCO effectiveness analysis model based on the concepts of a reference model and an implementation model. To do this, we first study the concept of EA-based reference and implementation models; second, we study related effectiveness analysis methods and a methodology for identifying model components and their relationships; and third, we propose a methodology for developing an NCO effectiveness analysis model. Finally, we demonstrate the effectiveness of the methodology through a case study.

Access Control for XML Documents Using Extended RBAC (확장된 RBAC를 이용한 XML문서에 대한 접근제어)

  • Kim, Jong-Hun;Ban, Yong-Ho
    • Journal of Korea Multimedia Society
    • /
    • v.8 no.7
    • /
    • pp.869-881
    • /
    • 2005
  • XML (eXtensible Markup Language) has emerged as a prevalent standard for document representation and exchange on the Internet. XML documents contain information of differing degrees of sensitivity, so they must be selectively shared among user communities. There is thus a need for models and mechanisms enabling the specification and enforcement of access control policies for XML documents. Mechanisms are also required for the secure and selective dissemination of documents to users according to the authorizations they hold. In this paper, we present an access control model and mechanisms by which XML documents can be securely protected in web environments. We apply RBAC-based access control policies to the problem of secure and selective access to XML documents. The proposed model and mechanisms guarantee the secure use of XML documents by defining authorizations for elements, attributes, and links within an XML document as well as for the document as a whole.
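
A minimal sketch of role-based filtering at element and attribute granularity, using Python's ElementTree; the policy structure, role names, and document are illustrative assumptions, not the paper's extended-RBAC model:

```python
# Sketch: prune an XML tree so a role sees only the elements and attributes
# its authorizations permit. Policy layout and roles are assumptions.
import xml.etree.ElementTree as ET
from typing import Optional

# Per-role read authorizations over element tags and attribute names.
POLICY = {
    "doctor": {"elements": {"record", "diagnosis", "prescription"},
               "attributes": {"id", "date"}},
    "clerk":  {"elements": {"record"},
               "attributes": {"id"}},
}

def filter_for_role(elem: ET.Element, role: str) -> Optional[ET.Element]:
    """Return a copy of the tree with unauthorized elements/attributes removed."""
    rule = POLICY[role]
    if elem.tag not in rule["elements"]:
        return None  # the whole element is hidden from this role
    out = ET.Element(elem.tag,
                     {k: v for k, v in elem.attrib.items()
                      if k in rule["attributes"]})
    out.text = elem.text
    for child in elem:
        kept = filter_for_role(child, role)
        if kept is not None:
            out.append(kept)
    return out

doc = ET.fromstring('<record id="7" date="2005-07-01">'
                    '<diagnosis>flu</diagnosis><prescription>rest</prescription>'
                    '</record>')
print(ET.tostring(filter_for_role(doc, "clerk"), encoding="unicode"))
# -> <record id="7" />   (diagnosis, prescription, and date hidden from clerk)
```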
