• Title/Summary/Keyword: input sequence length

Search Results: 40

Fine-tuning BERT-based NLP Models for Sentiment Analysis of Korean Reviews: Optimizing the sequence length (BERT 기반 자연어처리 모델의 미세 조정을 통한 한국어 리뷰 감성 분석: 입력 시퀀스 길이 최적화)

  • Sunga Hwang;Seyeon Park;Beakcheol Jang
    • Journal of Internet Computing and Services
    • /
    • v.25 no.4
    • /
    • pp.47-56
    • /
    • 2024
  • This paper proposes a method for fine-tuning BERT-based natural language processing models to perform sentiment analysis on Korean review data. By varying the input sequence length during fine-tuning and comparing the resulting performance, we explore the optimal input sequence length. For this purpose, text reviews were collected by web scraping from the clothing shopping platform M. During preprocessing, the positive and negative satisfaction scores were recalibrated to improve the accuracy of the analysis: the GPT-4 API was used to reset the labels so that they reflect the actual sentiment of the review texts, and class imbalance was addressed by adjusting the data to a 6:4 ratio. Reviews on the platform averaged about 12 tokens in length, and to find the model best suited to this, five BERT-based pre-trained models were compared in the modeling stage with respect to input sequence length and memory usage. The experimental results indicated that an input sequence length of 64 generally offered the best balance of performance and memory usage. In particular, the KcELECTRA model performed best at an input sequence length of 64, achieving over 92% accuracy and reliability in sentiment analysis of Korean review data. Furthermore, using BERTopic with the final model, we provide a Korean review sentiment analysis process that classifies newly incoming reviews by category and extracts sentiment scores for each category.
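As an illustration of the sequence-length comparison described above, the sketch below tokenizes toy Korean reviews at several max_length values and runs a single forward pass through a BERT-family classifier. The checkpoint name, the example reviews, and the single forward pass (rather than full fine-tuning) are assumptions made for brevity, not the paper's exact setup.

```python
# Minimal sketch: comparing input sequence lengths for a BERT-family Korean
# sentiment classifier. The checkpoint, data, and single forward pass are
# illustrative assumptions, not the paper's exact configuration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "beomi/KcELECTRA-base"   # assumed checkpoint; any Korean BERT-family model works
texts = ["배송이 빨라요", "사이즈가 안 맞아서 실망했어요"]   # toy review texts
labels = torch.tensor([1, 0])                                 # 1 = positive, 0 = negative

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

for max_len in (32, 64, 128):                 # the key variable: input sequence length
    batch = tokenizer(texts, padding="max_length", truncation=True,
                      max_length=max_len, return_tensors="pt")
    out = model(**batch, labels=labels)       # one forward pass; a real run would train for several epochs
    print(f"max_length={max_len:4d}  loss={out.loss.item():.4f}  "
          f"input shape={tuple(batch['input_ids'].shape)}")
```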

An Effective Generation of Protocol Test Cases Using the Depth-Tree (깊이트리를 이용한 효율적인 프로토콜 시험항목 생성)

  • 허기택;이동호
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.18 no.9
    • /
    • pp.1395-1403
    • /
    • 1993
  • Protocol conformance is crucial to interoperability and cost-effective computer communication. Given a protocol specification, the task of checking whether an implementation conforms to the specification is called conformance testing. The efficiency and fault coverage of conformance testing depend largely on how the test cases are chosen. When the protocol is represented by an FSM (Finite State Machine), some states may have more than one UIO sequence, and the total test-sequence length can be minimized if the optimal test sequences are chosen. In this paper, we construct a depth-tree to find the maximum overlap among the test sequences. Using the resulting depth-tree, we generate a minimum-length test sequence and show an example of the minimum-length test sequence obtained in this way.
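The depth-tree construction itself is not reproduced here; the toy sketch below only illustrates the underlying idea that test subsequences can be overlapped to shorten the combined test sequence, using a generic greedy maximum-overlap merge.

```python
# Toy sketch of the overlap idea behind minimizing test-sequence length:
# merge input subsequences (e.g., UIO-based test cases) by their maximum
# overlap. This is a generic greedy merge, not the paper's depth-tree method.
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def merge_tests(tests):
    """Greedily merge test subsequences by maximum pairwise overlap."""
    seqs = [list(t) for t in tests]
    while len(seqs) > 1:
        best = (0, 0, 1)                      # (overlap, i, j)
        for i in range(len(seqs)):
            for j in range(len(seqs)):
                if i != j:
                    k = overlap(seqs[i], seqs[j])
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = seqs[i] + seqs[j][k:]        # overlap the two best-matching sequences
        seqs = [s for n, s in enumerate(seqs) if n not in (i, j)] + [merged]
    return seqs[0]

# Example: three UIO-style test cases over input symbols a, b, c, d
print("".join(merge_tests(["abc", "bcd", "cda"])))   # one shorter combined sequence
```

Greedy pairwise merging is only a heuristic; the paper's depth-tree is a structured way to find such overlaps systematically.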


Coreference Resolution using Hierarchical Pointer Networks (계층적 포인터 네트워크를 이용한 상호참조해결)

  • Park, Cheoneum;Lee, Changki
    • KIISE Transactions on Computing Practices
    • /
    • v.23 no.9
    • /
    • pp.542-549
    • /
    • 2017
  • Sequence-to-sequence models and similar pointer networks suffer from performance degradation when the input consists of multiple sentences or when the input is long. To address this problem, this paper proposes a hierarchical pointer network model that encodes an input composed of several sentences at both the word level and the sentence level. Based on this hierarchical pointer network, we propose a coreference resolution model that resolves coreference for all mentions. The experimental results show that the proposed model achieves a precision of 87.07%, a recall of 65.39%, and a CoNLL F1 of 74.61%, an improvement of 21.83% over an existing rule-based model.
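A minimal sketch of the hierarchical encoding idea (word-level states summarized into sentence-level states) is given below; the GRU layers, dimensions, and random inputs are illustrative assumptions rather than the paper's architecture.

```python
# Minimal sketch of hierarchical encoding: encode words within each sentence,
# then encode the per-sentence summaries. Layer types and sizes are assumed.
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, vocab_size=1000, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.word_rnn = nn.GRU(emb, hidden, batch_first=True)      # word-level encoder
        self.sent_rnn = nn.GRU(hidden, hidden, batch_first=True)   # sentence-level encoder

    def forward(self, docs):
        # docs: (batch, n_sentences, n_words) token ids
        b, s, w = docs.shape
        words = self.embed(docs.view(b * s, w))                    # (b*s, w, emb)
        word_states, last = self.word_rnn(words)                   # last: (1, b*s, hidden)
        sent_inputs = last.squeeze(0).view(b, s, -1)               # one summary vector per sentence
        sent_states, _ = self.sent_rnn(sent_inputs)                # sentence-level context
        return word_states.view(b, s, w, -1), sent_states

enc = HierarchicalEncoder()
doc = torch.randint(0, 1000, (2, 3, 7))          # 2 documents, 3 sentences, 7 words each
word_h, sent_h = enc(doc)
print(word_h.shape, sent_h.shape)                # (2, 3, 7, 128) and (2, 3, 128)
```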

Evaluation of LSTM Model for Inflow Prediction of Lake Sapgye (삽교호 유입량 예측을 위한 LSTM 모형의 적용성 평가)

  • Hwang, Byung-Gi
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.22 no.4
    • /
    • pp.287-294
    • /
    • 2021
  • A Python-based LSTM model was constructed with a TensorFlow backend to estimate flood outflow from the Gokgyo-cheon basin, which flows into Lake Sapgyo. To understand how the length of the input data used for learning, i.e., the sequence length, affects model performance, the model was run with sequence lengths of three, five, and seven hours. With a sequence length of three hours, the prediction performance was excellent over the entire period. When three extreme rainfall events were predicted in the model verification, an average NSE of 0.96 or higher was obtained for a lead time of one hour, and accuracy decreased gradually for lead times of two hours or more. In conclusion, the flood level at the Gangcheong station of Gokgyo-cheon can be predicted with high accuracy using a one-hour lead time and a sequence length of three hours.
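The sketch below reproduces the shape of this experiment: hourly values are cut into windows of 3, 5, and 7 time steps and a small LSTM is fitted on each window length. The synthetic series, network size, and training budget are assumptions, not the study's data or configuration.

```python
# Minimal sketch of the sequence-length experiment: build sliding windows of
# 3, 5, and 7 hours from an hourly series and fit a small Keras LSTM on each.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
hourly = rng.random(500).astype("float32")            # stand-in for hourly rainfall/inflow observations

def make_windows(series, seq_len):
    X = np.stack([series[i:i + seq_len] for i in range(len(series) - seq_len)])
    y = series[seq_len:]                               # predict the next hour
    return X[..., None], y                             # add a feature dimension

for seq_len in (3, 5, 7):                              # the candidate sequence lengths
    X, y = make_windows(hourly, seq_len)
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(seq_len, 1)),
        tf.keras.layers.LSTM(16),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    hist = model.fit(X, y, epochs=2, verbose=0)
    print(f"sequence length {seq_len} h: final MSE {hist.history['loss'][-1]:.4f}")
```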

A Reranking Model for Korean Morphological Analysis Based on Sequence-to-Sequence Model (Sequence-to-Sequence 모델 기반으로 한 한국어 형태소 분석의 재순위화 모델)

  • Choi, Yong-Seok;Lee, Kong Joo
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.7 no.4
    • /
    • pp.121-128
    • /
    • 2018
  • A Korean morphological analyzer can adopt a sequence-to-sequence (seq2seq) model, which can generate an output sequence whose length differs from that of the input. In general, a seq2seq-based Korean morphological analyzer takes a syllable-unit sequence as input and outputs a syllable-unit sequence. Syllable-based morphological analysis has the advantage that unknown words are easily handled, but the disadvantage that morpheme-level information is ignored. In this paper, we propose a reranking model, used as a post-processor of the seq2seq model, that can improve the accuracy of morphological analysis. The seq2seq-based morphological analyzer generates K results using beam search, and the reranking model exploits morpheme-unit embedding information as well as morpheme n-grams to reorder the K results. The experimental results show that the reranking model improves the F1 score by 1.17% over the original seq2seq model.
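A minimal sketch of the reranking step is shown below: K candidate analyses (hand-written here, with made-up seq2seq scores and bigram counts) are rescored by combining the seq2seq score with a morpheme n-gram score. The weighting and features are simplified assumptions, not the trained model described in the paper.

```python
# Minimal sketch of reranking K beam-search candidates with morpheme n-gram
# statistics. Candidates, bigram counts, and weights are illustrative only.
import math
from collections import Counter

# Hypothetical K-best output of a syllable-level seq2seq analyzer:
# each candidate is (seq2seq log-probability, list of morpheme/POS tokens)
candidates = [
    (-2.1, ["나/NP", "는/JX", "학교/NNG", "에/JKB", "가/VV", "ㄴ다/EF"]),
    (-1.9, ["나/NP", "는/JX", "학교에/NNG", "가/VV", "ㄴ다/EF"]),
]

# Morpheme bigram counts gathered from a training corpus (toy numbers here)
bigram = Counter({("나/NP", "는/JX"): 50, ("는/JX", "학교/NNG"): 30,
                  ("학교/NNG", "에/JKB"): 40, ("에/JKB", "가/VV"): 45,
                  ("가/VV", "ㄴ다/EF"): 60, ("는/JX", "학교에/NNG"): 1,
                  ("학교에/NNG", "가/VV"): 1})

def ngram_score(morphs):
    """Sum of smoothed log bigram counts over a candidate analysis."""
    return sum(math.log(bigram[(a, b)] + 1) for a, b in zip(morphs, morphs[1:]))

def rerank(cands, alpha=0.5):
    """Combine the seq2seq score with the morpheme n-gram score."""
    return max(cands, key=lambda c: c[0] + alpha * ngram_score(c[1]))

best = rerank(candidates)
print(best[1])    # the bigram evidence promotes the correctly segmented analysis
```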

A New Approach to Estimating the MIMO Channel in Wireless Networks

  • Kim, Jee-Hoon;Song, Hyoung-Kyu
    • Journal of information and communication convergence engineering
    • /
    • v.5 no.3
    • /
    • pp.229-232
    • /
    • 2007
  • This paper investigates the use of constant-amplitude zero-autocorrelation (CAZAC) sequences for channel estimation in a multiple-input multiple-output (MIMO) system over an indoor wireless channel. Since the symbol length of the conventional 4-phase CAZAC sequence is short, its use for MIMO systems in multipath environments is limited. An algorithm that generates longer CAZAC sequences is proposed to overcome this problem. The proposed algorithm can produce 4-phase CAZAC sequences of flexible symbol length, so that CAZAC sequences of an appropriate symbol length can be used as preambles according to the number of transmit antennas and the channel condition. The effect of the number of CAZAC sequences used for channel estimation is presented in terms of mean square error (MSE).
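To illustrate the CAZAC property the paper relies on, the sketch below generates a Zadoff-Chu sequence, a standard CAZAC family rather than the paper's 4-phase construction, and checks constant amplitude and zero cyclic autocorrelation for a chosen symbol length.

```python
# Minimal sketch of the CAZAC property used for channel estimation: generate a
# Zadoff-Chu sequence (a standard CAZAC family, not the paper's 4-phase
# construction) and verify constant amplitude and zero cyclic autocorrelation.
import numpy as np

def zadoff_chu(N, root=1):
    n = np.arange(N)
    return np.exp(-1j * np.pi * root * n * (n + 1) / N)   # form for odd N

N = 63                                   # symbol length can be chosen to suit antennas/channel
seq = zadoff_chu(N)

print("constant amplitude:", np.allclose(np.abs(seq), 1.0))
# cyclic autocorrelation at all non-zero lags should be ~0
autocorr = [np.vdot(seq, np.roll(seq, lag)) for lag in range(1, N)]
print("max |autocorrelation| at non-zero lag:", max(abs(a) for a in autocorr))
```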

Linear-Time Korean Morphological Analysis Using an Action-based Local Monotonic Attention Mechanism

  • Hwang, Hyunsun;Lee, Changki
    • ETRI Journal
    • /
    • v.42 no.1
    • /
    • pp.101-107
    • /
    • 2020
  • For Korean language processing, morphological analysis is a critical component that requires extensive work. Morphological analysis can be conducted in an end-to-end manner, without complicated feature design, using a sequence-to-sequence model. However, the sequence-to-sequence model has a time complexity of O(n²) for an input of length n when the attention mechanism is used for high performance. In this study, we propose a linear-time Korean morphological analysis model using a local monotonic attention mechanism that relies on monotonic alignment, a characteristic of Korean morphological analysis. The proposed model shows a substantial speed improvement in a single-threaded environment and a high morpheme-level F1-measure, even for a hard-attention variant in which the attention-weight computation is eliminated.
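A minimal sketch of local monotonic attention follows: at each step only a fixed-size window around a monotonically advancing position is scored, so the attention cost grows linearly with the input length. The window size and dot-product scoring are assumptions, not the paper's exact formulation.

```python
# Minimal sketch of local monotonic attention: attend only to a fixed window
# around a monotonically advancing input position, giving O(n) total cost
# instead of the O(n^2) of full soft attention.
import numpy as np

def local_monotonic_attention(enc, window=3):
    """enc: (n, d) encoder states; returns one context vector per step."""
    n, d = enc.shape
    contexts = []
    for t in range(n):                       # alignment advances monotonically with t
        lo, hi = max(0, t - window), min(n, t + window + 1)
        scores = enc[lo:hi] @ enc[t]         # dot-product scores within the local window only
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        contexts.append(weights @ enc[lo:hi])
    return np.stack(contexts)

enc_states = np.random.default_rng(0).standard_normal((10, 8))
ctx = local_monotonic_attention(enc_states)
print(ctx.shape)                             # (10, 8): one context per step
```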

A new method to predict the protein sequence alignment quality (단백질 서열정렬 정확도 예측을 위한 새로운 방법)

  • Lee, Min-Ho;Jeong, Chan-Seok;Kim, Dong-Seop
    • Bioinformatics and Biosystems
    • /
    • v.1 no.1
    • /
    • pp.82-87
    • /
    • 2006
  • The most popular protein structure prediction method is comparative modeling. To guarantee accurate comparative modeling, the sequence alignment between a query protein and a template must be accurate. Although choosing the best template based on protein sequence alignments is critical for accurate fold recognition in comparative modeling, the quality of the sequence alignment itself is even more critical. In contrast to the considerable attention devoted to methods for choosing the best template, prediction of alignment accuracy has received little interest. Here, we develop a method for predicting the shift score, a recently proposed measure of alignment quality, using support vector regression (SVR). The alignment between a query protein and a template protein of length n in our library is transformed into an input vector of length n + 2. Structural alignments are taken as the reference alignments, and the SVR is trained to predict the shift score between the structural alignment and the profile-profile alignment of a query protein to a template protein. Performance is assessed by the Pearson correlation coefficient: the trained SVR predicts the shift score with a correlation of 0.80 between observed and predicted values.
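The sketch below mirrors the regression setup on synthetic data: a support vector regressor maps fixed-length alignment feature vectors to shift scores and is evaluated with the Pearson correlation coefficient. The random features and scores are placeholders for real profile-profile alignment data.

```python
# Minimal sketch of the regression setup: train an SVR to map a fixed-length
# alignment feature vector to a shift score, evaluated by Pearson correlation.
import numpy as np
from sklearn.svm import SVR
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_align, n_feat = 200, 52             # e.g., a template of length n gives n + 2 features
X = rng.standard_normal((n_align, n_feat))
true_w = rng.standard_normal(n_feat)
y = np.tanh(X @ true_w / n_feat**0.5) # synthetic "shift scores" in [-1, 1]

model = SVR(kernel="rbf", C=1.0, epsilon=0.05)
model.fit(X[:150], y[:150])
pred = model.predict(X[150:])
r, _ = pearsonr(y[150:], pred)        # the paper evaluates with Pearson correlation
print(f"Pearson r on held-out alignments: {r:.2f}")
```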


Score Image Retrieval Robust to Inaccurate OMR Performance

  • Kim, Haekwang
    • Journal of Broadcast Engineering
    • /
    • v.26 no.7
    • /
    • pp.838-843
    • /
    • 2021
  • This paper presents an algorithm for effectively retrieving score information for an input score image. The originality of the proposed algorithm is that it is designed to be robust to recognition errors of an OMR (Optical Music Recognition) system, whereas existing methods such as the pitch histogram require the error-laden OMR result to be corrected before the retrieval process. This approach lets people retrieve a score without being trained in music notation for error correction. OMR takes a score image as input, recognizes the musical symbols, and produces a structured symbolic notation of the score as output, for example in MusicXML format. Among the musical symbols on a score, filled noteheads are rarely misrecognized by OMR, owing to their simple, solid round shape. Barlines that separate measures are also robust to OMR errors, owing to their long, uniform vertical-line shape. The proposed algorithm consists of a descriptor for a score and a similarity measure between a query score and a reference score. The descriptor is based on the note count, the number of filled noteheads in a measure: each part of a score is represented by a sequence of note counts, and the descriptor is the n-gram sequence of that note-count sequence. Simulation results show that the proposed algorithm works successfully, to a certain degree, for score image-based retrieval on erroneous OMR output.
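A minimal sketch of the note-count descriptor is given below: each part is reduced to a per-measure note-count sequence, n-grams are taken, and two scores are compared by n-gram overlap. The Jaccard similarity used here is an illustrative choice, not necessarily the paper's measure.

```python
# Minimal sketch of the note-count descriptor: per-measure filled-notehead
# counts, turned into n-grams and compared by set overlap (Jaccard, assumed).
def note_count_ngrams(counts, n=3):
    """counts: per-measure filled-notehead counts; returns the set of n-grams."""
    return {tuple(counts[i:i + n]) for i in range(len(counts) - n + 1)}

def similarity(query_counts, ref_counts, n=3):
    q, r = note_count_ngrams(query_counts, n), note_count_ngrams(ref_counts, n)
    return len(q & r) / len(q | r) if q | r else 0.0

reference = [4, 4, 3, 5, 4, 4, 2, 6]          # per-measure note counts of a reference score
query_ok  = [4, 4, 3, 5, 4, 4, 2, 6]          # same piece, clean recognition
query_err = [4, 4, 3, 5, 4, 7, 2, 6]          # same piece with one OMR miscount

print(similarity(query_ok, reference))             # 1.0
print(round(similarity(query_err, reference), 2))  # still non-zero despite the OMR error
```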

Malware Classification Possibility based on Sequence Information (순서 정보 기반 악성코드 분류 가능성)

  • Yun, Tae-Uk;Park, Chan-Soo;Hwang, Tae-Gyu;Kim, Sung Kwon
    • Journal of KIISE
    • /
    • v.44 no.11
    • /
    • pp.1125-1129
    • /
    • 2017
  • LSTM (Long Short-Term Memory) is a kind of RNN (Recurrent Neural Network) in which the next state is updated while remembering previous states. The call-sequence information in a malware sample can be represented as the system-call function invoked at each time step. In this paper, we use the system-call sequences of malware as input for classification, exploiting the LSTM's ability to remember previous states. We run an experiment to show that our method can classify malware, and we measure accuracy while varying the length of the system-call sequences.
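The sketch below mirrors this setup on synthetic data: integer-encoded system-call traces of several lengths are fed to a small LSTM classifier. The random traces, labels, and network size are assumptions made only to show the sequence-length experiment.

```python
# Minimal sketch of the classification setup: an LSTM over integer-encoded
# system-call sequences, trained at several sequence lengths.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
n_samples, vocab = 200, 50                     # 50 distinct system calls (assumed)

for seq_len in (20, 50, 100):                  # vary the system-call sequence length
    X = rng.integers(0, vocab, size=(n_samples, seq_len))
    y = rng.integers(0, 2, size=n_samples)     # 0 = benign, 1 = malware (toy labels)
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(seq_len,), dtype="int32"),
        tf.keras.layers.Embedding(vocab, 16),  # embed system-call IDs
        tf.keras.layers.LSTM(32),              # remembers earlier calls in the trace
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    hist = model.fit(X, y, epochs=2, verbose=0)
    print(f"sequence length {seq_len}: accuracy {hist.history['accuracy'][-1]:.2f}")
```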