Search | Korea Science

Document Classification Methodology Using Autoencoder-based Keywords Embedding

Seobin Yoon;Namgyu Kim
- Journal of the Korea Society of Computer and Information
- /
- v.28 no.9
- /
- pp.35-46
- /
- 2023
In this study, we propose a Dual Approach methodology that improves document classification accuracy by using both contextual and keyword information. First, contextual information is extracted using Google's BERT, a pre-trained language model known for its strong performance on a range of natural language understanding tasks. Specifically, we employ KoBERT, a model pre-trained on a Korean corpus, to extract contextual information in the form of the CLS token. Second, keyword information is generated for each document by encoding its set of keywords into a single vector using an Autoencoder. We applied the proposed approach to 40,130 healthcare and medicine documents from the National R&D Projects database of the National Science and Technology Information Service (NTIS). The experimental results show that the proposed methodology achieves higher document classification accuracy than existing methods that rely solely on document or word information.
https://doi.org/10.9708/jksci.2023.28.09.035

Grammatical Properties of Kes Constructions in a Speech Corpus (연설문 말뭉치에서 나타나는 '것' 구문의 문법적 특징)

Kim, Jong-Bok;Lee, Seung-Han;Kim, Kyung-Min
- Korean Journal of Cognitive Science
- /
- v.19 no.3
- /
- pp.257-281
- /
- 2008
The expression 'kes' is one of the most widely used expressions in the language, and its uses depend heavily on context. This context dependence makes it hard to determine its grammatical properties. To examine these properties in a more controlled context, this paper collects a series of speeches made by government officials and examines the grammatical properties of the expression in that corpus. In particular, based on 539 instances of 'kes' extracted from the corpus, the paper focuses on the 7 types of 'kes' constructions most widely used in the collected speech corpus.

A Dynamic Link Model for Korean POS-Tagging (한국어 품사 태깅을 위한 다이내믹 링크 모델)

Hwang, Myeong-Jin;Kang, Mi-Young;Kwon, Hyuk-Chul
- Annual Conference on Human and Language Technology
- /
- 2007.10a
- /
- pp.282-289
- /
- 2007
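Data sparseness is a key issue in statistical POS tagging. In agglutinative languages such as Korean and Turkish, an eojeol (word) consists of multiple morphemes, which makes the sparseness problem more severe. To overcome this, some studies have treated a sentence in an agglutinative language as a sequence of morphemes rather than a sequence of eojeols. Because eojeol-level characteristics are then lost, it becomes difficult to obtain statistics such as changes in an eojeol's grammatical category through derivation, as well as statistics between eojeols. This paper proposes a probabilistic model that resolves the data sparseness problem while retaining eojeol-level information, by devising an efficient method for computing inter-eojeol transition probabilities. Given the morphosyntactic characteristics of Korean, most of the information needed to compute the inter-eojeol transition probability can be obtained from just two inter-eojeol transition links: the last morpheme of the preceding eojeol paired with either the first or the last morpheme of the following eojeol. Based on the further observation that only one of the two links is needed depending on context, we propose the 'Dynamic Link Model', which uses rules to select one of the two transition links for computing the transition probability. Using only morpheme POS bigrams, the model achieves 96.60% accuracy on the experimental corpus. This is comparable to the performance of other models evaluated on the same corpus that use more contextual information, such as morpheme POS trigrams.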

Korean-Hanja Translation System based on Semantic Processing (의미처리 기반의 한글-한자 변환 시스템)

Kim, Hong-Soon;Sin, Joon-Choul;Ok, Cheol-Young
- Annual Conference of KIPS
- /
- 2011.04a
- /
- pp.398-401
- /
- 2011
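In word processors, converting Hangul words that have Hanja forms into Hanja is inefficient, because the user must convert syllable by syllable or word by word, which takes considerable time. This paper proposes a system that automatically converts Hangul into the contextually appropriate Hanja through semantic processing of Korean sentences. Context-appropriate Hangul-Hanja conversion first requires accurate morphological analysis and homograph disambiguation. To this end, we implemented a hidden Markov model-based system that tags morphemes and homographs simultaneously. The proposed system extracts unigrams and bigrams from about 11 million eojeols of the morpho-semantically tagged Sejong corpus, builds an eojeol generation-probability dictionary from the unigrams, and builds a transition-probability training dictionary from the bigrams. We then implemented a system that, after POS and homograph tagging, converts nouns into the Hanja listed in the Standard Korean Language Dictionary. To evaluate the implemented system, we constructed a held-out (non-training) corpus at the sentence level from the full Sejong corpus. In the experiments, the system achieved 90.35% accuracy in Hanja conversion of homographs that have Hanja forms.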
https://doi.org/10.3745/PKIPS.y2011m04a.398
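The tagging step described in this abstract can be illustrated with a minimal Viterbi sketch. This is not the authors' implementation: the function name `viterbi`, the romanized toy words, the sense labels, and all probabilities below are made up for illustration, and the emission and transition tables merely stand in for the unigram and bigram dictionaries the abstract mentions.

```python
import math

def viterbi(words, candidates, emission, transition, floor=1e-6):
    """Return the most probable tag sequence for `words`.

    candidates: word -> list of possible tags (e.g. homograph senses)
    emission:   (word, tag) -> probability, standing in for unigram statistics
    transition: (prev_tag, tag) -> probability, standing in for bigram statistics
    floor:      probability used for unseen events (an assumption, not from the paper)
    """
    # best[tag] = (log-probability of the best path ending in tag, that path)
    best = {"<s>": (0.0, [])}
    for word in words:
        nxt = {}
        for tag in candidates[word]:
            e = math.log(emission.get((word, tag), floor))
            score, path = max(
                (s + math.log(transition.get((prev, tag), floor)) + e, p)
                for prev, (s, p) in best.items()
            )
            nxt[tag] = (score, path + [tag])
        best = nxt
    return max(best.values())[1]

# Toy data: "sagwa" is ambiguous between "apple" and "apology".
candidates = {"sagwa": ["apple", "apology"], "meokda": ["eat"]}
emission = {("sagwa", "apple"): 0.5, ("sagwa", "apology"): 0.5, ("meokda", "eat"): 1.0}
transition = {
    ("<s>", "apple"): 0.5,
    ("<s>", "apology"): 0.5,
    ("apple", "eat"): 0.9,
    ("apology", "eat"): 0.01,
}
tags = viterbi(["sagwa", "meokda"], candidates, emission, transition)
```

In this toy setting the following verb disambiguates the noun, so `tags` selects the "apple" sense; a conversion step would then look up the Hanja for the chosen sense.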

Array Localization for Multithreaded Code Generation (다중스레드 코드 생성을 위한 배열 지역화)

Yang, Chang-Mo;Yu, Won-Hui
- The Transactions of the Korea Information Processing Society
- /
- v.3 no.6
- /
- pp.1407-1417
- /
- 1996
Recent thread partitioning algorithms break a thread at each long-latency operation and merge threads to obtain longer threads under the given constraints. Because of this limitation, even a program with little parallelism is partitioned into small threads, and context switches occur frequently. In this paper, we propose another method, array localization, which uses information about the array name, the dependence distance (the difference between the accessed element index and the loop index), and the element usage, which indicates whether an element is used or defined. Using this information, we can allocate array elements to the node where the corresponding loop activation is executed. With array localization, remote accesses to array elements can be replaced with local accesses to localized array elements. As a result, the boundaries of some threads are removed, programs can be partitioned into larger threads, and the number of context switches is reduced.

Prediction of Reduction Goals: Deterministic Approach (리덕션 골의 예상: 결정적인 접근 방법)

이경옥
- Journal of KIISE:Software and Applications
- /
- v.30 no.5_6
- /
- pp.461-465
- /
- 2003
The technique of reduction goal prediction in LR parsing has several applications such as the computation of right context. An LR parser generating the set of pre-determined reduction goals was previously suggested. The set approach is nondeterministic, and so it is inappropriate in some applications. This paper suggests a deterministic technique to give a uniquely predictable reduction symbol.

A Search Method for Components Based-on XML Component Specification (XML 컴포넌트 명세서 기반의 컴포넌트 검색 기법)

Park, Seo-Young;Shin, Yoeng-Gil;Wu, Chi-Su
- Journal of KIISE:Software and Applications
- /
- v.27 no.2
- /
- pp.180-192
- /
- 2000
Recently, component technology has played a main role in software reuse. It has changed code-based reuse into binary code-based reuse, because components can be easily combined into the software under development through component interfaces alone. Since components and component users have increased rapidly, users need to search for the most proper components for HTML among the enormous number of components on the Internet. It is desirable to use web-document-typed specifications for component specifications on the Internet. This paper proposes using XML component specifications instead of HTML specifications, because HTML cannot represent the semantics of contexts. We also propose an XML context-search method based on XML component specifications. In their queries, component users use contexts for the component properties and terms for the values of those properties. The index structure for the context-based search method is an inverted file indexing structure of term-context-component specification. In addition to the XML context-based search method, a variety of search methods built on context-based search, such as keyword search, faceted search, and browsing search, are provided for the convenience of users. For an efficient index scheme, the search engine uses a 3-layer architecture with an interface layer, a query expansion layer, and an XML search engine layer. In this paper, an XML DTD (Document Type Definition) for component specification is defined, and experimental results comparing the search performance of XML with that of HTML are discussed.
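The term-context-component inverted file mentioned in this abstract can be sketched roughly as follows. This is a hedged illustration only: the class name `ContextIndex`, its methods, and the sample specifications are hypothetical, and the paper's actual index layout and query expansion layer are not reproduced here.

```python
from collections import defaultdict

class ContextIndex:
    """Inverted index keyed by (term, context), mapping to component spec ids."""

    def __init__(self):
        self._postings = defaultdict(set)  # (term, context) -> spec ids
        self._by_term = defaultdict(set)   # term -> spec ids in any context

    def add(self, spec_id, properties):
        """Index one specification; `properties` maps context name -> text value."""
        for context, value in properties.items():
            for term in value.lower().split():
                self._postings[(term, context)].add(spec_id)
                self._by_term[term].add(spec_id)

    def search(self, term, context=None):
        """Context-based search, or plain keyword search when context is None."""
        term = term.lower()
        if context is None:
            return set(self._by_term.get(term, ()))
        return set(self._postings.get((term, context), ()))

index = ContextIndex()
index.add("c1", {"name": "Image Viewer", "description": "displays image files"})
index.add("c2", {"name": "File Chooser", "description": "image preview for files"})

in_name = index.search("image", context="name")
anywhere = index.search("image")
```

Restricting the query to the `name` context returns only `c1`, whereas the plain keyword query returns both specifications, which is the distinction a context-free HTML specification cannot express.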

Eosinophilic Infiltration in the Liver: Unusual Manifestation of Hepatic Segmental Involvement (비전형적인 간 분절 호산구 침윤: 증례 보고)

Lee, Hyun-Joo;Kim, Dae-Jung;Heo, Jin-Hyung;Kim, Kyoung-Ah;Yoon, Sang-Wook;Lee, Jong-Tae
- Investigative Magnetic Resonance Imaging
- /
- v.16 no.1
- /
- pp.76-80
- /
- 2012
Eosinophilic infiltration in the liver is not a rare disease, and it usually presents as multiple, small, ill-defined, oval or round, low-attenuation lesions on the portal phase of computed tomography. We report a case of hepatic eosinophilic infiltration presenting as an unusual manifestation of segmental involvement.

Template Constrained Sequence to Sequence based Conversational Utterance Error Correction Method (문장틀 기반 Sequence to Sequence 구어체 문장 문법 교정기)

Jeesu Jung;Seyoun Won;Hyein Seo;Sangkeun Jung;Du-Seong Chang
- Annual Conference on Human and Language Technology
- /
- 2022.10a
- /
- pp.553-558
- /
- 2022
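Recently, natural language processing applications for colloquial data have been increasing. Because of the way such communication takes place, colloquial sentences are unrefined and inevitably contain various grammatical errors, such as spacing errors and sentence distortion. Automatic grammar correctors are used as a preprocessing and first-pass cleaning tool for such colloquial data. As research on sentence generation with pre-trained Transformers has become active, automatic grammar correctors based on them are also being studied. In Transformer-based sentence correction, errors arise when the model misjudges whether a correction is needed. These errors are mostly caused by words that confuse the context. To remedy these errors, this paper proposes input and output sentence templates in which proper nouns, which are morphemes not needed for correction, are masked, and shows that performance improves when the proper nouns are restored into these templates.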
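The mask-then-restore idea in this abstract can be sketched as below. This is a minimal illustration under stated assumptions: the placeholder format `[PN0]`, the function names, and the toy `corrector` are all hypothetical, the list of proper nouns is assumed to come from an external tagger, and an English sentence is used only to keep the example readable.

```python
def mask_proper_nouns(sentence, proper_nouns):
    """Replace each proper noun with a placeholder; return (template, mapping)."""
    mapping = {}
    # Longest names first, so a short name never clobbers part of a longer one.
    ordered = sorted(set(proper_nouns), key=lambda n: (-len(n), n))
    for i, noun in enumerate(ordered):
        placeholder = f"[PN{i}]"
        if noun in sentence:
            sentence = sentence.replace(noun, placeholder)
            mapping[placeholder] = noun
    return sentence, mapping

def restore_proper_nouns(template, mapping):
    """Put the original proper nouns back into a corrected template."""
    for placeholder, noun in mapping.items():
        template = template.replace(placeholder, noun)
    return template

def correct_with_template(sentence, proper_nouns, corrector):
    """Run `corrector` (any str -> str function) on the masked template only."""
    template, mapping = mask_proper_nouns(sentence, proper_nouns)
    return restore_proper_nouns(corrector(template), mapping)

# Toy stand-in for a sequence-to-sequence corrector.
toy_corrector = lambda text: text.replace("goed", "went")

template, mapping = mask_proper_nouns("Minsu goed to Seoul", ["Minsu", "Seoul"])
fixed = correct_with_template("Minsu goed to Seoul", ["Minsu", "Seoul"], toy_corrector)
```

Because the corrector only ever sees placeholders, it cannot "correct" a name it does not recognize, which is the failure mode the abstract attributes to context-confusing words.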

DNN based Speech Detection for the Media Audio (미디어 오디오에서의 DNN 기반 음성 검출)

Jang, Inseon;Ahn, ChungHyun;Seo, Jeongil;Jang, Younseon
- Journal of Broadcast Engineering
- /
- v.22 no.5
- /
- pp.632-642
- /
- 2017
In this paper, we propose a DNN based speech detection system using acoustic characteristics and context information of media audio. The speech detection for discriminating between speech and non-speech included in the media audio is a necessary preprocessing technique for effective speech processing. However, since the media audio signal includes various types of sound sources, it has been difficult to achieve high performance with the conventional signal processing techniques. The proposed method improves the speech detection performance by separating the harmonic and percussive components of the media audio and constructing the DNN input vector reflecting the acoustic characteristics and context information of the media audio. In order to verify the performance of the proposed system, a data set for speech detection was made using more than 20 hours of drama, and an 8-hour Hollywood movie data set, which was publicly available, was further acquired and used for experiments. In the experiment, it is shown that the proposed system provides better performance than the conventional method through the cross validation for two data sets.
https://doi.org/10.5909/JBE.2017.22.5.632
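The abstract does not say how context information enters the DNN input vector. One common convention in frame-based audio classification, assumed here purely for illustration, is to concatenate each feature frame with its neighbouring frames. The function name `stack_context` and the window sizes are hypothetical, and the harmonic/percussive separation step is not shown.

```python
def stack_context(frames, left=2, right=2):
    """Concatenate each frame with `left` past and `right` future frames.

    frames: list of per-frame feature lists, all of the same length.
    Frames at the edges are repeated so every output vector has the same size.
    """
    if not frames:
        return []
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), last)  # clamp to the valid range
            window.extend(frames[idx])
        stacked.append(window)
    return stacked

features = [[1], [2], [3]]
stacked = stack_context(features, left=1, right=1)
```

Each output vector has `(left + 1 + right)` times the per-frame dimension, so the classifier sees a short stretch of audio rather than a single frame.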

