Search | Korea Science

HKIB-20000 & HKIB-40075: Hangul Benchmark Collections for Text Categorization Research

Kim, Jin-Suk;Choe, Ho-Seop;You, Beom-Jong;Seo, Jeong-Hyun;Lee, Suk-Hoon;Ra, Dong-Yul
- Journal of Computing Science and Engineering
- /
- v.3 no.3
- /
- pp.165-180
- /
- 2009
The HKIB, or Hankookilbo, test collections are two archives of Korean newswire stories manually categorized with semi-hierarchical or hierarchical category taxonomies. The base newswire stories were made available by the Hankook Ilbo (The Korea Daily) for research purposes. At first, Chungnam National University and KISTI collaborated to manually tag 40,075 news stories with categories by semi-hierarchical and balanced three-level classification scheme, where each news story has only one level-3 category (single-labeling). We refer to this original data set as HKIB-40075 test collection. And then Yonsei University and KISTI collaborated to select 20,000 newswire stories from the HKIB-40075 test collection, to rearrange the classification scheme to be fully hierarchical but unbalanced, and to assign one or more categories to each news story (multi-labeling). We refer to this modified data set as HKIB-20000 test collection. We benchmark a k-NN categorization algorithm both on HKIB-20000 and on HKIB-40075, illustrating properties of the collections, providing baseline results for future studies, and suggesting new directions for further research on Korean text categorization problem.
https://doi.org/10.5626/JCSE.2009.3.3.165 인용 PDF

A Study On The Text Recognition Using Artificial Intelligence Technique (인공지능 기법을 이용한 텍스트 인식에 관한 연구)

이행세;최태영;김영길;김정우
- Journal of the Korean Institute of Telematics and Electronics
- /
- v.26 no.11
- /
- pp.1782-1793
- /
- 1989
Stroke crossing number, syntactic pattern recognition procedure, top down recognition structure, and heuristic approach are studied for the Korean text recognition. We propose new algorithms: 1)Korean vowel seperation using limited scanning method in the Korean characters, 2) extracting strokes using stroke width method, 3) stroke crossing number and its properties, 4) average, standard deviation, and mode of stroke crossing number, and 5) classification and recognition methods of limited chinese character. These are studied with computer simuladtions and experiments.
PDF

A System for Automatic Classification of Traditional Culture Texts (전통문화 콘텐츠 표준체계를 활용한 자동 텍스트 분류 시스템)

Hur, YunA;Lee, DongYub;Kim, Kuekyeng;Yu, Wonhee;Lim, HeuiSeok
- Journal of the Korea Convergence Society
- /
- v.8 no.12
- /
- pp.39-47
- /
- 2017
The Internet have increased the number of digital web documents related to the history and traditions of Korean Culture. However, users who search for creators or materials related to traditional cultures are not able to get the information they want and the results are not enough. Document classification is required to access this effective information. In the past, document classification has been difficult to manually and manually classify documents, but it has recently been difficult to spend a lot of time and money. Therefore, this paper develops an automatic text classification model of traditional cultural contents based on the data of the Korean information culture field composed of systematic classifications of traditional cultural contents. This study applied TF-IDF model, Bag-of-Words model, and TF-IDF/Bag-of-Words combined model to extract word frequencies for 'Korea Traditional Culture' data. And we developed the automatic text classification model of traditional cultural contents using Support Vector Machine classification algorithm.
https://doi.org/10.15207/JKCS.2017.8.12.039 인용 PDF KSCI

Document classification using a deep neural network in text mining (텍스트 마이닝에서 심층 신경망을 이용한 문서 분류)

Lee, Bo-Hui;Lee, Su-Jin;Choi, Yong-Seok
- The Korean Journal of Applied Statistics
- /
- v.33 no.5
- /
- pp.615-625
- /
- 2020
The document-term frequency matrix is a term extracted from documents in which the group information exists in text mining. In this study, we generated the document-term frequency matrix for document classification according to research field. We applied the traditional term weighting function term frequency-inverse document frequency (TF-IDF) to the generated document-term frequency matrix. In addition, we applied term frequency-inverse gravity moment (TF-IGM). We also generated a document-keyword weighted matrix by extracting keywords to improve the document classification accuracy. Based on the keywords matrix extracted, we classify documents using a deep neural network. In order to find the optimal model in the deep neural network, the accuracy of document classification was verified by changing the number of hidden layers and hidden nodes. Consequently, the model with eight hidden layers showed the highest accuracy and all TF-IGM document classification accuracy (according to parameter changes) were higher than TF-IDF. In addition, the deep neural network was confirmed to have better accuracy than the support vector machine. Therefore, we propose a method to apply TF-IGM and a deep neural network in the document classification.
https://doi.org/10.5351/KJAS.2020.33.5.615 인용 PDF KSCI

AP, IP Prediction For Corpus-based Korean Text-To-Speech (코퍼스 방식 음성합성에서의 개선된 운율구 경계 예측)

Kwon, O-Hil;Hong, Mun-Ki;Kang, Sun-Mee;Shin, Ji-Young
- Speech Sciences
- /
- v.9 no.3
- /
- pp.25-34
- /
- 2002
One of the most important factor in the performance of Korean text-to-speech system is the prediction of accentual and intonational phrase boundary. The previous method of prediction shows only the 75-85% which is not proper in the practical and commercial system. Therefore, more accurate prediction must be needed in the practical system. In this study, we propose the simple and more accurate method of the prediction of AP, IP.
PDF

Pyrolysis Mass Spectrometry as a Method for the Classification of Actinomycetes

;;;Michael Goodfellow
- Proceedings of the Zoological Society Korea Conference
- /
- 1994.10a
- /
- pp.89.2-89
- /
- 1994
No Abstract, See Full Text
PDF

Classification of Wetlands in Korea

Lee, Hyo-Hye-Mi;Cho, Kang-Hyun
- Proceedings of the Zoological Society Korea Conference
- /
- 1999.10b
- /
- pp.124.1-124
- /
- 1999
No Abstract, See Full Text
PDF

Development of SVM-based Construction Project Document Classification Model to Derive Construction Risk (건설 리스크 도출을 위한 SVM 기반의 건설프로젝트 문서 분류 모델 개발)

Kang, Donguk;Cho, Mingeon;Cha, Gichun;Park, Seunghee
- KSCE Journal of Civil and Environmental Engineering Research
- /
- v.43 no.6
- /
- pp.841-849
- /
- 2023
Construction projects have risks due to various factors such as construction delays and construction accidents. Based on these construction risks, the method of calculating the construction period of the construction project is mainly made by subjective judgment that relies on supervisor experience. In addition, unreasonable shortening construction to meet construction project schedules delayed by construction delays and construction disasters causes negative consequences such as poor construction, and economic losses are caused by the absence of infrastructure due to delayed schedules. Data-based scientific approaches and statistical analysis are needed to solve the risks of such construction projects. Data collected in actual construction projects is stored in unstructured text, so to apply data-based risks, data pre-processing involves a lot of manpower and cost, so basic data through a data classification model using text mining is required. Therefore, in this study, a document-based data generation classification model for risk management was developed through a data classification model based on SVM (Support Vector Machine) by collecting construction project documents and utilizing text mining. Through quantitative analysis through future research results, it is expected that risk management will be possible by being used as efficient and objective basic data for construction project process management.
https://doi.org/10.12652/Ksce.2023.43.6.0841 인용 PDF

Exploiting Korean Language Model to Improve Korean Voice Phishing Detection (한국어 언어 모델을 활용한 보이스피싱 탐지 기능 개선)

Boussougou, Milandu Keith Moussavou;Park, Dong-Joo
- KIPS Transactions on Software and Data Engineering
- /
- v.11 no.10
- /
- pp.437-446
- /
- 2022
Text classification task from Natural Language Processing (NLP) combined with state-of-the-art (SOTA) Machine Learning (ML) and Deep Learning (DL) algorithms as the core engine is widely used to detect and classify voice phishing call transcripts. While numerous studies on the classification of voice phishing call transcripts are being conducted and demonstrated good performances, with the increase of non-face-to-face financial transactions, there is still the need for improvement using the latest NLP technologies. This paper conducts a benchmarking of Korean voice phishing detection performances of the pre-trained Korean language model KoBERT, against multiple other SOTA algorithms based on the classification of related transcripts from the labeled Korean voice phishing dataset called KorCCVi. The results of the experiments reveal that the classification accuracy on a test set of the KoBERT model outperforms the performances of all other models with an accuracy score of 99.60%.
https://doi.org/10.3745/KTSDE.2022.11.10.437 인용 PDF KSCI

Search Result 413, Processing Time 0.026 seconds

HKIB-20000 & HKIB-40075: Hangul Benchmark Collections for Text Categorization Research

A Study On The Text Recognition Using Artificial Intelligence Technique (인공지능 기법을 이용한 텍스트 인식에 관한 연구)

A System for Automatic Classification of Traditional Culture Texts (전통문화 콘텐츠 표준체계를 활용한 자동 텍스트 분류 시스템)

Document classification using a deep neural network in text mining (텍스트 마이닝에서 심층 신경망을 이용한 문서 분류)

AP, IP Prediction For Corpus-based Korean Text-To-Speech (코퍼스 방식 음성합성에서의 개선된 운율구 경계 예측)

Pyrolysis Mass Spectrometry as a Method for the Classification of Actinomycetes

Classification of Wetlands in Korea

Similar Contents Recommendation Model Based On Contents Meta Data Using Language Model (언어모델을 활용한 콘텐츠 메타 데이터 기반 유사 콘텐츠 추천 모델)

Development of SVM-based Construction Project Document Classification Model to Derive Construction Risk (건설 리스크 도출을 위한 SVM 기반의 건설프로젝트 문서 분류 모델 개발)

Exploiting Korean Language Model to Improve Korean Voice Phishing Detection (한국어 언어 모델을 활용한 보이스피싱 탐지 기능 개선)

Search Result 413, Processing Time 0.026 seconds

이메일무단수집거부

이용약관

제 1 장 총칙

제 2 장 이용계약의 체결

제 3 장 계약 당사자의 의무

제 4 장 서비스의 이용

제 5 장 계약 해지 및 이용 제한

제 6 장 손해배상 및 기타사항

Detail Search

Image Search (β)