Fig. 1. Capturing semantic relationships between words through vector distance
Fig. 2. Overall design of the document classifier
Fig. 3. Data preprocessing pipeline of the document classifier
Fig. 4. Design of the keyword-based web crawler
Fig. 5. Doc2Vec model structure (PV-DM)
Fig. 6. Experiment with machine learning classifiers
Table 1. Examples of labeled data: hazard-related news and other news
Table 2. Doc2Vec parameters
Table 3. Comparative performance analysis of the classification models