A Comparative Study of Feature Extraction Methods for Authorship Attribution in the Text of Traditional East Asian Medicine with a Focus on Function Words

Authorship Attribution in Classical Texts of Korean Medicine: Focusing on the Role of Function Words

  • Oh, Junho (Korea Institute of Oriental Medicine)
  • Received : 2020.04.17
  • Accepted : 2020.05.11
  • Published : 2020.05.25

Abstract

Objectives: This study seeks to identify the most appropriate features for effective authorship attribution in texts of Traditional East Asian Medicine.

Methods: Using 'Variorum of the Nanjing' as the experimental corpus, the authorship attribution performance of a Support Vector Machine (SVM) was compared by cross-validation across three choices: function words versus content words, single words versus collocations, and whether or not IDF weighting was applied.

Results: The combination 'function words/uni-bigram/TF' performed best, with an accuracy of 0.732, while the combination 'content words/unigram/TFIDF' showed the lowest accuracy, 0.351.

Conclusions: These results indicate the following about authorship attribution in texts of Traditional East Asian Medicine. First, function words play a more important role than content words. Second, collocations were relatively important among content words, whereas single words were more important among function words. Third, unlike in general text analysis, IDF weighting resulted in worse performance.
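The feature combinations compared in the abstract (word class, n-gram range, weighting) can be illustrated with a minimal sketch. It is not the author's implementation. The function-word list, the character-level tokenization, the smoothed IDF formula, the choice to form bigrams after filtering, and the names `FUNCTION_WORDS`, `select_tokens`, `ngram_counts`, and `featurize` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical function-word list for Classical Chinese; the list actually
# used in the study is not given in this abstract.
FUNCTION_WORDS = {"之", "也", "者", "而", "其", "於", "以", "則", "故", "不"}


def select_tokens(tokens, word_class):
    """Keep only function words or only content words."""
    if word_class == "function":
        return [t for t in tokens if t in FUNCTION_WORDS]
    if word_class == "content":
        return [t for t in tokens if t not in FUNCTION_WORDS]
    raise ValueError(f"unknown word class: {word_class}")


def ngram_counts(tokens, ngram):
    """Count unigrams, and additionally adjacent pairs for 'uni-bigram'."""
    if ngram not in ("unigram", "uni-bigram"):
        raise ValueError(f"unknown n-gram setting: {ngram}")
    counts = Counter(tokens)
    if ngram == "uni-bigram":
        # Assumption: bigrams are formed over the already filtered tokens.
        counts.update(a + "_" + b for a, b in zip(tokens, tokens[1:]))
    return counts


def featurize(docs, word_class, ngram, weighting):
    """Return one sparse {term: weight} vector per tokenized document."""
    if weighting not in ("TF", "TFIDF"):
        raise ValueError(f"unknown weighting: {weighting}")
    counted = [ngram_counts(select_tokens(d, word_class), ngram) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for counts in counted for term in counts)
    vectors = []
    for counts in counted:
        total = sum(counts.values()) or 1  # avoid division by zero
        vec = {}
        for term, n in counts.items():
            weight = n / total  # relative term frequency
            if weighting == "TFIDF":
                # Smoothed IDF (an assumed variant, not taken from the paper).
                weight *= math.log((1 + n_docs) / (1 + df[term])) + 1
            vec[term] = weight
        vectors.append(vec)
    return vectors
```

Calling `featurize(docs, "function", "uni-bigram", "TF")` on a list of character-tokenized passages would give vectors for the best-performing combination reported above. Vectors of this kind could then be passed to an SVM classifier and scored by cross-validation, which this sketch leaves out.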
