A Study on Automatic Classification Model of Documents Based on Korean Standard Industrial Classification

한국표준산업분류를 기준으로 한 문서의 자동 분류 모델에 관한 연구

  • Lee, Jae-Seong (이재성) (Dept. of Science and Technology Management and Policy, University of Science & Technology) ;
  • Jun, Seung-Pyo (전승표) (Div. of Data Analysis, Korea Institute of Science & Technology Information / Dept. of Science and Technology Management and Policy, University of Science & Technology) ;
  • Yoo, Hyoung Sun (유형선) (Div. of Data Analysis, Korea Institute of Science & Technology Information / Dept. of Science and Technology Management and Policy, University of Science & Technology)
  • Received : 2018.05.31
  • Accepted : 2018.08.20
  • Published : 2018.09.30

Abstract

As we enter the knowledge society, information is increasingly emphasized as a new form of capital, and classifying it is becoming more important for the efficient management of an exponentially growing volume of digital information. This study aims to automatically classify and provide tailored information that can support companies' technology commercialization decisions. To that end, we propose a method for classifying information according to the Korean Standard Industrial Classification (KSIC), which indicates the business characteristics of enterprises. Document classification has mostly been studied with machine learning, but there is not enough training data labeled by KSIC, so this study instead calculates the similarity between documents. Specifically, we collect the explanatory text for each KSIC code, use the vector space model to calculate its similarity to the document to be classified, and propose a method and model that present the most appropriate KSIC code. To verify the methodology, we collected International Patent Classification (IPC) data, classified it by KSIC, and compared the results with the KSIC-IPC concordance table provided by the Korean Intellectual Property Office. The highest agreement was obtained with the LT method, a variant of the TF-IDF weighting formula: for IPC explanatory texts, the first-ranked KSIC code matched in 53% of cases, and the cumulative agreement through the fifth rank was 76%. These results confirm that the technology, industry, and market information needed by small and medium-sized enterprises (SMEs) can be classified by KSIC more quantitatively and objectively. The methods and results of this study may also serve as basic data to support experts' qualitative judgment when building concordance tables between heterogeneous classification systems.
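
The similarity-based ranking and the rank-based agreement check described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: it assumes "LT" denotes logarithmic term frequency multiplied by inverse document frequency (as in the SMART weighting notation), uses cosine similarity, and splits text on whitespace. The codes `code_food`, `code_chem`, and `code_soft`, their descriptions, and the reference mapping are hypothetical stand-ins for KSIC explanatory texts and the KSIC-IPC concordance table.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer (assumption; real Korean text needs a morphological analyzer).
    return text.lower().split()

def build_index(code_texts):
    """Build idf values and LT-weighted vectors from per-code explanatory texts."""
    docs = {code: tokenize(text) for code, text in code_texts.items()}
    n = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = {code: lt_weights(tokens, idf) for code, tokens in docs.items()}
    return idf, vectors

def lt_weights(tokens, idf):
    # Assumed LT weighting: (1 + ln tf) * idf; terms unseen in the index are dropped.
    counts = Counter(tokens)
    return {t: (1.0 + math.log(c)) * idf[t] for t, c in counts.items() if t in idf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_codes(document, idf, vectors, k=5):
    """Return the k codes whose explanatory text is most similar to the document."""
    query = lt_weights(tokenize(document), idf)
    scored = sorted(((cosine(query, vec), code) for code, vec in vectors.items()),
                    reverse=True)
    return [code for _, code in scored[:k]]

def cumulative_match(ranked, reference, k):
    """Share of documents with at least one reference code among their top-k codes."""
    hits = sum(1 for doc_id, codes in ranked.items()
               if any(code in reference[doc_id] for code in codes[:k]))
    return hits / len(ranked)

# Hypothetical toy data, not real KSIC or IPC content.
code_texts = {
    "code_food": "manufacture of food products and beverages",
    "code_chem": "manufacture of chemicals and chemical products",
    "code_soft": "software publishing and computer programming services",
}
idf, vectors = build_index(code_texts)
ranked = {
    "doc1": rank_codes("computer software for programming", idf, vectors, k=3),
    "doc2": rank_codes("chemical products", idf, vectors, k=3),
}
reference = {"doc1": {"code_soft"}, "doc2": {"code_food"}}
for k in (1, 2):
    print(k, cumulative_match(ranked, reference, k))
```

Because this approach needs only one explanatory text per code, it works without labeled training documents, which is the constraint the abstract gives for preferring similarity calculation over machine learning.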
