A Study on Knowledge Entity Extraction Method for Individual Stocks Based on Neural Tensor Network

  • Yang, Yunseok (Department of Investment Information Engineering, Yonsei University) ;
  • Lee, Hyun Jun (Department of Industrial Engineering, Yonsei University) ;
  • Oh, Kyong Joo (Department of Industrial Engineering, Yonsei University)
  • Received : 2019.02.19
  • Accepted : 2019.05.24
  • Published : 2019.06.30

Abstract

Selecting high-quality information that meets users' interests and needs from an overflow of content is becoming increasingly important. Amid this flood of information, efforts are under way to reflect the user's intention in search results more accurately, rather than treating an information request as a simple string. Large IT companies such as Google and Microsoft also focus on developing knowledge-based technologies, including search engines, that provide users with satisfaction and convenience. Finance in particular is a field where text data analysis is expected to be useful and to have great potential, because new information is generated constantly and the earlier information is obtained, the more valuable it is. Automatic knowledge extraction can be effective in areas such as the financial sector, where the flow of information is vast and new information continues to emerge. However, automatic knowledge extraction faces several practical difficulties. First, it is difficult to build corpora for different fields with the same algorithm and to extract high-quality triples. Second, producing human-labeled text data becomes harder as the extent and scope of knowledge grow and patterns are constantly updated. Third, performance evaluation is difficult because of the characteristics of unsupervised learning. Finally, defining the problem for automatic knowledge extraction is not easy because the concept of knowledge is itself ambiguous. To overcome these limitations and improve the semantic performance of stock-related information retrieval, this study attempts to extract knowledge entities using a neural tensor network and to evaluate the performance of the extraction. Unlike previous studies, the purpose of this study is to extract knowledge entities related to individual stocks.
Various but relatively simple data processing methods are applied in the presented model to solve the problems of previous research and to enhance the model's effectiveness. The study makes three contributions. First, it presents a practical and simple automatic knowledge extraction method that can be applied in practice. Second, it shows that performance evaluation is possible through a simple problem definition. Finally, it increases the expressiveness of the extracted knowledge by generating input data on a sentence basis without complex morphological analysis. The results of the empirical analysis and an objective performance evaluation method are also presented. For the empirical study conducted to confirm the usefulness of the presented model, we use experts' reports on 30 individual stocks, the top 30 by publication frequency from May 30, 2017 to May 21, 2018. The total number of reports is 5,600; 3,074 reports, about 55% of the total, are designated as the training set, and the remaining 45% are designated as the testing set. Before the model is constructed, all reports in the training set are classified by stock, and their entities are extracted using the KKMA named entity recognition tool. For each stock, the top 100 entities by frequency of appearance are selected and vectorized using one-hot encoding. Then, using the neural tensor network, one score function is trained per stock. When a new entity from the testing set appears, its score is calculated with every score function, and the stock whose function gives the highest score is predicted as the item related to the entity. To evaluate the presented model, we confirm its predictive power and determine whether the score functions are well constructed by calculating the hit ratio over all reports in the testing set.
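The pipeline described above (frequency-based entity selection, one-hot encoding, one score function per stock, prediction by highest score) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The score uses the general neural tensor network form u^T tanh(e1^T W e2 + V [e1; e2] + b). The abstract does not specify how the inputs to each score function are formed, so the pairing of two entity vectors, the shared toy vocabulary, the random untrained parameters, and all function names (top_entities, one_hot, init_params, ntn_score, predict_stock) are assumptions made only for this example.

```python
import math
import random
from collections import Counter


def top_entities(reports, k=100):
    """Return the k most frequent entities over a list of reports (lists of entities)."""
    counts = Counter(entity for report in reports for entity in report)
    return [entity for entity, _ in counts.most_common(k)]


def one_hot(entity, vocab):
    """One-hot vector over vocab; all zeros if the entity is not in vocab."""
    return [1.0 if entity == v else 0.0 for v in vocab]


def init_params(d, k, rng):
    """Random, untrained parameters for one stock (hypothetical initialisation)."""
    def r():
        return rng.uniform(-0.1, 0.1)
    return {
        "W": [[[r() for _ in range(d)] for _ in range(d)] for _ in range(k)],  # k slices of d x d
        "V": [[r() for _ in range(2 * d)] for _ in range(k)],                  # k x 2d
        "b": [r() for _ in range(k)],
        "u": [r() for _ in range(k)],
    }


def ntn_score(e1, e2, p):
    """Score u^T tanh(e1^T W e2 + V [e1; e2] + b), summed over the k tensor slices."""
    concat = e1 + e2
    score = 0.0
    for W_s, V_s, b_s, u_s in zip(p["W"], p["V"], p["b"], p["u"]):
        bilinear = sum(e1[i] * W_s[i][j] * e2[j]
                       for i in range(len(e1)) for j in range(len(e2)))
        linear = sum(v * x for v, x in zip(V_s, concat))
        score += u_s * math.tanh(bilinear + linear + b_s)
    return score


def predict_stock(e1, e2, params_by_stock):
    """Return the stock whose score function gives the highest score."""
    return max(params_by_stock, key=lambda s: ntn_score(e1, e2, params_by_stock[s]))


# Toy usage with made-up entities and two hypothetical stocks.
toy_reports = [["chip", "memory"], ["chip", "fab"], ["chip", "memory"]]
vocab = top_entities(toy_reports, k=3)
rng = random.Random(0)
params = {name: init_params(len(vocab), 2, rng) for name in ("STOCK_A", "STOCK_B")}
print(predict_stock(one_hot("chip", vocab), one_hot("memory", vocab), params))
```

With trained parameters in place of the random ones, the same argmax over per-stock scores gives the predicted stock for each new entity.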
In the empirical study, the presented model shows a hit ratio of 69.3% on the testing set, which consists of 2,526 reports. This hit ratio is meaningfully high given the constraints under which the research was conducted. Looking at the prediction performance for each stock, only three stocks, LG ELECTRONICS, KiaMtr, and Mando, show performance far below the average. This result may be due to interference from other similar items and to the generation of new knowledge. In this paper, we propose a methodology for finding the key entities, or combinations of entities, needed to search for related information in accordance with the user's investment intention. Graph data is generated using only the named entity recognition tool and is applied to the neural tensor network without learning a corpus or word vectors for the field. The empirical test confirms the effectiveness of the presented model, as described above. However, some limitations and points for improvement remain. Most notably, the fact that model performance is especially poor for only a few stocks shows the need for further research. Finally, the empirical study confirms that the learning method presented here can be used to semantically match new text information with related stocks.
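The reported split and the hit ratio can be checked with a few lines of arithmetic. The sketch below assumes the hit ratio is simply the share of test reports whose predicted stock equals the actual stock. The function name hit_ratio and the toy labels are illustrative, and only the report counts are taken from the abstract.

```python
def hit_ratio(predicted, actual):
    """Share of positions where the predicted stock equals the actual stock."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


# Report counts stated in the abstract.
TOTAL_REPORTS = 5600
TRAIN_REPORTS = 3074
TEST_REPORTS = TOTAL_REPORTS - TRAIN_REPORTS   # 2526, matching the stated testing set
train_share = TRAIN_REPORTS / TOTAL_REPORTS    # about 0.549, i.e. roughly 55%

# Toy labels (made up) to show the calculation itself.
toy_predicted = ["A", "B", "C", "A"]
toy_actual = ["A", "B", "A", "A"]
print(TEST_REPORTS, round(train_share, 3), hit_ratio(toy_predicted, toy_actual))
```

Applied to the 2,526 test reports, a 69.3% hit ratio corresponds to roughly 1,750 correctly matched reports; the exact count is not given in the abstract.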

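In the information age, the process of selecting high-quality information that fits users' interests and needs from an overflow of content grows more important with each generation. Amid the flood of information, efforts are being made to understand users' information requests semantically, rather than treating them as simple strings, so that search results reflect user intent more accurately. Large IT companies such as Google and Microsoft are also concentrating on developing search engines and knowledge-based technologies that provide satisfaction and convenience to users on the basis of semantic technology. The financial field in particular constantly produces vast amounts of new information, and the earlier the information, the greater its value, so it is one of the fields in which research on text data analysis is expected to be useful and to have potential for growth. Accordingly, to improve the semantic performance of stock-related information retrieval, this study attempts knowledge entity extraction for individual stocks using a neural tensor network, together with an evaluation of its performance. Whereas the main previous studies on neural tensor networks chiefly aimed to explore relations between knowledge entities through inference, this study mainly aims to extract the knowledge entities themselves that are related to individual stocks. Various data processing methods are applied in the model design process to solve the problems of previous related studies and to make the model more effective and realistic, and empirical analysis results are presented for objective performance evaluation. In an empirical analysis of expert reports issued between May 30, 2017 and May 21, 2018, the entities extracted by the presented model predicted the names of individual stocks with about 69% accuracy. These results show the applicability of the model presented in this study and indicate that performance can be improved through follow-up research and model refinement. Finally, the stock name prediction test confirmed that the learning method presented in this study can be used to approach new text information semantically and match it with related stocks.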

Keywords

Figures and tables (images not reproduced):

  • Neural Tensor Network
  • Result of Empirical Study by Stocks
  • Illustration for Graph Generation
  • Result of Empirical Study
