An Automatically Extracting Formal Information from Unstructured Security Intelligence Report

  • Hur, Yuna (Division of Computer Science and Engineering, Korea University) ;
  • Lee, Chanhee (Division of Computer Science and Engineering, Korea University) ;
  • Kim, Gyeongmin (Division of Computer Science and Engineering, Korea University) ;
  • Jo, Jaechoon (Division of Smart Information Communication Engineering, Sangmyung University) ;
  • Lim, Heuiseok (Division of Computer Science and Engineering, Korea University)
  • Received : 2019.10.02
  • Accepted : 2019.11.20
  • Published : 2019.11.28

Abstract

To predict and respond to cyber attacks, many security companies quickly identify the methods, types, and characteristics of attack techniques and publish Security Intelligence Reports (SIRs) on them. However, the SIRs distributed by each company are voluminous and do not follow a common format. To reduce the time needed to extract information from large volumes of unstructured SIRs, this paper proposes a framework that applies five analysis techniques to structure the reports and extract their key information. Because SIR data have no ground-truth labels, four of the techniques, Keyword Extraction, Topic Modeling, Summarization, and Document Similarity, rely on unsupervised learning. For the fifth technique, we built a dataset for extracting threat information from SIRs and applied Named Entity Recognition (NER) to recognize words that denote an IP, Domain/URL, Hash, or Malware and to determine which of these types each word belongs to.
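As a rough illustration of two of the unsupervised analyses named in the abstract, keyword extraction and document similarity, the sketch below scores terms with TF-IDF and compares reports by cosine similarity. The abstract does not say which algorithms the framework uses, so TF-IDF and cosine similarity are assumptions chosen for simplicity. The three sample reports and all function names (`tokenize`, `tfidf_vectors`, `top_keywords`, `cosine`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def top_keywords(vector, k=3):
    """Highest-weighted terms; ties are broken alphabetically."""
    ranked = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical one-line stand-ins for full SIRs (no labels are needed).
reports = [
    "Ransomware campaign encrypts files, demands ransom payment",
    "New ransomware variant encrypts files on network shares",
    "Phishing emails steal banking credentials from users",
]
vectors = tfidf_vectors(reports)

for index, vector in enumerate(vectors):
    print(index, top_keywords(vector))
print(round(cosine(vectors[0], vectors[1]), 3))
```

Neither step requires labeled data, which matches the abstract's motivation for unsupervised learning. A production system would likely add stop-word removal and a stronger sentence or document representation.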
