A Proposal of Methods for Extracting Temporal Information of History-related Web Document based on Historical Objects Using Machine Learning Techniques

역사객체 기반의 기계학습 기법을 활용한 웹 문서의 시간정보 추출 방안 제안

  • Lee, Jun (Dept. of Telecomm. and Info. Engineering, Korea Aerospace University)
  • Kwon, YongJin (Dept. of Telecomm. and Info. Engineering, Korea Aerospace University)
  • Received : 2015.04.26
  • Accepted : 2015.06.19
  • Published : 2015.08.31

Abstract

When retrieving information through a search engine, users sometimes want documents that correspond to a specific time period. For example, a user looking for documents about the situation before the Japanese invasions of Korea would typically search with the keyword 'Japanese invasions of Korea'. The search engine then returns every document about the invasions regardless of the time period each one covers, so the user has to do additional work to sort them. In addition, for most history-related documents, the date on which the document was created differs from the time period its content describes. If the time period of a document's content can be extracted, it can support more effective retrieval and a variety of applications. In this paper, we therefore study how to extract the time period of history-related documents about the Joseon era, using historical literature on the Joseon era to determine the time period that corresponds to a document's content. We define historical objects based on historical literature collected from the web and confirm the feasibility of extracting the time period of web documents with machine learning techniques. In addition to the machine learning techniques, we propose and apply similarity filtering based on comparison between historical objects. Finally, we evaluate the accuracy of the resulting temporal indexing and the improvement obtained.

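When searching for information through a search engine, users sometimes want to retrieve documents that correspond to the situation in a specific time period. For example, searching with the keyword 'Imjin War' (임진왜란) to find documents about the situation before the Imjin War returns every document from during, before, and after the war regardless of time, so additional work is required. Moreover, for history-related documents, the temporal information corresponding to the document's content in most cases does not match the time the document was created. If the temporal information corresponding to the content of a web document could be extracted, it would be applicable to effective information retrieval as well as to various applications. With the aim of extracting the temporal information corresponding to document content, this paper therefore studies time extraction for history-related documents about the Joseon era, using historical literature covering the Joseon era. We define historical objects based on historical literature and on history-related documents collected from the web, and on this basis use various machine learning techniques to confirm the feasibility of extracting temporal information from web documents. We also propose a filtering step based on object similarity within the machine learning process, and compare and analyze the results for efficient temporal information extraction and accuracy improvement when this step is applied.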
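The abstract does not name the specific learners or the exact similarity measure, so the following is only a minimal sketch of the general idea, under stated assumptions: each document is reduced to the set of historical objects it mentions, training documents whose object sets are too dissimilar are filtered out, and the remaining ones cast a similarity-weighted vote for a time period. The Jaccard measure, the nearest-neighbour style vote, the threshold of 0.2, the names `TRAINING`, `jaccard`, and `predict_period`, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical toy training set: (historical objects mentioned, reign label).
# Both the objects and the labels are illustrative, not data from the paper.
TRAINING = [
    ({"Yi Sun-sin", "Hansan Island", "Toyotomi Hideyoshi"}, "Seonjo"),
    ({"Yi Sun-sin", "Myeongnyang", "Seonjo"}, "Seonjo"),
    ({"Hunminjeongeum", "Jiphyeonjeon", "Jang Yeong-sil"}, "Sejong"),
    ({"Jang Yeong-sil", "Jiphyeonjeon"}, "Sejong"),
    ({"Hwaseong Fortress", "Jeong Yak-yong", "Kyujanggak"}, "Jeongjo"),
]


def jaccard(a, b):
    """Jaccard similarity between two sets of historical objects."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def predict_period(objects, training=TRAINING, threshold=0.2):
    """Similarity-filtered, similarity-weighted vote over training documents.

    Training documents whose similarity to `objects` falls below
    `threshold` are filtered out (assumed filtering rule). Returns the
    winning period label, or None if nothing passes the filter.
    """
    votes = Counter()
    for train_objects, period in training:
        sim = jaccard(objects, train_objects)
        if sim >= threshold:
            votes[period] += sim  # weight each vote by its similarity
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

With this toy data, `predict_period({"Yi Sun-sin", "Toyotomi Hideyoshi"})` returns `"Seonjo"`, and a document whose objects match no training document returns `None` rather than a forced guess.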
