Change Detection of Structured Documents using Path-Matching Algorithm

경로 매칭 알고리즘을 이용한 구조화된 문서의 변화 탐지

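  • 이경호 (National Institute of Standards and Technology, USA)
  • 변창원 (Freechal Co., Ltd.)
  • 최윤철 (Department of Computer Science, Yonsei University)
  • 고견 (Department of Computer and Information Engineering, Cheongju University)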
  • Published : 2001.12.01

Abstract

This paper presents an efficient algorithm for computing the difference between the old and new versions of an SGML/XML document. The difference between the two versions can be regarded as an edit script that transforms one document tree into the other. The proposed algorithm is a hybrid of bottom-up and top-down methods: matching relationships between nodes in the two versions are produced in a bottom-up manner, and a top-down breadth-first search then computes the edit script. Because the algorithm does not need to examine all nodes for possible matches, matching is faster. Furthermore, it can detect structurally meaningful changes such as subtree move and copy, in addition to simple changes to a node itself such as insert, delete, and update.

본 논문에서는 SGML/XML 문서의 구 버전과 신 버전 간의 차이를 계산할 수 있는 효율적인 알고리즘을 제안한다. 차이는 구 버전의 문서를 신 버전으로 변환하는 데 소요되는 편집 스크립트로 간주할 수 있다. 제안된 알고리즘은 상향식과 하향식의 복합적인 접근 방식을 적용한다. 먼저 두 버전을 구성하는 노드 간의 대응관계를 상향식으로 생성하며, 하향식 너비 우선 탐색을 적용하여 편집 스크립트를 계산한다. 제안된 알고리즘은 모든 노드 간의 대응 여부를 모두 조사할 필요가 없기 때문에 대응관계를 보다 빠르게 생성할 수 있다. 또한 삽입, 삭제, 그리고 갱신의 단순한 변화는 물론이고 부트리 이동과 복사의 구조적으로 보다 의미 있는 변화를 탐지할 수 있다.
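
The two-phase idea in the abstract can be illustrated with a small sketch. The code below is not the paper's path-matching algorithm, whose details the abstract does not give. It is a simplified stand-in under stated assumptions: subtree signatures are computed bottom-up, and a top-down breadth-first pass then pairs nodes and emits insert, delete, update, and move operations. Copy detection is omitted. The names `Node`, `sign`, `descendants`, and `diff`, and the tuple layout of the edit operations, are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity-based hashing, so nodes can be dict/set members
class Node:
    label: str
    text: str = ""
    children: list = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, *kids):
        for kid in kids:
            kid.parent = self
            self.children.append(kid)
        return self


def sign(node, sig):
    """Bottom-up pass: a node's signature is built from its children's signatures."""
    sig[node] = (node.label, node.text, tuple(sign(c, sig) for c in node.children))
    return sig[node]


def descendants(node):
    yield node
    for child in node.children:
        yield from descendants(child)


def diff(old_root, new_root):
    """Return a list of edit operations turning old_root into new_root (sketch)."""
    sig = {}
    sign(old_root, sig)
    sign(new_root, sig)
    old_by_sig = {}
    for o in descendants(old_root):
        old_by_sig.setdefault(sig[o], []).append(o)

    claimed = {old_root}  # old nodes already accounted for
    script = []
    queue = deque([(old_root, new_root)])  # assumption: the two roots correspond
    while queue:  # top-down, breadth-first
        o, n = queue.popleft()
        if sig[o] == sig[n]:  # identical subtrees: nothing below needs a visit
            claimed.update(descendants(o))
            continue
        if o.text != n.text:
            script.append(("update", n.label, o.text, n.text))
        # Pair children: exact signature matches first, then same-label fallbacks.
        pairs = {}
        tests = (lambda k, c: sig[k] == sig[c], lambda k, c: k.label == c.label)
        for test in tests:
            for c in n.children:
                if c in pairs:
                    continue
                k = next((k for k in o.children
                          if k not in claimed and test(k, c)), None)
                if k is not None:
                    claimed.add(k)
                    pairs[c] = k
        for c in n.children:
            if c in pairs:
                queue.append((pairs[c], c))
                continue
            # An identical, still unclaimed old subtree elsewhere counts as a move.
            moved = next((k for k in old_by_sig.get(sig[c], [])
                          if k not in claimed), None)
            if moved is not None:
                claimed.update(descendants(moved))
                script.append(("move", c.label, moved.parent.label, n.label))
            else:
                script.append(("insert", c.label, n.label))
    # Report only the topmost node of each deleted subtree.
    for o in descendants(old_root):
        if o not in claimed and o.parent in claimed:
            script.append(("delete", o.label, o.parent.label))
    return script


old = Node("doc").add(
    Node("chapter").add(Node("para", "alpha"), Node("para", "beta")),
    Node("appendix").add(Node("para", "gamma")),
)
new = Node("doc").add(
    Node("chapter").add(Node("para", "alpha")),
    Node("appendix").add(Node("para", "gamma!"), Node("para", "beta")),
    Node("note", "new"),
)
for op in diff(old, new):
    print(op)
```

Because identical subtrees are recognized by signature, the breadth-first pass stops descending as soon as a pair matches exactly, which loosely mirrors the abstract's point that not every node needs to be examined. Building nested-tuple signatures is a convenience for the sketch; a real implementation would more likely use hashes.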
