A Method for Automatic Detection of Character Encoding of Multi Language Document File

Seo, Min Ji; Kim, Myung Ho

doi:10.5626/KTCP.2016.22.4.170

KIISE Transactions on Computing Practices (정보과학회 컴퓨팅의 실제 논문지)

Volume 22, Issue 4, Pages 170-177, 2016
pISSN 2383-6318, eISSN 2383-6326

Korean Institute of Information Scientists and Engineers (한국정보과학회)

다중 언어로 작성된 문서 파일에 적용된 문자 인코딩 자동 인식 기법

Seo, Min Ji (서민지), Department of Convergence Software, Soongsil University
Kim, Myung Ho (김명호), Department of Convergence Software, Soongsil University

Received : 2015.11.24
Accepted : 2016.01.19
Published : 2016.04.15

https://doi.org/10.5626/KTCP.2016.22.4.170

Abstract

Character encoding converts a document into a binary file, using a code table, so that it can be stored on a computer. To read such a file, the decoder must know which code table was applied when the file was encoded; identifying that code table is therefore an essential part of decoding. In this paper, we propose a method that automatically detects the character encoding of a given binary document file. The method proceeds through several stages to raise the detection rate: escape character detection, character code range detection, character code characteristic detection, and commonly used word detection. The commonly used word stage draws on word databases for multiple languages and classifies the document by language before deciding on an encoding, so it achieves a much higher detection rate on multi-language files than other methods. When the language that an encoding primarily represents makes up less than 20% of the document, the conventional method recognizes the encoding in about 50% of cases, whereas the proposed method achieves a recognition rate of 96% or higher regardless of that proportion.
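The staged approach described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the escape markers, the candidate encodings, the tiny word lists, and all function names are assumptions chosen for illustration, and the character code characteristic stage is omitted. The sketch only shows how an escape check, a code range check, and a common-word check might be chained.

```python
# Illustrative sketch only; every name and word list here is hypothetical.

# Escape sequences that announce an ISO-2022 family encoding.
ESCAPE_MARKERS = {
    b"\x1b$)C": "iso2022_kr",
    b"\x1b$B": "iso2022_jp",
    b"\x1b$@": "iso2022_jp",
}

# Tiny stand-ins for per-language common-word databases.
COMMON_WORDS = {
    "euc_kr": ["그리고", "있다", "하는", "것이다"],
    "shift_jis": ["これ", "です", "ます", "した"],
    "gb2312": ["的", "我们", "一个", "是"],
}


def detect_by_escape(data):
    """Stage 1: look for encoding-specific escape sequences."""
    for marker, encoding in ESCAPE_MARKERS.items():
        if marker in data:
            return encoding
    return None


def detect_by_range(data):
    """Stage 2: use byte value ranges (pure ASCII, then strict UTF-8)."""
    if all(byte < 0x80 for byte in data):
        return "ascii"
    try:
        data.decode("utf_8")
        return "utf_8"
    except UnicodeDecodeError:
        return None


def detect_by_common_words(data):
    """Stage 3: decode with each candidate and count common-word hits."""
    best, best_hits = None, 0
    for encoding, words in COMMON_WORDS.items():
        text = data.decode(encoding, errors="ignore")
        hits = sum(text.count(word) for word in words)
        if hits > best_hits:
            best, best_hits = encoding, hits
    return best


def detect_encoding(data):
    """Run the stages in order and return the first answer, or None."""
    return (
        detect_by_escape(data)
        or detect_by_range(data)
        or detect_by_common_words(data)
    )


if __name__ == "__main__":
    mixed = "This report is mostly English. 그리고 결과가 있다."
    print(detect_encoding(mixed.encode("euc_kr")))
```

Because the common-word stage counts hits rather than measuring what share of the bytes belong to one language, a short Korean passage inside a mostly English file can still point to a Korean encoding. A realistic word database would be far larger than the lists shown here.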
