LSTM based sequence-to-sequence Model for Korean Automatic Word-spacing

  • Received : 2018.09.27
  • Accepted : 2018.11.08
  • Published : 2018.12.31

Abstract

We propose an LSTM-based RNN model that effectively performs automatic word spacing for Korean. To handle long or noisy sentences, which are known to be difficult for neural-network training, we defined suitable input and decoding data formats and applied dropout, bidirectional multi-layer LSTM cells, layer normalization, and an attention mechanism to improve performance. Although the Sejong corpus used for training contains some spacing errors, the model still learned meaningful Korean word-spacing patterns: dropout prevented overfitting and made the model robust to this noise. Experimental results show that the LSTM sequence-to-sequence model achieves an F1-measure of 0.94, outperforming both a rule-based method and the deep-learning GRU-CRF approach.
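The abstract does not spell out the input and decoding data formats it mentions. One plausible formulation, shown here purely as an illustrative assumption (not necessarily the paper's exact scheme), treats the input as a space-stripped character sequence and the target as per-character binary labels marking where a space should follow:

```python
def encode(sentence):
    """Strip spaces; labels[i] = 1 if a space follows the i-th kept character."""
    chars, labels = [], []
    for i, ch in enumerate(sentence):
        if ch == ' ':
            continue
        chars.append(ch)
        follows_space = i + 1 < len(sentence) and sentence[i + 1] == ' '
        labels.append(1 if follows_space else 0)
    return chars, labels

def decode(chars, labels):
    """Reinsert spaces according to the (predicted) labels."""
    out = []
    for ch, lab in zip(chars, labels):
        out.append(ch)
        if lab:
            out.append(' ')
    return ''.join(out).rstrip()

# Round trip on a sample sentence ("I go to school"):
chars, labels = encode("나는 학교에 간다")
assert decode(chars, labels) == "나는 학교에 간다"
```

Under this formulation, the sequence-to-sequence model's decoder only has to emit a binary tag per input character, which keeps output sequences the same length as the inputs even for long sentences.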

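The reported F1-measure of 0.94 combines precision and recall over the model's space predictions. As a reminder of how that metric is computed (the paper's exact evaluation protocol may differ), here is a minimal sketch scoring predicted space positions against gold positions:

```python
def f1_score(gold_spaces, pred_spaces):
    """F1 over space positions: harmonic mean of precision and recall."""
    gold, pred = set(gold_spaces), set(pred_spaces)
    if not gold or not pred:
        return 0.0
    tp = len(gold & pred)                # correctly predicted spaces
    precision = tp / len(pred)           # fraction of predictions that are right
    recall = tp / len(gold)              # fraction of gold spaces recovered
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Gold spaces after character indices {1, 4}; model predicts {1, 4, 6}.
print(round(f1_score({1, 4}, {1, 4, 6}), 2))  # → 0.8
```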