• Title/Summary/Keyword: 음절 표현 (syllable representation)


SMS Text Messages Filtering using Word Embedding and Deep Learning Techniques (워드 임베딩과 딥러닝 기법을 이용한 SMS 문자 메시지 필터링)

  • Lee, Hyun Young;Kang, Seung Shik
    • Smart Media Journal / v.7 no.4 / pp.24-29 / 2018
  • Text analysis for natural language processing with deep learning represents words as vectors through word embedding. In this paper, we propose a method that constructs a document vector and classifies a text message as spam or normal, using word embedding and deep learning. Automatic spacing applied during preprocessing ensures that words with similar contexts are represented close together in the vector space. Additionally, spam messages contain intentional word-formation errors, using non-alphabetic or unusual characters, that are designed to avoid being blocked by spam filters. Two embedding algorithms, CBOW and skip-gram, are used to produce the sentence vector, and the performance and accuracy of the deep learning-based spam filter model are measured by comparison with those of SVM Light.
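
As a rough illustration of how word vectors can be combined into a document vector, the sketch below averages the vectors of the tokens in a message. The abstract does not say how the document vector is built, so averaging is an assumption, and the `EMBEDDINGS` table and `document_vector` helper are hypothetical stand-ins for trained CBOW or skip-gram vectors.

```python
# Toy 3-dimensional word vectors standing in for trained CBOW or skip-gram
# embeddings (hypothetical values, for illustration only).
EMBEDDINGS = {
    "free": [0.9, 0.1, 0.0],
    "prize": [0.8, 0.2, 0.1],
    "meeting": [0.0, 0.9, 0.3],
    "tomorrow": [0.1, 0.8, 0.4],
}


def document_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; return a zero vector if none are known.

    Averaging is an assumption of this sketch, not the method stated in the abstract.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Out-of-vocabulary tokens such as "now" are simply skipped.
spam_like = document_vector(["free", "prize", "now"], EMBEDDINGS, 3)
normal_like = document_vector(["meeting", "tomorrow"], EMBEDDINGS, 3)
print(spam_like, normal_like)
```

The resulting fixed-length vector could then be fed to any classifier, for example a neural network or an SVM, which is the comparison the abstract describes.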

The Differences of Naming by Word Frequency, Length, and Animacy in Nonfluent Aphasic Patients (비유창성 실어증 환자의 단어빈도 및 길이, 생물성에 따른 이름대기 수행의 차이)

  • Kwon, Jung Hee;Choi, Hyun Joo
    • 재활복지 / v.20 no.1 / pp.171-188 / 2016
  • The purpose of this study is to investigate the effects of three conditions (word frequency, word length, and animacy) on naming performance in nonfluent aphasic patients. Fifteen nonfluent aphasic patients and 15 normal adults participated in this study. The stimuli consisted only of concrete nouns, and a confrontation naming test was used. The test consisted of 40 items, and the word conditions were frequency (low-frequency/high-frequency), length (1 syllable/3 syllables), and animacy (animate/inanimate). The results were as follows. First, naming was better for high-frequency words than for low-frequency words in both groups. Second, naming was better for 1-syllable words than for 3-syllable words in both groups. Third, naming performance did not differ significantly by animacy in either group. These results indicate that word frequency and length have a greater influence on naming than animacy does, and that the difference by word frequency was more pronounced for nonfluent aphasic patients than for normal adults. The results suggest that, when selecting target words for the assessment and intervention of nonfluent aphasic patients in clinical settings, word frequency should be considered first.

Visualization of Korean Speech Based on the Distance of Acoustic Features (음성특징의 거리에 기반한 한국어 발음의 시각화)

  • Pok, Gou-Chol
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology / v.13 no.3 / pp.197-205 / 2020
  • In Korean, the pronunciation of phoneme units such as vowels and consonants is fixed, and the pronunciation associated with a written symbol does not change, so foreign learners can approach the language rather easily. However, when one pronounces words, phrases, or sentences, the pronunciation changes in widely varying and complex ways at syllable boundaries, and the association between notation and pronunciation no longer holds. Consequently, it is very difficult for foreign learners to learn standard Korean pronunciation. Despite these difficulties, systematic analysis of pronunciation errors in Korean words is believed to be possible, because the relationship between Korean notation and pronunciation can be described as a set of firm rules without exceptions, unlike other languages such as English. In this paper, we propose a visualization framework that shows the differences between standard and erroneous pronunciations as quantitative measures on the computer screen. Previous studies show only color representations and 3D graphics of speech properties, or an animated view of the changing shapes of the lips and mouth cavity. Moreover, the features used in their analysis are only point data, such as the average over a speech range. In this study, we propose a method that uses the time-series data directly instead of summarized or distorted data. This was realized with a deep learning-based technique that combines a self-organizing map, a variational autoencoder model, and a Markov model, and it achieved better performance than the method using point-based data.

Expansion of Word Representation for Named Entity Recognition Based on Bidirectional LSTM CRFs (Bidirectional LSTM CRF 기반의 개체명 인식을 위한 단어 표상의 확장)

  • Yu, Hongyeon;Ko, Youngjoong
    • Journal of KIISE / v.44 no.3 / pp.306-313 / 2017
  • Named entity recognition (NER) seeks to locate named entities in text and classify them into pre-defined categories such as names of persons, organizations, locations, and expressions of time. Recently, many state-of-the-art NER systems have been implemented with bidirectional LSTM CRFs. Deep learning models based on long short-term memory (LSTM) generally depend on word representations as input. In this paper, we propose an approach that expands the word representation by using pre-trained word embedding, part-of-speech (POS) tag embedding, syllable embedding, and named entity dictionary feature vectors. Our experiments show that the proposed approach creates useful word representations as input to bidirectional LSTM CRFs. Our final results show performance 8.05 percentage points higher than that of a baseline NER system using only the pre-trained word embedding vector.

Context Based Real-time Korean Writing Correction for Foreigners (외국인 학습자를 위한 문맥 기반 실시간 국어 문장 교정)

  • Park, Young-Keun;Kim, Jae-Min;Lee, Seong-Dong;Lee, Hyun Ah
    • Journal of KIISE / v.44 no.10 / pp.1087-1093 / 2017
  • Teaching Korean to foreigners is attracting increasing attention as the number of foreigners who want to learn Korean or reside in Korea grows. Existing spell checkers mostly target native Korean speakers, so they are poorly suited to foreign learners. In this paper, we propose a correction method for Korean that reflects the contextual characteristics of Korean and the writing characteristics of foreigners. Our method builds a syllable reverse index over eojeol bigrams extracted from a corpus, which lets it retrieve expressions frequently used by Koreans as correction candidates, and it generates ranked Korean corrections for foreigners with an improved edit distance calculation. Our system provides a user interface based on keyboard hooking, so a user can easily use the correction system alongside other applications. Our system improves the detection rate for foreign users by about 45% compared to other systems in foreign-language writing environments. This will help foreign users judge and correct their own writing errors.
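
The candidate retrieval and ranking idea can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the bigram list is invented, the function names are hypothetical, and plain Levenshtein distance stands in for the improved edit distance the abstract mentions, whose details are not given.

```python
from collections import defaultdict


def build_syllable_index(bigrams):
    """Map each syllable (character) to the set of eojeol bigrams that contain it."""
    index = defaultdict(set)
    for bigram in bigrams:
        for ch in bigram:
            if not ch.isspace():
                index[ch].add(bigram)
    return index


def edit_distance(a, b):
    """Plain Levenshtein distance over characters (one Hangul syllable per character)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def suggest(query, index, top_k=3):
    """Collect bigrams sharing any syllable with the query, then rank by edit distance."""
    candidates = set()
    for ch in query:
        candidates |= index.get(ch, set())
    return sorted(candidates, key=lambda c: (edit_distance(query, c), c))[:top_k]


# Hypothetical corpus bigrams; a real index would be built from a large corpus.
corpus_bigrams = ["학교에 갑니다", "학교에 갔어요", "집에 갑니다"]
index = build_syllable_index(corpus_bigrams)
print(suggest("학교에 감니다", index))
```

Looking up candidates by shared syllable keeps the edit distance computation limited to bigrams that overlap with the input, rather than the whole corpus.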

Recognition of Korean Implicit Citation Sentences Using Machine Learning with Lexical Features (어휘 자질 기반 기계 학습을 사용한 한국어 암묵 인용문 인식)

  • Kang, In-Su
    • Journal of the Korea Academia-Industrial cooperation Society / v.16 no.8 / pp.5565-5570 / 2015
  • Implicit citation sentence recognition aims to locate, in the full text of articles, citation sentences that lack explicit citation markers. State-of-the-art approaches exploit word n-grams, clue words, researchers' surnames, mentions of previous methods, and distance to the nearest explicit citation sentence, among other features, reaching over 50% performance. However, most previous work has been conducted on English. For Korean, a rule-based method using positive/negative clue patterns was reported to attain 42% performance, leaving room for improvement. This study attempted to learn to recognize implicit citation sentences in the full text of Korean articles using Korean lexical features. Different lexical feature units, namely eojeol, morpheme, and eumjeol, were evaluated to determine suitable lexical features for Korean implicit citation sentence recognition. In addition, lexical features were combined with position features representing backward/forward proximity to explicit citation sentences, improving performance to over 50%.

Development of Sensor System for Finger Gesture (수화 인식에 대한 센서 시스템)

  • Lee, Jaehong;Jeong, Eunseok;Kim, DaeEun
    • Proceedings of the Korean Society of Broadcast Engineers Conference / 2011.07a / pp.4-5 / 2011
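  • Sign language is a language in which people communicate through body motions or finger movements. To communicate in this language through digital media, it must be possible to represent motions as meaningful words and syllables. Here we focus on finger movements rather than body motions or arm and leg movements, and examine the sensor system needed for fingerspelling recognition. To recognize continuous input of finger letters and finger numbers, recognizing the break between characters is the most important problem. Recognizing the break position is the basis for separating the currently input pattern from the next one so that each can be recognized as a distinct finger letter or finger number. To develop a method for distinguishing and recognizing finger bending, we surveyed various applicable sensors based on an analysis of the characteristics of sign languages in different languages, developed core technology for a sign-language glove, and built a glove prototype.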


A Noun Extractor using Connectivity Information (좌우접속정보를 이용한 명사추출기)

  • An, Dong-Un
    • Annual Conference on Human and Language Technology / 1999.10d / pp.173-178 / 1999
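  • The noun extractor in this paper is an index-term extractor for information retrieval systems. It extracts nouns from the morphemes obtained through morphological analysis that uses left-right connectivity information. The morphological analyzer separates the linguistic knowledge used for analysis from the eojeol segmentation engine, making it easy to modify and extend. The linguistic knowledge used is the left-right connectivity information, which represents as a matrix whether the parts of speech of the morphemes that make up an eojeol can connect to each other. The segmentation engine consults a dictionary to split an eojeol into morphemes by longest match, then consults the connectivity information to judge whether the segmentation is valid. The part-of-speech classification of morphemes is based on a standard tag set, extended with syllable information. Because there is no module for estimating nouns from unknown words when morphological analysis produces them, recall was poor.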
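
The segmentation step described above can be sketched as greedy longest match followed by a connectivity check. The dictionary entries, tag names, and `CONNECTABLE` pairs below are hypothetical toy data, not the paper's actual tag set or connectivity matrix.

```python
# Hypothetical toy dictionary: surface form -> part-of-speech tag.
DICTIONARY = {
    "학교": "NOUN",
    "학": "NOUN",
    "에서": "JOSA",
    "에": "JOSA",
    "가": "VERB",
    "다": "EOMI",
}

# Hypothetical connectivity table: (left tag, right tag) pairs allowed to be adjacent.
CONNECTABLE = {("NOUN", "JOSA"), ("VERB", "EOMI")}


def longest_match(eojeol, dictionary):
    """Greedily split an eojeol into the longest dictionary entries, left to right."""
    morphemes, pos = [], 0
    while pos < len(eojeol):
        for end in range(len(eojeol), pos, -1):
            piece = eojeol[pos:end]
            if piece in dictionary:
                morphemes.append((piece, dictionary[piece]))
                pos = end
                break
        else:
            return None  # unknown word: no dictionary entry starts at this position
    return morphemes


def is_connectable(morphemes, connectable):
    """Check every adjacent pair of tags against the connectivity table."""
    return all((left[1], right[1]) in connectable
               for left, right in zip(morphemes, morphemes[1:]))


def extract_nouns(eojeol):
    """Return the nouns of a validly segmented eojeol, or an empty list otherwise."""
    morphemes = longest_match(eojeol, DICTIONARY)
    if morphemes is None or not is_connectable(morphemes, CONNECTABLE):
        return []
    return [surface for surface, tag in morphemes if tag == "NOUN"]


print(extract_nouns("학교에서"))
```

In this sketch an eojeol containing an unknown word yields no nouns at all, which mirrors the recall limitation the abstract reports for unknown words.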


Sequence-to-sequence Autoencoder based Korean Text Error Correction using Syllable-level Multi-hot Vector Representation (음절 단위 Multi-hot 벡터 표현을 활용한 Sequence-to-sequence Autoencoder 기반 한글 오류 보정기)

  • Song, Chisung;Han, Myungsoo;Cho, Hoonyoung;Lee, Kyong-Nim
    • Annual Conference on Human and Language Technology / 2018.10a / pp.661-664 / 2018
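  • Posts on online bulletin boards and conversations exchanged in chat windows form a text corpus that reflects the characteristics of colloquial language in actual use, which makes them good training data for language models in speech recognition. However, because online text contains a great deal of noise, it is difficult to use directly for training. In this paper, we propose a sequence-to-sequence denoising autoencoder model for correcting Hangul errors in sentences that contain many user input errors.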
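
The syllable-level multi-hot representation named in the title can be illustrated with the standard Unicode decomposition of a precomposed Hangul syllable into initial, medial, and final jamo indices. The abstract does not give the paper's exact vector layout, so the 68-dimensional layout below (19 initials, 21 medials, 28 finals including a "no final" slot) and the function names are assumptions.

```python
HANGUL_BASE = 0xAC00                 # code point of the first precomposed syllable, "가"
N_CHO, N_JUNG, N_JONG = 19, 21, 28   # initial, medial, final (final index 0 = no final)


def decompose(syllable):
    """Return (initial, medial, final) jamo indices of one precomposed Hangul syllable."""
    code = ord(syllable) - HANGUL_BASE
    if not 0 <= code < N_CHO * N_JUNG * N_JONG:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    cho, rest = divmod(code, N_JUNG * N_JONG)
    jung, jong = divmod(rest, N_JONG)
    return cho, jung, jong


def multi_hot(syllable):
    """Encode a syllable as a vector with one active bit per jamo slot.

    The concatenated 19 + 21 + 28 layout is an assumption of this sketch.
    """
    cho, jung, jong = decompose(syllable)
    vec = [0] * (N_CHO + N_JUNG + N_JONG)
    vec[cho] = 1
    vec[N_CHO + jung] = 1
    vec[N_CHO + N_JUNG + jong] = 1
    return vec


print(decompose("한"), sum(multi_hot("한")))
```

Compared with a one-hot vector over all 11,172 precomposed syllables, a layout like this is far smaller and lets syllables that share jamo share active bits.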


Study on the Hangul typeface of the decentralized density through the horizontal disposition of phoneme. (Hangul typeface for New Hangul Code) (음소의 가로선형 배열을 통한 밀도 분산형 한글꼴연구 ( 새로운 음소형 코드체계를 위한 한글꼴 ))

  • Moon, Souk-Bae
    • Annual Conference on Human and Language Technology / 1994.11a / pp.223-230 / 1994
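  • This work is an experimental study of new forms of Hangul typeface, including a density-distributed typeface based on a double horizontal linear arrangement of Hangul phonemes and a phoneme-sequence typeface, aimed at improving the visibility of syllables and phonemes. The density-distributed typeface is a unified typeface designed to correspond to a new phoneme-based Hangul code (combining initial consonants, vowels, and final consonants), and it supports input and output of phoneme-composed modern Hangul and old Hangul. This attempt implements modern and old Hangul within 1 byte, and by showing that a phoneme-based code system following the construction principle of Hangul is feasible, it presents a new hypothesis on optimizing the Hangul code system.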
