Vocabulary Coverage Improvement for Embedded Continuous Speech Recognition Using Knowledgebase

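  • Kwang-Ho Kim (Department of Computer Engineering, Sogang University) ;
  • Minkyu Lim (Department of Computer Engineering, Sogang University) ;
  • Ji-Hwan Kim (Department of Computer Engineering, Sogang University)
  • Published : 2008.12.30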

Abstract

In this paper, we propose a method for improving vocabulary coverage in embedded continuous speech recognition (CSR) using a knowledge base. A CSR vocabulary is normally derived from a word frequency list, so its coverage depends on the corpus from which the frequencies were counted. In our previous research, we presented an improved method of vocabulary generation using a part-of-speech (POS) tagged corpus. We analyzed all words paired with 101 of the 152 POS tags and decided on a set of words that must be included in a vocabulary of any size. However, for the other 51 POS tags (e.g. nouns, verbs), the inclusion of words paired with those tags is still based on word frequencies counted on a corpus. In this paper, we propose a corpus-independent word inclusion method for noun-, verb-, and named entity (NE)-related POS tags using a knowledge base. For noun-related POS tags, we generate synonym groups and analyze their relative importance using Google search. For verbs, we categorize words by lemma and analyze the relative importance of each lemma using pre-analyzed verb statistics. For NEs, we determine the inclusion order through Google search. The proposed method shows better coverage on a short message service (SMS) test text corpus.
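To make the corpus dependence described above concrete, the following is a minimal sketch of a frequency-derived vocabulary and a coverage measurement. It is not the authors' implementation. It assumes coverage is measured as the fraction of test tokens found in the vocabulary (the abstract does not define the measure), and the names `build_vocabulary`, `coverage`, and `must_include` are hypothetical. The `must_include` list stands in for words chosen independently of the corpus, such as those ranked with a knowledge base.

```python
from collections import Counter


def build_vocabulary(train_tokens, size, must_include=()):
    """Build a vocabulary of at most `size` words.

    Words in `must_include` (hypothetical: chosen independently of the
    corpus) are added first; remaining slots are filled by descending
    frequency in the training corpus.
    """
    vocab = list(dict.fromkeys(must_include))[:size]  # dedupe, keep order
    chosen = set(vocab)
    for word, _count in Counter(train_tokens).most_common():
        if len(vocab) >= size:
            break
        if word not in chosen:
            vocab.append(word)
            chosen.add(word)
    return set(vocab)


def coverage(vocab, test_tokens):
    """Fraction of test tokens in the vocabulary (assumed definition)."""
    if not test_tokens:
        return 0.0
    return sum(token in vocab for token in test_tokens) / len(test_tokens)


# Toy data, invented for illustration only.
train = "the cat sat on the mat the cat ran".split()
test = "a cat sat on a mat".split()

baseline = build_vocabulary(train, size=3)
forced = build_vocabulary(train, size=3, must_include=["a"])

print(coverage(baseline, test))
print(coverage(forced, test))
```

In this toy case the word "a" never occurs in the training text, so a purely frequency-based vocabulary cannot contain it, whereas a word list chosen without reference to the corpus can.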

Keywords