• Title/Summary/Keyword: word Weighting

Style-Specific Language Model Adaptation using TF*IDF Similarity for Korean Conversational Speech Recognition

  • Park, Young-Hee;Chung, Min-Hwa
    • The Journal of the Acoustical Society of Korea, v.23 no.2E, pp.51-55, 2004
  • In this paper, we propose a style-specific language model adaptation scheme using n-gram based tf*idf similarity for Korean spontaneous speech recognition. Korean spontaneous speech shows distinctive style-specific characteristics such as filled pauses, word omission, and contraction, which are related to function words and depend on the preceding or following words. To reflect these style-specific characteristics and to overcome the lack of data for training the language model, we estimate an in-domain dependent n-gram model by relevance weighting of out-of-domain text data according to their n-gram based tf*idf similarity, where the in-domain language model includes a disfluency model. Recognition results show that n-gram based tf*idf similarity weighting effectively reflects style differences.
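
The relevance weighting described in this abstract can be sketched roughly as follows. The abstract does not give the exact formula, so this is only an illustration under stated assumptions: word bigrams as the n-grams, a smoothed idf, and cosine similarity as the relevance weight. All function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n=2):
    """Return the word n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs, n=2):
    """Build one sparse tf*idf vector (a dict) per document over word n-grams."""
    counts = [Counter(ngrams(doc, n)) for doc in docs]
    doc_freq = Counter(g for c in counts for g in c)  # documents containing each n-gram
    total = len(docs)
    # Assumption: smoothed idf, so n-grams present in every document keep a nonzero weight.
    return [
        {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1.0) for g, tf in c.items()}
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def relevance_weights(in_domain, out_of_domain, n=2):
    """Weight each out-of-domain document by its similarity to the in-domain text."""
    vectors = tfidf_vectors([in_domain] + out_of_domain, n)
    return [cosine(vectors[0], v) for v in vectors[1:]]
```

In such a scheme, the resulting weights would scale the n-gram counts contributed by each out-of-domain document when the adapted language model is estimated.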

Design of Big Data Preference Analysis System (빅데이터 선호도 분석 시스템 설계)

  • Son, Sung Il;Park, Chan Khon
    • Journal of Korea Multimedia Society, v.17 no.11, pp.1286-1295, 2014
  • This paper proposes a way to improve the reliability of preference estimates derived from user feedback by adding weighting factors to sentiment analysis, and to efficiently analyze the sentiment of users' opinions in the large volume of data generated on Twitter. To address errors in earlier studies, the paper improves the recall and precision of sentiment determination by using a sentiment dictionary in which sentiment polarity is subdivided by sentiment level, and it strengthens sentiment determination by adding slang, new words, emoticons, and idiomatic expressions that are not in the system dictionary. It takes context into account through conjunctive adverbs, reflecting the relatively free word order of Korean. It also identifies sentiment words and uses TF (Term Frequency), RT (Retweet), and Follower counts as weighting factors of preference, which increases the reliability of preference analysis by giving weight to 'a very emotional tweet', 'a tweet recognised by users', and 'an influential tweeter'.

Thematic Word Extraction from Book Based on Keyword Weighting Method (키워드 가중치 방식에 근거한 도서 본문 주제어 추출)

  • Ahn, Hee-Jeong;Choi, Gun-Hee;Kim, Seung-Hoon
    • Proceedings of the Korean Society of Computer Information Conference, 2015.01a, pp.19-22, 2015
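  • In this paper, we propose a method for extracting thematic words from the body text of books based on weights assigned according to the role of keywords in sentences and paragraphs. Existing thematic word extraction methods target newspapers or research papers rather than book text, so they are difficult to apply directly to book text. We therefore propose a method that assigns each candidate keyword not only its frequency but also a weight for important elements within a sentence and a weight for important sentences. Experiments applying the proposed calculation to non-fiction books confirmed that the proposed method extracts thematic words more accurately than the existing method, which relies on frequency alone.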

A study on the speech recognition by HMM based on multi-observation sequence (다중 관측열을 토대로한 HMM에 의한 음성 인식에 관한 연구)

  • 정의봉
    • Journal of the Korean Institute of Telematics and Electronics S, v.34S no.4, pp.57-65, 1997
  • The purpose of this paper is to propose an HMM (hidden Markov model) based on multi-observation sequences for isolated word recognition. The proposed model generates the MSVQ codebook by dividing each word into several sections and then dividing the training data into several sections. We then obtain the multi-observation sequence for each section by weighting the distance vectors from lower values to higher ones. Thereafter, the sequence with the highest probability value is selected during recognition. 146 DDD area names are selected as the target vocabulary, and 10 LPC cepstrum coefficients are used as the feature parameters. In addition to the speech recognition experiments with the proposed model, experiments with DP, MSVQ, and a general HMM are carried out for comparison on the same data under the same conditions. The experimental results show that the HMM based on multi-observation sequences proposed in this paper is superior to the DP, MSVQ, and general HMM methods in both recognition rate and recognition time.

An Automatic Classification System of Official Documents in Middle Schools Using Term Weighting of Titles (제목의 단어 가중치를 이용한 중등학교 공문서 자동분류시스템)

  • Kang, Hyun-Hee;Jin, Min
    • Journal of The Korean Association of Information Education, v.7 no.2, pp.219-226, 2003
  • Classifying official documents in schools and educational institutions takes a lot of time. To reduce this overhead, we propose an automatic document classification method that uses word information from document titles. First, meaningful words are extracted from the titles of existing documents, and Inverse Document Frequency (IDF) weights of the words are calculated for each category. Then we build a word weight dictionary. Using this dictionary, each document is automatically classified into the category for which the sum of the weights of its title words is highest. We also evaluate the performance of the proposed method using a real dataset from a middle school.

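
The title-based classification described in this abstract might look like the sketch below. The abstract does not specify how the IDF is computed "against each category", so the code assumes an IDF taken over categories: a word's weight is log(number of categories / number of categories whose titles contain the word). Whitespace tokenization stands in for the extraction of meaningful words, and the function names and sample categories are hypothetical.

```python
import math
from collections import defaultdict


def build_weight_dictionary(titles_by_category):
    """Map (word, category) to a weight, using labelled document titles.

    Assumption: the weight is an IDF over categories, so a word that occurs
    in every category gets weight 0 and a category-specific word gets the
    highest weight.
    """
    category_words = {
        category: {word for title in titles for word in title.lower().split()}
        for category, titles in titles_by_category.items()
    }
    category_freq = defaultdict(int)  # number of categories containing each word
    for words in category_words.values():
        for word in words:
            category_freq[word] += 1
    n_categories = len(category_words)
    return {
        (word, category): math.log(n_categories / category_freq[word])
        for category, words in category_words.items()
        for word in words
    }


def classify(title, weights, categories):
    """Return the category with the highest summed title-word weight, or None."""
    words = title.lower().split()
    scores = {c: sum(weights.get((w, c), 0.0) for w in words) for c in categories}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0.0 else None


# Hypothetical training titles, for illustration only.
training = {
    "personnel": ["teacher transfer notice", "teacher leave request"],
    "finance": ["budget execution notice", "budget report request"],
    "facilities": ["building repair notice"],
}
weight_dictionary = build_weight_dictionary(training)
```

A call such as `classify("annual budget report", weight_dictionary, list(training))` sums the weights of the known title words per category and picks the largest total. Unknown words contribute nothing.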

A Study of Efficiency Information Filtering System using One-Hot Long Short-Term Memory

  • Kim, Hee sook;Lee, Min Hi
    • International Journal of Advanced Culture Technology, v.5 no.1, pp.83-89, 2017
  • In this paper, we propose an extended one-hot Long Short-Term Memory (LSTM) method and evaluate its performance on a spam filtering task. Most traditional methods proposed for spam filtering use word occurrences to represent spam and non-spam messages and ignore all syntactic and semantic information. A major issue arises when spam and non-spam messages share many common words and noise words, which makes it challenging for the system to assign the correct label. Unlike previous studies on information filtering, which use only word occurrence and word context as in probabilistic models, we apply a neural network-based approach to train the filter for better performance. In addition to the one-hot representation, using term weights with an attention mechanism allows the classifier to focus on words that are most likely to appear in the spam or non-spam collection. As a result, we obtained some improvement over the performance of previous methods. We find that using region embedding and pooling features on top of the LSTM, along with the attention mechanism, allows the system to learn a better document representation for filtering tasks in general.

A Document Sentiment Classification System Based on the Feature Weighting Method Improved by Measuring Sentence Sentiment Intensity (문장 감정 강도를 반영한 개선된 자질 가중치 기법 기반의 문서 감정 분류 시스템)

  • Hwang, Jae-Won;Ko, Young-Joong
    • Journal of KIISE:Software and Applications, v.36 no.6, pp.491-497, 2009
  • This paper proposes a new feature weighting method for document sentiment classification. The proposed method considers the differences in sentiment intensity among the sentences in a document. Sentiment features consist of sentiment vocabulary words, and their sentiment intensity scores are estimated by the chi-square statistic. The sentiment intensity of each sentence can be measured using the chi-square value obtained for each sentiment feature. The calculated intensity values of each sentence are finally applied to the TF-IDF weighting of all features in the document. We evaluate the proposed method using a support vector machine. Our experimental results show that the proposed method performs about 2.0% better than the baseline, which does not consider the sentiment intensity of a sentence.
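
The weighting idea in this abstract can be illustrated with the sketch below. The chi-square statistic for a 2x2 contingency table is standard, but the abstract does not say how sentence intensities enter the TF-IDF weights, so the combination rule here (each occurrence scaled by 1 plus the sentence's share of the document's total intensity) is an assumption, as are the function names and the default idf of 1.0.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for a term/class 2x2 contingency table.

    a: positive documents containing the term, b: negative documents containing it,
    c: positive documents without it,          d: negative documents without it.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denominator if denominator else 0.0


def sentence_intensity(tokens, chi_scores):
    """Sum the chi-square scores of the sentiment words in one sentence."""
    return sum(chi_scores.get(token, 0.0) for token in tokens)


def intensity_weighted_tfidf(sentences, chi_scores, idf):
    """Compute TF-IDF-style weights where each occurrence is scaled by sentence intensity.

    Assumption: the scale is 1 + (sentence intensity / total document intensity),
    and terms missing from `idf` fall back to an idf of 1.0.
    """
    intensities = [sentence_intensity(s, chi_scores) for s in sentences]
    total = sum(intensities)
    weights = Counter()
    for tokens, intensity in zip(sentences, intensities):
        scale = 1.0 + (intensity / total if total else 0.0)
        for token in tokens:
            weights[token] += scale * idf.get(token, 1.0)
    return dict(weights)
```

With this rule, a neutral word inside a strongly sentiment-bearing sentence receives a larger weight than the same kind of word in a neutral sentence. The resulting vectors could then be passed to any classifier, such as the support vector machine mentioned in the abstract.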

Cognitive Modeling of Unusual Association with Declarative Knowledge by Positive Affect (긍정적 감정에 따른 선언적 지식에 관한 비전형적 연상 과정에 대한 인지모델링)

  • Park, Sung-Jin;Myung, Ro-Hae
    • Journal of Korean Institute of Industrial Engineers, v.41 no.1, pp.43-49, 2015
  • The aim of this study was to model the effect of positive affect on unusual associations with declarative knowledge using the ACT-R cognitive architecture. Existing research on cognitive modeling tends to pay a lot of attention to strong, negative cognitive moderators. Mild positive affect, however, has far-reaching effects on problem solving and decision making. Typically, subjects with positive affect are more likely to respond with unusual associates in a word association task than subjects with neutral affect. In this study, a cognitive model using the ACT-R cognitive architecture was developed to show the effect of positive affect on the cognitive organization of memory. First, we organized the memory structure of the stimulus word 'palm' based on published results from a word association task. Then, to represent the memory structure as reorganized by positive affect, we decreased the ACT-R parameter that reflects the amount of weighting given to the dissimilarity between the stimulus word and the associate word. As a result, no significant difference in associate probability was found between the model prediction and existing empirical data. The ACT-R cognitive architecture could therefore be used to model the effect of positive affect on unusual association by decreasing the weight of the dissimilarity. This study is useful for model-based evaluation of the effects of positive affect in complex tasks involving memory, such as creative problem solving.

Knowledge-poor Term Translation using Common Base Axis with application to Korean-English Cross-Language Information Retrieval (과도한 지식을 요구하지 않는 공통기반축에 의한 용어 번역과 한영 교차정보검색에의 응용)

  • 최용석;최기선
    • Korean Journal of Cognitive Science, v.14 no.1, pp.29-40, 2003
  • Cross-Language Information Retrieval (CLIR) retrieves documents in various languages using a query in a single language. Through a CLIR system, a user of one language can retrieve documents written in another language. In CLIR, query translation is known to be the more efficient approach. Better query translation requires more resources, such as dictionaries, ontologies, and parallel or comparable corpora, but these are usually not available. This paper proposes a new concept called the Common Base Axis, which is applied to Korean-English query translation, and a new weighting method for dictionary-based query translation. The essential idea is that Korean and English words can be expressed in one vector space by means of the Common Base Axis, which is then used to calculate sense distance for query weighting. The experiments show that the Common Base Axis gives good performance without an ontology and is especially effective for one-word query translation.

A Study on the Deduction of Social Issues Applying Word Embedding: With an Empasis on News Articles related to the Disables (단어 임베딩(Word Embedding) 기법을 적용한 키워드 중심의 사회적 이슈 도출 연구: 장애인 관련 뉴스 기사를 중심으로)

  • Choi, Garam;Choi, Sung-Pil
    • Journal of the Korean Society for information Management, v.35 no.1, pp.231-250, 2018
  • In this paper, we propose a new methodology for extracting and formalizing subjective topics at a specific time using a set of keywords extracted automatically from online news articles. To do this, we first extracted a set of keywords with a TF-IDF method, selected through a series of comparative experiments on various statistical weighting schemes that measure the importance of individual words in a large set of texts. To calculate the semantic relations between the extracted keywords effectively, a set of word embedding vectors was constructed from about 1,000,000 separately collected news articles. The extracted keywords were represented as numerical vectors and clustered with the K-means algorithm. A qualitative in-depth analysis of the resulting keyword clusters found that most clusters were appropriate topics with sufficient semantic concentration for labels to be assigned to them easily.
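
The final clustering step mentioned in this abstract can be sketched with a plain K-means (Lloyd's algorithm) over keyword vectors. This is a minimal illustration, not the authors' pipeline: the toy 2-D vectors stand in for real word embeddings, and seeding the centroids with the first k points is an assumption made here to keep the example deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, max_iterations=100):
    """Cluster vectors with Lloyd's algorithm; return (labels, centroids).

    Assumption: centroids are seeded with the first k points instead of a
    random or k-means++ initialization.
    """
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(column) / len(members) for column in zip(*members)]
    return labels, centroids


# Toy keyword vectors (hypothetical), standing in for word embeddings.
keyword_vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = kmeans(keyword_vectors, k=2)
```

Each resulting cluster would then be inspected and labelled as a candidate topic, as the abstract describes.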