• Title/Summary/Keyword: 어휘 필터 (vocabulary filter)


A Bloom filter-based Sentiment-aware Web Crawling Algorithm (블룸 필터를 이용한 감성 웹 문서 크롤링 알고리즘)

  • Na, Chul-Won;On, Byung-Won
    • Annual Conference on Human and Language Technology / 2018.10a / pp.69-74 / 2018
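  • With recent advances in big data and artificial intelligence, research on sentiment analysis has become increasingly active, and so has the need to collect text documents rich in positive/negative vocabulary for such analysis. This paper proposes a solution to the problems of existing methods for collecting these documents. If a crawler first stores every URL and only then filters for documents rich in positive/negative vocabulary, it wastes memory and time on storing unnecessary documents and on the filtering step. We apply a data structure called a Bloom filter to the existing collection method to address this waste of memory and time.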

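A Bloom filter of the kind mentioned in the abstract above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the class name `BloomFilter`, the bit-array size, the number of hash functions, the toy lexicon, and the idea of testing page tokens against a sentiment lexicon held in the filter are all hypothetical. The abstract says only that a Bloom filter is applied to the existing collection method.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with k hash positions per item (illustrative sketch)."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def sentiment_ratio(tokens, lexicon_filter):
    """Fraction of tokens that the filter reports as sentiment words."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in lexicon_filter) / len(tokens)


lexicon = BloomFilter()
for word in ["good", "great", "bad", "terrible"]:  # toy lexicon (assumed)
    lexicon.add(word)

page_tokens = "the food was great but the service was terrible".split()
# The 0.2 threshold is an arbitrary value chosen for this example.
keep_page = sentiment_ratio(page_tokens, lexicon) >= 0.2
```

Because membership tests touch only a few bits, a page can be accepted or discarded while it is being crawled rather than stored first and filtered later. The trade-off is that a Bloom filter can report false positives but never false negatives.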

Analysis of filtering performance of Korean and English spam-mails (한국어와 영어 스팸메일의 필터링 성능 분석)

  • Hwang Wun-Ho;Kang Sin-Jae;Kim Tae-Hee;Kim Hee-Jae;Kim Jong-Wan
    • Proceedings of the Korea Society for Industrial Systems Conference / 2006.05a / pp.389-396 / 2006
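  • In this study, we build a two-stage spam-mail filtering system for Korean and English mail and evaluate its performance. The system consists of a first stage that uses a blacklist and a second stage that performs intelligent classification through machine learning. If a newly arrived mail contains blacklisted content, it is classified as spam; otherwise it passes to the second stage, which decides whether it is spam. To separate English spam mail (mail whose body is written in English) from legitimate mail, we first extract normalized lexical information from the body using stemming and stopping (stop-word removal). We then build feature vectors from the extracted lexical information and train an SVM, producing an SVM classifier that performs intelligent spam filtering. The performance of the filtering system depends on how the features underlying the feature vectors are selected. We therefore apply several feature-selection algorithms when building the feature vectors for SVM training and compare their performance. Finally, we compare the overall performance of the English spam-mail filtering system with that of the Korean one.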

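The two-stage flow described in the abstract above (blacklist first, learned classifier second) can be sketched roughly as follows. The function names, the mail dictionary layout, the example addresses, and the keyword-counting `toy_classifier` are hypothetical. In particular, the second stage here is a stand-in for the SVM classifier the abstract describes, not an SVM.

```python
def two_stage_filter(mail, blacklist, classify):
    """Return 'spam' or 'ham' for a mail dict with 'sender' and 'body' keys."""
    text = f"{mail['sender']} {mail['body']}".lower()
    # Stage 1: any blacklist entry found in the mail marks it as spam.
    if any(entry in text for entry in blacklist):
        return "spam"
    # Stage 2: defer to a trained classifier (an SVM in the paper).
    return "spam" if classify(mail["body"]) else "ham"


def toy_classifier(body):
    # Placeholder for the SVM: flags bodies with two or more suspicious words.
    suspicious = {"free", "winner", "prize"}
    return sum(word in suspicious for word in body.lower().split()) >= 2


blacklist = {"spammer@example.com", "cheap-pills.example"}
mails = [
    {"sender": "spammer@example.com", "body": "hello"},
    {"sender": "a@example.org", "body": "You are a winner claim your free prize"},
    {"sender": "b@example.org", "body": "Meeting moved to 3 pm"},
]
labels = [two_stage_filter(m, blacklist, toy_classifier) for m in mails]
```

Putting the cheap blacklist check first means the more expensive classifier only runs on mail that the blacklist cannot decide.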

Spam-mail Filtering based on Lexical Information and Thesaurus (어휘정보와 시소러스에 기반한 스팸메일 필터링)

  • Kang Shin-Jae;Kim Jong-Wan
    • Journal of Korea Society of Industrial Information Systems / v.11 no.1 / pp.13-20 / 2006
  • In this paper, we constructed a spam-mail filtering system based on lexical and conceptual information. Two kinds of information can distinguish spam mail from legitimate mail. The definite information consists of the mail sender's information, URLs, and a list of certain spam keywords; the less definite information consists of the word lists and concept codes extracted from the mail body. We first classified spam mail using the definite information and then used the less definite information. The lexical information and concept codes contained in the email body were used for SVM learning. According to our results, spam precision increased when more lexical information was used as features, and spam recall increased when the concept codes were also included as features.

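For readers unfamiliar with the two measures reported in the abstract above, spam precision and spam recall can be computed as in the sketch below. The function name and the toy label lists are assumptions made for illustration; they are not data from the paper.

```python
def precision_recall(predicted, actual, positive="spam"):
    """Precision and recall for the `positive` class over paired label lists."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels (made up): 2 true positives, 1 false positive, 1 false negative.
predicted = ["spam", "spam", "ham", "ham", "spam"]
actual = ["spam", "ham", "spam", "ham", "spam"]
precision, recall = precision_recall(predicted, actual)
```

Precision measures how many of the mails flagged as spam really are spam, while recall measures how many of the real spam mails were caught.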

A study of speaker dependent speech recognition using neural network (신경회로망을 이용한 화자종속 음성인식 성능에 관한 연구)

  • 윤지원;이종수
    • Proceedings of the Korean Institute of Intelligent Systems Conference / 2003.05a / pp.153-156 / 2003
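  • The aim of this study is to improve the performance of speaker-dependent, small-vocabulary speech recognition. To obtain the speech features used for recognition, a 12th-order pattern per frame was extracted using a Wiener filter and LPC & Cepstrum. The recognizer for the extracted feature patterns combines a conventional backpropagation neural network, which performs particularly well on small-vocabulary speech recognition, with a fuzzy inference system intended to improve the recognition rate. Experiments showed that the recognition rate improved compared with using the neural network alone.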


Contents-Based Korean SMS Spam Filtering Using Morpheme Unit Features (형태소 단위 자질을 이용한 콘텐츠 기반 한국어 SMS 스팸 필터링)

  • Sohn, Dae-Neung;Shin, Joong-Hwi;Lee, Jung-Tae;Lee, Seung-Wook;Rim, Hae-Chang
    • Annual Conference on Human and Language Technology / 2008.10a / pp.195-200 / 2008
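  • This paper proposes a probability-based Korean SMS spam filtering technique that uses morphological analysis. Previous work includes English and Spanish SMS spam filtering methods that use word- and character-level lexical information as features. In Korean, however, which is an agglutinative language, a wide variety of eojeol (space-delimited word units) can be formed by combining roots and affixes. Using eojeol-level lexical information as features therefore leads to an out-of-vocabulary problem. The problem is especially severe for SMS messages, which consist of very few words. This paper addresses the problem through morphological analysis. In experiments, the proposed method improved spam classification precision by 10.6% compared with previous work, and the number of SMS messages containing only out-of-vocabulary words fell by about 77%.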


Vocabulary Recognition Model using a convergence of Likelihood Principla Bayesian methode and Bhattacharyya Distance Measurement based on Vector Model (벡터모델 기반 바타챠랴 거리 측정 기법과 우도 원리 베이시안을 융합한 어휘 인식 모델)

  • Oh, Sang-Yeob
    • Journal of Digital Convergence / v.13 no.11 / pp.165-170 / 2015
  • A vocabulary recognition system built by recognizing a standard vocabulary shows a decline in recognition for words that are outside the standard or similar to one another. Existing systems use, for vocabulary recognition, the vector values of a model created by configuring a database. A model formed during the search for the recognition vocabulary has the disadvantage that it is not configured in the database. In this paper, the vector model formed by search and configuration is recognized using a Bayesian model with Bhattacharyya distance measurement based on the vector model, and a Wiener filter is applied to improve the recognition rate. Combining the two methods improved reliability in the distance-measurement experiments. Compared with the conventional method, the proposed measurement exhibited a performance of 98.2%.
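The Bhattacharyya distance named in the abstract above has a standard form for discrete distributions: the Bhattacharyya coefficient is BC = Σ sqrt(p_i · q_i), and the distance is D_B = −ln(BC). The sketch below implements that textbook definition. How the paper applies it to its vector model is not stated in the abstract, so the function name and the toy histograms are assumptions.

```python
import math


def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two non-negative histograms of equal length."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same length")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive mass")
    # Normalize each histogram to a probability distribution, then sum sqrt(p_i * q_i).
    coefficient = sum(
        math.sqrt((a / total_p) * (b / total_q)) for a, b in zip(p, q)
    )
    # No overlap at all gives coefficient 0, i.e. infinite distance.
    return math.inf if coefficient == 0 else -math.log(coefficient)


same = bhattacharyya_distance([1, 2, 3], [2, 4, 6])    # same shape after normalizing
disjoint = bhattacharyya_distance([1, 0], [0, 1])      # no shared mass
partial = bhattacharyya_distance([3, 1, 0], [1, 1, 2])  # some overlap
```

A smaller distance means the two distributions overlap more, so a candidate word could be matched to whichever stored model gives the smallest distance.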

A Fashion Design Recommender Agent System using Collaborative Filtering and Sensibilities related to Textile Design Factors (텍스타일 기반의 협력적 필터링 기술과 디자인 요소에 따른 감성 분석을 이용한 패션 디자인 추천 에이전트 시스템)

  • 정경용;나영주;이정현
    • Journal of KIISE:Computing Practices and Letters / v.10 no.2 / pp.174-188 / 2004
  • As living environments change with material abundance as well as with the quality and price of products, investigating consumers' sensibility and degree of preference has become the most crucial factor in product sales strategy. From this perspective, products need to be designed and merchandised to match each consumer's sensibility and needs as well as their functional aspects. In this paper, we propose the Fashion Design Recommender Agent System (FDRAS-pro) for textile design, which applies a collaborative filtering personalization technique as one method of material development centered on consumers' sensibility and preference. For the textile-based collaborative filtering system, Representative-Attribute Neighborhood is adopted to determine the number of neighbors used for preference estimation, and Pearson's Correlation Coefficient is used to calculate similarity weights among users. We build a database of sensibility adjectives for developing textile designs by extracting representative sensibility adjectives from users' sensibility and preferences about textile designs. FDRAS-pro recommends textile designs to a customer who has a similar propensity about textiles. To investigate sensibility and emotion according to the effect of design factors, textile designs were analyzed in terms of 9 design factors: motif source, motif-background ratio, motif variation, motif interpretation, motif arrangement, motif articulation, hue contrast, value contrast, and chroma contrast. Finally, we plan to conduct empirical applications to verify the adequacy and validity of our system.
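The user-to-user similarity weight mentioned in the abstract above, Pearson's correlation coefficient, can be sketched as follows. This is the standard formula computed over items both users have rated. The function name, the rating dictionaries, and the textile identifiers are hypothetical, and the sketch does not cover the Representative-Attribute Neighborhood step.

```python
import math


def pearson_similarity(ratings_a, ratings_b):
    """Pearson correlation over the items that both users have rated."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0  # not enough overlap to estimate a correlation
    xs = [ratings_a[item] for item in common]
    ys = [ratings_b[item] for item in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a user with constant ratings has no defined correlation
    return covariance / math.sqrt(var_x * var_y)


# Toy preference ratings for three textile designs (made-up values).
alice = {"textile_1": 5, "textile_2": 3, "textile_3": 1}
bob = {"textile_1": 4, "textile_2": 3, "textile_3": 2}
carol = {"textile_1": 1, "textile_2": 3, "textile_3": 5}

sim_alice_bob = pearson_similarity(alice, bob)
sim_alice_carol = pearson_similarity(alice, carol)
```

Users with weights near 1 share similar tastes and are natural neighbors for preference estimation, while weights near -1 indicate opposite tastes.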

Recognition of Answer Type for WiseQA (WiseQA를 위한 정답유형 인식)

  • Heo, Jeong;Ryu, Pum Mo;Kim, Hyun Ki;Ock, Cheol Young
    • KIPS Transactions on Software and Data Engineering / v.4 no.7 / pp.283-290 / 2015
  • In this paper, we propose a hybrid method for recognizing answer types in the WiseQA system. Answer types are classified into two categories: the lexical answer type (LAT) and the semantic answer type (SAT). We propose two models for LAT detection: a rule-based model using question focuses and a machine learning model based on sequence labeling. We also propose two models for SAT classification: a machine learning model based on multiclass classification and a filtering-rule model based on the lexical answer type. LAT detection and SAT classification show an F1-score of 82.47% and a precision of 77.13%, respectively. Compared with IBM Watson on LAT performance, precision is 1.0% lower and recall is 7.4% higher.

A Study on the Retrieval Effectiveness of KoreaMed using MeSH Search Filter and Word-Proximity Search (검색용 MeSH 필터와 단어인접탐색 기법을 활용한 KoreaMed 검색 효율성 향상 연구)

  • Jeong, So-Na;Jeong, Ji-Na
    • Journal of the Korea Academia-Industrial cooperation Society / v.18 no.5 / pp.596-607 / 2017
  • This study examined a method for adding terms related to "stomach neoplasms" as filters to the Medical Subject Headings (MeSH) used for search, as well as a method for improving search efficiency through a word-proximity search that measures the distance between co-occurring terms. A total of 8,625 articles published between 2007 and 2016 with the major topic term "stomach neoplasms" were downloaded from PubMed article titles. The vocabulary to be added to the MeSH for search was analyzed. The search efficiency was verified with 277 articles that had "Stomach Neoplasms" indexed as MEDLINE MeSH in KoreaMed. As a result, 973 terms were selected as the candidate vocabulary. "Gastric Cancer" (2,780 appearances) was the most frequent term, and 7,376 compound words (88.51%) combined the histological terms of "stomach" and "neoplasm", such as "gastric adenocarcinoma" and "gastric MALT lymphoma". A total of 5,234 compound words (70.95%), in which the co-occurring distance was two words, were found. The matching rate through the MEDLINE MeSH and KoreaMed MeSH Indexer was 209 articles (75.5%). The search efficiency improved to 263 articles (94.9%) when the search filters were added, and to 268 articles (96.7%) when the 13 word-proximity search technique of the co-occurring terms was applied. This study showed that using a thesaurus to improve search efficiency in a natural language search could maintain the advantages of a controlled vocabulary. Search accuracy can be improved by using the word-proximity search instead of a Boolean search.
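A word-proximity search of the kind described in the abstract above can be sketched as follows. The abstract does not define how distance is measured, so this sketch assumes whitespace tokenization and takes distance to be the difference between token positions. The function name and the two example titles are made up for illustration.

```python
def within_proximity(text, term_a, term_b, max_distance):
    """True if term_a and term_b occur at most max_distance token positions apart."""
    tokens = text.lower().split()
    positions_a = [i for i, token in enumerate(tokens) if token == term_a.lower()]
    positions_b = [i for i, token in enumerate(tokens) if token == term_b.lower()]
    return any(abs(i - j) <= max_distance for i in positions_a for j in positions_b)


# Made-up titles: both contain the two terms, but only the first has them close together.
titles = [
    "Gastric MALT lymphoma in young adults",
    "Lymphoma staging and its relation to gastric surgery outcomes",
]
boolean_hits = [
    t for t in titles if "gastric" in t.lower().split() and "lymphoma" in t.lower().split()
]
proximity_hits = [t for t in titles if within_proximity(t, "gastric", "lymphoma", 2)]
```

In this toy example a Boolean AND matches both titles, while the proximity constraint keeps only the title where the two terms form a compound expression.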

Speech Recognition Performance Improvement using a convergence of GMM Phoneme Unit Parameter and Vocabulary Clustering (GMM 음소 단위 파라미터와 어휘 클러스터링을 융합한 음성 인식 성능 향상)

  • Oh, SangYeob
    • Journal of Convergence for Information Technology / v.10 no.8 / pp.35-39 / 2020
  • Although DNNs have a smaller error than conventional speech recognition systems, they are difficult to train in parallel, involve a large amount of computation, and require a large amount of data. In this paper, to solve this problem efficiently, we generate phoneme units and estimate the GMM parameters from the parameters of each phoneme model. We also suggest a way to improve performance through clustering for a specific vocabulary so that these parameters can be applied effectively. To this end, a vocabulary model was built from a database of three types of word speech, and Wiener filters were used for noise processing during feature extraction in the speech recognition experiments. The proposed method showed a 97.9% recognition rate in speech recognition. Additional studies are needed to address the problem of overfitting.