• Title/Summary/Keyword: Conditional Entropy


A probabilistic information retrieval model by document ranking using term dependencies (용어간 종속성을 이용한 문서 순위 매기기에 의한 확률적 정보 검색)

  • You, Hyun-Jo;Lee, Jung-Jin
    • The Korean Journal of Applied Statistics / v.32 no.5 / pp.763-782 / 2019
  • This paper proposes a probabilistic document ranking model that incorporates term dependencies. Document ranking is a fundamental information retrieval task: sorting the documents in a collection according to their relevance to the user query (Qin et al., Information Retrieval Journal, 13, 346-374, 2010). A probabilistic model computes the conditional probability that each document is relevant given a query. Most widely used models assume term independence because it is challenging to compute the joint probabilities of multiple terms, even though words in natural language texts are highly correlated. In this paper, we assume a multinomial distribution model to calculate the relevance probability of a document by considering the dependency structure of words, and propose an information retrieval model that ranks documents by estimating this probability with the maximum entropy method. Ranking simulation experiments in various multinomial situations show better retrieval results than a model that assumes the independence of words. Document ranking experiments using the real-world LETOR OHSUMED dataset also show better retrieval results.

Word Sense Disambiguation using Korean Word Space Model (한국어 단어 공간 모델을 이용한 단어 의미 중의성 해소)

  • Park, Yong-Min;Lee, Jae-Sung
    • The Journal of the Korea Contents Association / v.12 no.6 / pp.41-47 / 2012
  • Various Korean word sense disambiguation methods have been proposed that use small-scale sense-tagged corpora and dictionary definitions to calculate entropy information, conditional probability, mutual information, etc. This paper proposes a method using a Korean Word Space model, which builds word vectors from a large-scale sense-tagged corpus and disambiguates word senses by calculating the similarity between the word vectors. An experiment with the Sejong morph sense-tagged corpus showed 94% precision for 200 sentences (583 word types), which is much better than other known methods.

A High Order Product Approximation Method based on the Minimization of Upper Bound of a Bayes Error Rate and Its Application to the Combination of Numeral Recognizers (베이스 에러율의 상위 경계 최소화에 기반한 고차 곱 근사 방법과 숫자 인식기 결합에의 적용)

  • Kang, Hee-Joong
    • Journal of KIISE:Software and Applications / v.28 no.9 / pp.681-687 / 2001
  • In order to raise class discrimination power by combining multiple classifiers under Bayesian decision theory, the upper bound of the Bayes error rate, which is bounded by the conditional entropy of a class variable given the decision variables obtained from training data samples, should be minimized. Wang and Wong proposed a tree-dependence first-order approximation scheme for a high-order probability distribution composed of the class and multiple feature pattern variables, in order to minimize the upper bound of the Bayes error rate. This paper presents an extended high-order product approximation scheme that handles dependencies of higher order than the first-order tree dependence, based on minimizing the upper bound of the Bayes error rate. Multiple recognizers for unconstrained handwritten numerals from CENPARMI were combined by the proposed approximation scheme using the Bayesian formalism, and high recognition rates were obtained.
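
The link between conditional entropy and the Bayes error rate mentioned in this abstract can be illustrated with a small sketch. It assumes the well-known Hellman-Raviv form of the bound, P_e ≤ ½·H(C|X) with entropy in bits, and a single discrete decision variable X; the abstract does not state which form of the bound the paper uses, and the function names and the toy joint distribution below are illustrative only.

```python
from math import log2

def bayes_error(joint):
    """Bayes error rate 1 - sum_x max_c p(c, x) for a joint table joint[c][x]."""
    n_x = len(joint[0])
    return 1.0 - sum(max(row[x] for row in joint) for x in range(n_x))

def conditional_entropy(joint):
    """H(C | X) in bits for a joint table joint[c][x]."""
    n_x = len(joint[0])
    h = 0.0
    for x in range(n_x):
        p_x = sum(row[x] for row in joint)  # marginal p(x)
        for row in joint:
            if row[x] > 0:
                h -= row[x] * log2(row[x] / p_x)  # -p(c, x) * log2 p(c | x)
    return h

# Hypothetical joint distribution of a binary class C (rows) and a
# binary classifier decision X (columns); not taken from the paper.
joint = [[0.4, 0.1],
         [0.1, 0.4]]

p_e = bayes_error(joint)
upper = 0.5 * conditional_entropy(joint)
print(f"Bayes error = {p_e:.3f}, entropy-based upper bound = {upper:.3f}")
```

Lowering H(C|X), for example by choosing a better approximation of the joint distribution of class and decisions, tightens this upper bound, which is the motivation the abstract describes.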


A Lower Bound for Performance of Group Testing Problems (그룹검사 문제에 대한 성능 하한치)

  • Seong, Jin-Taek
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology / v.11 no.5 / pp.572-578 / 2018
  • This paper considers group testing as a combinatorial problem. Group testing was first used to screen soldiers for syphilis infection during World War II and has long had an established academic basis. Recently, there has been much interest in related areas because the value of group testing has been rediscovered. Group testing amounts to finding a few defective samples among a large number of samples, which is similar to the inverse problem of compressed sensing. In this paper, we introduce the definition of group testing, and specify the classes of group testing and the bounds on its performance. In addition, we show a lower bound on the number of tests required to find defective samples, using the theorem from information theory that relates conditional entropy to the probability of error. We examine how our result differs from other related results.
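
As a rough illustration of this kind of bound, the sketch below assumes the theorem in question is Fano's inequality and uses its standard weakened form for N samples, K defectives, and binary test outcomes: T ≥ (1 − P_e)·log2 C(N, K) − 1. This is the textbook counting argument, not necessarily the exact bound derived in the paper, and the function name is hypothetical.

```python
from math import comb, log2

def fano_test_lower_bound(n_items, n_defective, p_error=0.0):
    """Fano-style lower bound on the number of binary tests.

    Assumes the defective set is uniform over all C(N, K) subsets, so
    H(X) = log2 C(N, K). Each binary test reveals at most 1 bit, and the
    weakened Fano inequality gives H(X | Y) <= 1 + P_e * log2 C(N, K).
    """
    log_m = log2(comb(n_items, n_defective))
    return max(0.0, (1.0 - p_error) * log_m - 1.0)

# Example: 100 samples with 2 defectives, at several tolerated error rates.
for p_e in (0.0, 0.1, 0.5):
    bound = fano_test_lower_bound(100, 2, p_e)
    print(f"P_e = {p_e}: at least {bound:.2f} tests")
```

The bound shrinks as the tolerated error probability grows, reflecting that fewer tests are needed when some misidentification is allowed.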

Discretization of Numerical Attributes and Approximate Reasoning by using Rough Membership Function (러프 소속 함수를 이용한 수치 속성의 이산화와 근사 추론)

  • Kwon, Eun-Ah;Kim, Hong-Gi
    • Journal of KIISE:Databases / v.28 no.4 / pp.545-557 / 2001
  • In this paper we propose a hierarchical classification algorithm based on the rough membership function, which can reason approximately about a new object. We use the fuzzy reasoning method, which substitutes fuzzy membership values for linguistic uncertainty and reasons approximately based on the composition of the membership values of the conditional attributes. Here we use the rough membership function instead of the fuzzy membership function, which removes the step in which a fuzzy algorithm using the fuzzy membership function produces fuzzy rules. In addition, we transform the information system into an understandable minimal decision information system. To do so, we study the discretization of continuous-valued attributes and propose a discretization algorithm based on the rough membership function and the entropy of information theory. Tests show that it produces a good partition, yielding a smaller decision system. We ran experiments on the IRIS data and other datasets using the proposed algorithm. The experimental results on the IRIS data show a classification rate of 96% to 98%.
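
For readers unfamiliar with the rough membership function, the sketch below uses its standard rough-set definition: for an object x and a target class X, the membership is |[x] ∩ X| / |[x]|, where [x] is the set of objects that share x's (discretized) condition-attribute values. The function name and the toy data are hypothetical, and this shows only the membership computation, not the paper's discretization or classification algorithm.

```python
from collections import defaultdict

def rough_membership(objects, labels, target):
    """Rough membership of each object in the class `target`.

    objects: list of tuples of discrete condition-attribute values.
    labels:  list of decision-class labels, aligned with objects.
    """
    counts = defaultdict(lambda: [0, 0])  # attrs -> [count in target, total]
    for attrs, label in zip(objects, labels):
        counts[attrs][1] += 1
        if label == target:
            counts[attrs][0] += 1
    return [counts[attrs][0] / counts[attrs][1] for attrs in objects]

# Hypothetical toy data: one discretized attribute, two classes.
objects = [("low",), ("low",), ("low",), ("high",)]
labels = ["setosa", "setosa", "versicolor", "versicolor"]

membership = rough_membership(objects, labels, "setosa")
print(membership)
```

Because the membership values come directly from frequency counts in the data, no separate rule-generation step is needed to obtain them.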


Association-based Unsupervised Feature Selection for High-dimensional Categorical Data (고차원 범주형 자료를 위한 비지도 연관성 기반 범주형 변수 선택 방법)

  • Lee, Changki;Jung, Uk
    • Journal of Korean Society for Quality Management / v.47 no.3 / pp.537-552 / 2019
  • Purpose: The development of information technology makes it easy to utilize high-dimensional categorical data. The purpose of this study is to propose a novel method for selecting the proper categorical variables in high-dimensional categorical data. Methods: The proposed feature selection method consists of three steps. (1) The first step defines the goodness-to-pick measure. In this paper, a categorical variable is relevant if it is related to other variables. Following this definition of relevant variables, the goodness-to-pick measure calculates the normalized conditional entropy with respect to the other variables. (2) The second step finds the relevant feature subset from the original variable set by deciding whether each variable is relevant or not. (3) The third step eliminates redundant variables from the relevant feature subset. Results: Our experimental results showed that the proposed feature selection method generally yielded better classification performance than using no feature selection in high-dimensional categorical data, especially as the number of irrelevant categorical variables increases. Moreover, as the number of irrelevant categorical variables with imbalanced categorical values increases, the difference in accuracy between the proposed method and the existing methods it was compared with also increases. Conclusion: The experimental results confirm that the proposed method consistently produces high classification accuracy in high-dimensional categorical data. The proposed method is therefore promising for effective use in high-dimensional settings.
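
The goodness-to-pick measure in step (1) relies on a normalized conditional entropy between categorical variables. The abstract does not give the exact normalization, so the sketch below assumes the common form H(X|Y) / H(X), estimated from empirical frequencies; the function names and toy data are illustrative only. Under this assumption, a value near 0 means Y almost determines X, and a value near 1 means Y carries almost no information about X.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Empirical Shannon entropy (bits) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """H(X | Y) = H(X, Y) - H(Y), estimated from paired observations."""
    return entropy(list(zip(x, y))) - entropy(y)

def normalized_conditional_entropy(x, y):
    """H(X | Y) / H(X); defined as 0.0 when X is constant (assumed convention)."""
    h_x = entropy(x)
    return conditional_entropy(x, y) / h_x if h_x > 0 else 0.0

# Hypothetical toy variables: one fully determined by x, one unrelated to x.
x = ["a", "a", "b", "b"]
y_dependent = ["u", "u", "v", "v"]
y_unrelated = ["u", "v", "u", "v"]

print(normalized_conditional_entropy(x, y_dependent))
print(normalized_conditional_entropy(x, y_unrelated))
```

A relevance score for one variable could then aggregate this quantity over all other variables, though how the paper aggregates it is not stated in the abstract.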