• Title/Summary/Keyword: Retrieved Documents

A Hangul Document Classification System using Case-based Reasoning (사례기반 추론을 이용한 한글 문서분류 시스템)

  • Lee, Jae-Sik;Lee, Jong-Woon
    • Asia pacific journal of information systems / v.12 no.2 / pp.179-195 / 2002
  • In this research, we developed an efficient Hangul document classification system for text mining. By 'efficient' we mean maintaining acceptable classification performance while reducing computing time. In our system, given a query document, k documents are first retrieved from the document case base using the k-nearest neighbor technique, the main algorithm of case-based reasoning. Then the TFIDF method, the traditional vector model in information retrieval, is applied to the query document and the k retrieved documents to classify the query document. We call this procedure the 'CB_TFIDF' method. Our results showed that the classification accuracy of CB_TFIDF was similar to that of the traditional TFIDF method, while the average time to classify one document decreased remarkably.
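
The two-stage procedure described in this abstract can be illustrated with a minimal Python sketch. The abstract does not give the similarity measure, the TFIDF weighting variant, or the voting rule, so everything here is an assumption: cosine similarity over raw term counts for the k-nearest-neighbor retrieval, a smoothed `tf * (log(n/df) + 1)` weight, and a per-label sum of similarities for the final decision. The names `cb_tfidf`, `tfidf`, `cosine`, and the toy `case_base` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tfidf(docs):
    """TF-IDF vectors for a list of tokenized documents (assumed weighting)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [
        {t: c * (math.log(n / df[t]) + 1.0) for t, c in Counter(d).items()}
        for d in docs
    ]


def cb_tfidf(query, case_base, k=3):
    """Classify `query` (a token list) against `case_base`, a list of (tokens, label)."""
    q_tf = Counter(query)
    # Stage 1: retrieve the k nearest cases using cheap term-count vectors.
    ranked = sorted(case_base, key=lambda c: cosine(q_tf, Counter(c[0])), reverse=True)
    nearest = ranked[:k]
    # Stage 2: compute TF-IDF only over the query plus the k retrieved documents.
    vectors = tfidf([query] + [tokens for tokens, _ in nearest])
    q_vec, doc_vecs = vectors[0], vectors[1:]
    scores = defaultdict(float)
    for vec, (_, label) in zip(doc_vecs, nearest):
        scores[label] += cosine(q_vec, vec)  # assumed voting rule: sum per label
    return max(scores, key=scores.get)


# Toy case base, for illustration only.
case_base = [
    ("goal match team score".split(), "sports"),
    ("team player goal league".split(), "sports"),
    ("stock market price share".split(), "finance"),
    ("bank interest market loan".split(), "finance"),
]
print(cb_tfidf("team goal win".split(), case_base))  # prints "sports"
```

The time saving reported in the abstract would come from stage 2 touching only k + 1 documents instead of the whole case base; this sketch shows that structure but makes no claim about the paper's actual measurements.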

A Study on Ranking Decision of Retrieved Documents Using User Profile (사용자 프로파일을 이용한 문서 검색순위 결정에 관한 연구)

  • Kim, Hyeong-Gyun;Kim, Yong-Ho;Lee, Sang-Beom
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / v.9 no.2 / pp.993-996 / 2005
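  • In this paper, we determine term weights by analyzing the relationships inherent among documents, centering on the common tendency shared by retrieved documents in the same field. To represent the user's fields of interest and preferences appropriately, we build and use a user profile rather than a query. The user profile consists of a term array and a preference vector for each field of interest, and it is trained toward the user with two methods: 'update by user access' and 'update using the user profile'. The 'update by user access' method applies when the user has knowledge of the subject field; in experiments, it considerably reduced the number of updates needed before the user profile properly represented the user's preferences. The 'update using the user profile' method is performed in the early stage of updating and makes the differences among preference values clearer.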

Classification of Documents using Automatic Indexing (자동 색인을 이용한 문서의 분류)

  • 신진섭;장수진
    • Journal of the Korea Society of Computer and Information / v.4 no.1 / pp.21-27 / 1999
  • In this paper, we propose a new method for automatic classification of documents using the degree of similarity between words. First, we identify relevant terms using automatic indexing. Second, we compute the frequency of words in the documents and the degree of relevance between words using a probability model. We then extract the sets of closely related words and create profiles characterizing each class. Finally, we classify the documents using these profiles. We experimented on classifying two groups of documents: some were about genetic algorithms, and the others were about neural networks. The experimental results indicate that automatic classification based on the degree of word accordance enables us to manage retrieved documents in a structured way.

Known-Item Retrieval Performance of a PICO-based Medical Question Answering Engine

  • Vong, Wan-Tze;Then, Patrick Hang Hui
    • Asia pacific journal of information systems / v.25 no.4 / pp.686-711 / 2015
  • The performance of a novel medical question-answering engine called CliniCluster and of existing search engines, such as CQA-1.0, Google, and Google Scholar, was evaluated using known-item searching. A known item is a document that has been critically appraised to be highly relevant to a therapy question. Results show that, using CliniCluster, known items were retrieved on average at rank 2 ($\mathrm{MRR@10} \approx 0.50$), and most of the known items could be identified from the top-10 document lists. In response to ill-defined questions, the known items were ranked lower by CliniCluster and CQA-1.0, whereas for Google and Google Scholar no significant difference in ranking was found between well- and ill-defined questions. Fewer than 40% of the known items could be identified from the top-10 documents retrieved by CQA-1.0, Google, and Google Scholar. An analysis of the top-ranked documents by strength of evidence revealed that CliniCluster outperformed the other search engines by providing a higher number of recent publications with the highest level of study design. In conclusion, the overall results support the use of CliniCluster in answering therapy questions by ranking highly relevant documents in the top positions of the search results.
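
To make the MRR@10 figure in this abstract concrete, here is a small Python sketch of the standard mean reciprocal rank computation with a cutoff. The function name `mrr_at_k` and the sample ranks are hypothetical; the abstract does not describe the paper's exact evaluation protocol, so this shows only the textbook definition, in which a known item found at rank r contributes 1/r and an item missing from the top k contributes 0.

```python
def mrr_at_k(ranks, k=10):
    """Mean reciprocal rank with cutoff k.

    `ranks` holds the 1-based rank of the known item for each question,
    or None when the known item was not retrieved at all.
    """
    if not ranks:
        return 0.0
    total = 0.0
    for r in ranks:
        if r is not None and r <= k:
            total += 1.0 / r  # items beyond the cutoff contribute nothing
    return total / len(ranks)


# Made-up ranks for four questions; the last known item was never retrieved.
print(mrr_at_k([1, 2, 4, None]))  # prints 0.4375
```

An MRR of 0.50 is what this formula returns when every known item sits at rank 2, which matches the abstract's reading of "on average at rank 2".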

Document Classification Model Using Web Documents for Balancing Training Corpus Size per Category

  • Park, So-Young;Chang, Juno;Kihl, Taesuk
    • Journal of information and communication convergence engineering / v.11 no.4 / pp.268-273 / 2013
  • In this paper, we propose a document classification model using Web documents as a part of the training corpus in order to resolve the imbalance of the training corpus size per category. For the purpose of retrieving the Web documents closely related to each category, the proposed document classification model calculates the matching score between word features and each category, and generates a Web search query by combining the higher-ranked word features and the category title. Then, the proposed document classification model sends each combined query to the open application programming interface of the Web search engine, and receives the snippet results retrieved from the Web search engine. Finally, the proposed document classification model adds these snippet results as Web documents to the training corpus. Experimental results show that the method that considers the balance of the training corpus size per category exhibits better performance in some categories with small training sets.

AutoCor: A Query Based Automatic Acquisition of Corpora of Closely-related Languages

  • Dimalen, Davis Muhajereen D.;Roxas, Rachel Edita O.
    • Proceedings of the Korean Society for Language and Information Conference / 2007.11a / pp.146-154 / 2007
  • AutoCor is a method for the automatic acquisition and classification of corpora of documents in closely-related languages. It is an extension and enhancement of CorpusBuilder, a system that automatically builds specific minority language corpora from a closed corpus, since some Tagalog documents retrieved by CorpusBuilder are actually documents in other closely-related Philippine languages. AutoCor used the odds ratio query generation method and introduced the concept of common word pruning to differentiate between documents of closely-related Philippine languages and Tagalog. The performance of the system with and without pruning was compared, and common word pruning was found to improve the precision of the system.
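
As a rough illustration of odds-ratio query term selection combined with common word pruning, consider the Python sketch below. The abstract does not define either step precisely, so the details are assumptions: a smoothed log odds ratio over document frequencies, and a pruning rule that simply discards any word seen in both the target-language and other-language documents. The function names and the toy word lists are hypothetical and are not taken from AutoCor or CorpusBuilder.

```python
import math


def doc_freq(docs):
    """Number of documents containing each term."""
    counts = {}
    for doc in docs:
        for term in set(doc):
            counts[term] = counts.get(term, 0) + 1
    return counts


def select_query_terms(target_docs, other_docs, n_terms=2, prune_common=True):
    """Pick the terms that best separate target-language documents from the others."""
    df_target, df_other = doc_freq(target_docs), doc_freq(other_docs)
    scores = {}
    for term, count in df_target.items():
        if prune_common and term in df_other:
            continue  # assumed pruning rule: drop words shared by both languages
        # Add-one-half smoothing keeps both probabilities strictly inside (0, 1).
        p = (count + 0.5) / (len(target_docs) + 1.0)
        q = (df_other.get(term, 0) + 0.5) / (len(other_docs) + 1.0)
        scores[term] = math.log(p * (1 - q) / ((1 - p) * q))
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda t: (-scores[t], t))[:n_terms]


# Toy data: two target-language documents and two from a closely related language.
target_docs = [["ang", "bata", "kumain"], ["ang", "bahay", "kumain"]]
other_docs = [["ang", "balay", "nikaon"], ["ang", "bata", "nikaon"]]
print(select_query_terms(target_docs, other_docs))  # prints ['kumain', 'bahay']
```

In this toy setup the shared words "ang" and "bata" never reach the query, so queries built from the remaining terms are less likely to pull in documents from the related language.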

A Document Ranking Method by Document Clustering Using Bayesian SOM and Bootstrap (베이지안 SOM과 붓스트랩을 이용한 문서 군집화에 의한 문서 순위조정)

  • Choe, Jun-Hyeok;Jeon, Seong-Hae;Lee, Jeong-Hyeon
    • The Transactions of the Korea Information Processing Society / v.7 no.7 / pp.2108-2115 / 2000
  • Conventional Boolean retrieval systems based on the vector space model can return retrieval results quickly, but they cannot accurately reflect the user's retrieval intent, including semantic information. Consequently, the retrieval results differ greatly from what users expect, which forces users to spend much time finding the desired documents among those retrieved. In this paper, we design a Bayesian SOM (Self-Organizing feature Map), which combines Bayesian statistical methods with the Kohonen network, a kind of unsupervised learning, and use it to classify documents in real time according to their semantic similarity to the user query. If it is difficult to observe statistical characteristics because there are fewer than 30 documents for clustering, the number of documents must be increased to at least 50. Also, to give a high rank to the documents most semantically similar to the user query among the generalized clusters, we compute the similarity by means of the Kohonen centroid of each document cluster and adjust the secondary rank according to that similarity.

Topical Clustering Techniques of Twitter Documents Using Korean Wikipedia (한글 위키피디아를 이용한 트위터 문서의 주제별 클러스터링 기법)

  • Chang, Jae-Young
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.14 no.5 / pp.189-196 / 2014
  • Recently, the need for document retrieval has been growing in SNS environments such as Twitter. To support Twitter search, a clustering technique that classifies the large volume of retrieved documents by topic is required. However, due to the nature of Twitter, there is a limit to applying earlier simple techniques to clustering Twitter documents. To overcome this problem, we propose a new clustering technique suited to the Twitter environment. In the proposed method, we augment the feature vectors representing Twitter documents with new terms and recalculate the feature weights using Korean Wikipedia. In addition, we performed experiments with Korean Twitter documents and demonstrated the usability of the proposed method through a performance comparison with previous techniques.

WebDBs: A User-Oriented Web Search Engine (WebDBs: 사용자 중심의 웹 검색 엔진)

  • 김홍일;임해철
    • The Journal of Korean Institute of Communications and Information Sciences / v.24 no.7B / pp.1331-1341 / 1999
  • This paper proposes WebDBs (Web Database system), which retrieves information registered on the web using a query language similar to SQL. The proposed system automatically extracts the information needed for retrieval from HTML documents dispersed across the web, and it can process SQL-based queries over the extracted information. A web database system spends most of its query processing time fetching documents over the network. Therefore, because most web retrieval exhibits web locality, previously retrieved information is stored in a cache and reused in similar applications. We propose a cache mechanism adapted to user applications that stores cached information in association with the query that retrieved it. A web search engine is implemented based on these concepts.
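
The idea of storing cached information in association with the query that retrieved it can be sketched in Python as follows. The abstract does not describe the cache layout or replacement policy, so this is only an assumed design: results are keyed by a whitespace-normalized query string and evicted in least-recently-used order. `QueryCache`, its `fetch` callback, and the sample queries are hypothetical names for illustration, not part of WebDBs.

```python
from collections import OrderedDict


class QueryCache:
    """Keeps results keyed by query so repeated queries skip the network fetch."""

    def __init__(self, fetch, capacity=2):
        self.fetch = fetch            # hypothetical callback that does the slow retrieval
        self.capacity = capacity      # maximum number of cached queries
        self.entries = OrderedDict()  # ordered from least to most recently used
        self.misses = 0

    def get(self, query):
        key = " ".join(query.split())  # normalize whitespace only
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        self.misses += 1
        result = self.fetch(key)
        self.entries[key] = result
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used query
        return result


def slow_fetch(query):
    """Stand-in for fetching and parsing HTML documents over the network."""
    return "rows for " + query


cache = QueryCache(slow_fetch)
cache.get("SELECT title  FROM pages")
cache.get("SELECT title FROM pages")  # same query after normalization: cache hit
print(cache.misses)  # prints 1
```

Keying on the query text is the simplest way to tie cached data to the request that produced it; a cache that also serves similar but non-identical queries, as the abstract suggests, would need a matching rule that the abstract does not specify.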

Ranking Decision Method of Retrieved Documents Using User Profile from Searching Engine (검색 엔진에서 사용자 프로파일을 이용한 문서 순위결정 방법)

  • Kim Yong-Ho;Kim Hyeong-Gyun
    • Journal of the Korea Institute of Information and Communication Engineering / v.10 no.9 / pp.1590-1595 / 2006
  • This paper proposes a user-oriented document ranking technique that uses a user profile to provide more satisfactory results reflecting the preferences of specific users. A user profile is constructed to represent the user's preferences. It consists of a 'term array' and a 'preference vector' for each of the user's fields of interest. The user profile for a particular person is updated by the 'user access', 'latent relation', and 'user profile' methods proposed in this paper. The latent structures of documents in the same domain are analyzed by singular value decomposition (SVD). Then the rank of documents is determined by comparing the user profile with the analyzed documents on the basis of relevance.