• Title/Summary/Keyword: Document clustering

Comparison of graph clustering methods for analyzing the mathematical subject classification codes

  • Choi, Kwangju;Lee, June-Yub;Kim, Younjin;Lee, Donghwan
    • Communications for Statistical Applications and Methods / v.27 no.5 / pp.569-578 / 2020
  • Various graph clustering methods have been introduced to identify communities in social or biological networks. This paper studies entropy-based and Markov chain-based methods for clustering undirected graphs. We compare the performance of these two methods with that of conventional methods using clustering quality measures. As a real application, we collect the mathematical subject classification (MSC) codes of research papers from published mathematical databases and construct a weighted code-to-document matrix to which graph clustering methods are applied. We aim to group MSC codes into the same cluster if they appear together in many papers. We compare the MSC clustering results using several assessment measures and conclude that the Markov chain-based method is suitable for clustering the MSC codes.
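
The grouping criterion in this abstract (codes that appear together in many papers) can be illustrated with a minimal sketch. It assumes plain co-occurrence counts as edge weights, which may differ from the weighting the paper actually uses; the function name `cooccurrence_weights` and the toy codes are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_weights(papers):
    """Count, for each unordered pair of codes, the papers that list both.

    Assumption: the edge weight is a raw co-occurrence count. The paper
    builds a weighted code-to-document matrix whose exact weights are not
    given in the abstract.
    """
    weights = defaultdict(int)
    for codes in papers:
        # Deduplicate and sort so each pair has one canonical key.
        for a, b in combinations(sorted(set(codes)), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical toy data: each paper is a list of MSC-style codes.
papers = [
    ["05C", "68R", "90C"],
    ["05C", "68R"],
    ["35Q", "76D"],
    ["05C", "90C"],
]
weights = cooccurrence_weights(papers)
```

The resulting dictionary is an undirected weighted edge list that a graph clustering method could take as input.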

Generic Document Summarization using Coherence of Sentence Cluster and Semantic Feature (문장군집의 응집도와 의미특징을 이용한 포괄적 문서요약)

  • Park, Sun;Lee, Yeonwoo;Shim, Chun Sik;Lee, Seong Ro
    • Journal of the Korea Institute of Information and Communication Engineering / v.16 no.12 / pp.2607-2613 / 2012
  • The results of generic summarization based on inherent knowledge are influenced by the composition of sentences in the document set. To resolve this problem, this paper proposes a new generic document summarization method that uses clustering of the semantic features of documents and the coherence of document clusters. The proposed method clusters sentences using semantic features derived from NMF (non-negative matrix factorization); it can identify the topic groups of a document because the inherent structure of the document is well represented by the sentence clusters. In addition, the method can improve the quality of summarization because important sentences are extracted using the coherence of sentence clusters and the clusters are refined by re-clustering. The experimental results demonstrate that applying the proposed method to generic summarization achieves better performance than other generic document summarization methods.

Fuzzy Clustering Algorithm for Web-mining (웹마이닝을 위한 퍼지 클러스터링 알고리즘)

  • Lim, Young-Hee;Song, Ji-Young;Park, Dai-Hee
    • Journal of the Korean Institute of Intelligent Systems / v.12 no.3 / pp.219-227 / 2002
  • Post-clustering algorithms, which cluster the results of a Web search engine, have requirements that differ from those of conventional clustering algorithms. In this paper, we propose a new post-clustering algorithm that satisfies as many of those requirements as possible. The proposed fuzzy Concept ART combines the concept vector, which has several advantages in document clustering, with fuzzy ART, a real-time clustering algorithm based on fuzzy set theory. Moreover, we show that it is applicable to general-purpose clustering as well as post-clustering.

Building Topic Hierarchy of e-Documents using Text Mining Technology

  • Kim, Han-Joon
    • Proceedings of the CALSEC Conference / 2004.02a / pp.294-301 / 2004
  • Text-mining approach to organizing e-documents by topic hierarchy, based on machine learning and information theory. 'Category (topic) discovery' problem → document bundle-based, user-constrained document clustering. 'Automatic categorization' problem → accelerated EM with CU-based active learning. 'Hierarchy construction' problem → unsupervised learning of category subsumption relations.

Topic-based Multi-document Summarization Using Non-negative Matrix Factorization and K-means (비음수 행렬 분해와 K-means를 이용한 주제기반의 다중문서요약)

  • Park, Sun;Lee, Ju-Hong
    • Journal of KIISE:Software and Applications / v.35 no.4 / pp.255-264 / 2008
  • This paper proposes a novel method that uses K-means and non-negative matrix factorization (NMF) for topic-based multi-document summarization. NMF decomposes a weighted term-by-sentence matrix into two sparse non-negative matrices: a semantic feature matrix and a semantic variable matrix. The semantic features obtained are intuitively comprehensible. Weighted similarity between the topic and the semantic features can prevent meaningless sentences that are similar to the topic from being selected. K-means clustering removes noise from the sentences so that biased semantics of the documents are not reflected in the summaries. In addition, the coherence of the document summaries can be enhanced by arranging the selected sentences in the order of their ranks. The experimental results show that the proposed method achieves better performance than other methods.
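
The decomposition step described in this abstract can be sketched as follows, assuming the standard multiplicative update rules for the squared-error objective; the abstract does not say which NMF algorithm the authors use. The helper names, the toy matrix, and the rank are hypothetical, and the code uses only the Python standard library.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(a, w, h):
    """Squared Frobenius norm of A - WH."""
    wh = matmul(w, h)
    return sum((a[i][j] - wh[i][j]) ** 2
               for i in range(len(a)) for j in range(len(a[0])))


def nmf(a, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix A (terms x sentences) as W @ H.

    W plays the role of the semantic feature matrix and H that of the
    semantic variable matrix. The update rule is an assumption.
    """
    rng = random.Random(seed)
    m, n = len(a), len(a[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T A) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, a), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n)]
             for k in range(rank)]
        # W <- W * (A H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(a, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(m)]
    return w, h


# Hypothetical term-by-sentence weights with two visible topic blocks.
a = [
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
]
w0, h0 = nmf(a, rank=2, iterations=0)  # random starting point
w, h = nmf(a, rank=2)                  # same seed, after 200 updates
```

Comparing `frobenius_error(a, w0, h0)` with `frobenius_error(a, w, h)` shows how far the updates reduce the reconstruction error from the random start.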

Multi-document Summarization Based on Cluster using Term Co-occurrence (단어의 공기정보를 이용한 클러스터 기반 다중문서 요약)

  • Lee, Il-Joo;Kim, Min-Koo
    • Journal of KIISE:Software and Applications / v.33 no.2 / pp.243-251 / 2006
  • In multi-document summarization by means of salient sentence extraction, it is important to remove redundant information. The removal process considers the similarities and differences between sentences. In this paper, we propose a multi-document summarization method that extracts salient sentences without redundancy by means of a cohesive term clustering method that uses co-occurrence information. The cohesive term clustering method assumes that terms do not exist independently but are semantically related to one another. To find the relations between terms, we cluster sentences according to topic and use the co-occurrence information of terms within the same topic. We conduct experiments with the DUC (Document Understanding Conferences) data. In these tests, our method summarizes better than other methods that use term co-occurrence information based on term cohesion at the document or sentence level, and better than methods that use simple statistical information.

Semantic Clustering Model for Analytical Classification of Documents in Cloud Environment (클라우드 환경에서 문서의 유형 분류를 위한 시맨틱 클러스터링 모델)

  • Kim, Young Soo;Lee, Byoung Yup
    • The Journal of the Korea Contents Association / v.17 no.11 / pp.389-397 / 2017
  • Semantic web documents are now produced and added to repositories in cloud computing environments, which requires an intelligent semantic agent for analytical classification of documents and information retrieval. Traditional information retrieval methods use keywords for queries and deliver the document list returned by the search. Users bear a heavy workload in examining the contents because these methods do not provide much semantic similarity information. To solve these problems, we propose a semantic clustering model based on keyword frequency and concept matching, using Hadoop and NoSQL, to improve the accuracy of similarity-based classification. Implementing the proposed technique in a cloud computing environment makes it possible to classify and discover similar documents with improved classification accuracy. The proposed model is expected to be used in constructing semantic web retrieval systems that retrieve relevant documents more flexibly.

A Study on Keyword Extraction From a Single Document Using Term Clustering (용어 클러스터링을 이용한 단일문서 키워드 추출에 관한 연구)

  • Han, Seung-Hee
    • Journal of the Korean Society for Library and Information Science / v.44 no.3 / pp.155-173 / 2010
  • In this study, a new keyword extraction algorithm based on term clustering is applied to a single document. A single document is divided into multiple passages, and two ways of calculating the similarity between two terms are investigated: first-order similarity and second-order distributional similarity. In the first experiment, the best clustering performance is achieved with 50-term passages and second-order distributional similarity. Based on these results, second-order distributional similarity was also applied to various keyword extraction methods that use statistical information about terms. In the second experiment, pf (paragraph frequency) and tf × ipf (term frequency by inverse paragraph frequency) were found to improve the overall performance of keyword extraction. The results therefore show that the algorithm fulfills the necessary conditions that good keywords should satisfy.
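
The tf × ipf weight named in this abstract can be sketched by analogy with tf-idf. This assumes ipf = log(N / pf), where N is the number of paragraphs and pf is the number of paragraphs containing the term; the abstract does not give the exact formula, and the function name and toy paragraphs are hypothetical.

```python
import math
from collections import Counter


def tf_ipf(paragraphs):
    """Score each term by term frequency times inverse paragraph frequency.

    Assumption: ipf = log(N / pf), mirroring the usual idf definition.
    """
    n = len(paragraphs)
    # tf: total occurrences of the term in the whole document.
    tf = Counter(term for p in paragraphs for term in p)
    # pf: number of paragraphs in which the term occurs at least once.
    pf = Counter(term for p in paragraphs for term in set(p))
    return {t: tf[t] * math.log(n / pf[t]) for t in tf}


# Hypothetical single document split into three tokenized paragraphs.
paragraphs = [
    ["cluster", "term", "keyword"],
    ["cluster", "passage", "cluster"],
    ["cluster", "keyword", "similarity"],
]
scores = tf_ipf(paragraphs)
```

Under this definition a term that occurs in every paragraph, such as "cluster" here, scores zero however frequent it is, while terms concentrated in few paragraphs score higher.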

Research on Function and Policy for e-Government System using Semantic Technology (전자정부내 의미기반 기술 도입에 따른 기능 및 정책 연구)

  • Go, Gwang-Seop;Jang, Yeong-Cheol;Lee, Chang-Hun
    • 한국디지털정책학회:학술대회논문집 (Korea Digital Policy Society: Conference Proceedings) / 2007.06a / pp.79-87 / 2007
  • This paper aims to offer a solution based on semantic document classification to improve e-Government utilization and efficiency for people who use their own information retrieval systems and linguistic expressions. In general, semantic document classification is an approach that classifies documents based on the diverse relationships between keywords in a document without fully describing the hierarchical concepts between keywords. Our approach considers the deep meanings within the context of the document and substantially enhances information retrieval performance. We propose, test, and evaluate the Concept Weight Document Classification (CoWDC) method, which goes beyond existing keyword and simple thesaurus/ontology methods by fully considering the concept hierarchy of various concepts. We recognize that, in order to verify the superiority of the semantic retrieval technology through the CoWDC test results and to integrate it efficiently into e-Government, creation of a thesaurus, management of the operating system, expansion of the knowledge base, and improvements in search service and accuracy are needed at the national level.

Selection of Cluster Topic Words in Hierarchical Clustering using K-Means Algorithm

  • Lee Shin Won;Yi Sang Seon;An Dong Un;Chung Sung Jong
    • Proceedings of the IEEK Conference / 2004.08c / pp.885-889 / 2004
  • Fast and high-quality document clustering algorithms play an important role in data exploration by organizing large amounts of information into a small number of meaningful clusters. Hierarchical clustering improves retrieval performance and makes results easier for users to understand. To improve clustering, we implemented a hierarchical structure with variety and readability by carefully selecting cluster topic words and deciding the number of clusters dynamically. Selecting topic words is important because the hierarchical clustering structure summarizes the search results. We chose nouns as cluster topic words. The quality of the topic words increased by 33% with the following scheme: only nouns are extracted as topic words for the top-level cluster, and topic words already used are not reused for the child clusters.
