• Title/Summary/Keyword: term extraction (용어추출)

Automatic Determination of Usenet News Groups from User Profile (사용자 프로파일에 기초한 유즈넷 뉴스그룹 자동 결정 방법)

  • Kim, Jong-Wan;Cho, Kyu-Cheol;Kim, Hee-Jae;Kim, Byeong-Man
    • Journal of the Korean Institute of Intelligent Systems / v.14 no.2 / pp.142-149 / 2004
  • Retrieving information that precisely matches a user's needs from the large volume of Usenet news, and filtering it quickly, is important. Unlike email, Usenet requires users to register in advance the news groups that interest them before they can receive news. However, it is not easy for a novice to decide which news groups are relevant to his or her interests. In this work, we present a service that uses a Kohonen network to classify the news groups a user prefers among the many available. We first extract candidate terms from example documents and then, through fuzzy inference, choose from them a number of representative keywords for use in the Kohonen network. Observation of the training patterns revealed a sparsity problem: many keyword entries in the training patterns are empty. We therefore propose a new method that trains the neural network after removing unnecessary dimensions using the statistical coefficient of determination. Experimental results show that the proposed method is superior to the method that uses every dimension in terms of cluster overlap, which is defined using within-cluster distance and between-cluster distance.

An Effective Incremental Text Clustering Method for the Large Document Database (대용량 문서 데이터베이스를 위한 효율적인 점진적 문서 클러스터링 기법)

  • Kang, Dong-Hyuk;Joo, Kil-Hong;Lee, Won-Suk
    • The KIPS Transactions: Part D / v.10D no.1 / pp.57-66 / 2003
  • With the development of the internet and computers, the amount of information available through the internet is increasing rapidly, and it is managed in document form. Research into methods for managing large numbers of documents effectively is therefore necessary. Document clustering groups documents by subject by classifying a set of documents according to their similarity. Accordingly, document clustering can be used for exploring and searching documents, and it can increase search accuracy. This paper proposes an efficient incremental clustering method for a document set that grows gradually. The incremental document clustering algorithm assigns a set of new documents to the legacy clusters that have been identified in advance. In addition, to improve the correctness of the clustering, stop words are removed and the weight of each word is calculated by the proposed TF$\times$NIDF function.
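
The incremental assignment step described in this abstract can be sketched as follows. This is only an illustration, not the authors' algorithm: the cosine similarity measure, the `threshold` parameter, the running-mean centroid update, and the choice to open a new cluster when no legacy cluster is similar enough are all assumptions, and the TF$\times$NIDF weighting is not reproduced because the abstract does not define it.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def assign_incrementally(clusters, new_docs, threshold=0.5):
    """Assign each new document vector to the most similar existing cluster.

    clusters: list of {"centroid": [...], "size": int}, modified in place.
    threshold: hypothetical cutoff; below it a new cluster is opened (assumption).
    Returns the cluster index chosen for each new document.
    """
    labels = []
    for vec in new_docs:
        sims = [cosine(vec, c["centroid"]) for c in clusters]
        best = max(range(len(sims)), key=sims.__getitem__) if sims else None
        if best is not None and sims[best] >= threshold:
            c = clusters[best]
            n = c["size"]
            # Running mean: fold the new vector into the existing centroid.
            c["centroid"] = [(m * n + x) / (n + 1) for m, x in zip(c["centroid"], vec)]
            c["size"] = n + 1
        else:
            clusters.append({"centroid": list(vec), "size": 1})
            best = len(clusters) - 1
        labels.append(best)
    return labels

# Toy data: two legacy clusters and three newly arriving document vectors.
legacy = [{"centroid": [1.0, 0.0], "size": 2}, {"centroid": [0.0, 1.0], "size": 1}]
labels = assign_incrementally(legacy, [[0.9, 0.1], [0.0, 2.0], [-1.0, -1.0]])
```

Only the centroid and size of each legacy cluster are kept, so new documents can be absorbed without re-clustering the whole collection.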

Medicine Ontology Building based on Semantic Relation and Its Application (의미관계 정보를 이용한 약품 온톨로지의 구축과 활용)

  • Lim Soo-Yeon;Park Seong-Bae;Lee Sang-Jo
    • Journal of KIISE: Software and Applications / v.32 no.5 / pp.428-437 / 2005
  • An ontology consists of a set of concepts, with their definitions, that represent the characteristics of a given domain and the relationships between its elements. To reduce the time and cost of building an ontology, this paper proposes a semiautomatic method for building a domain ontology using the results of text analysis. To do this, we propose a terminology processing method and use the extracted concepts and the semantic relations between them to build the ontology. The pharmacy field is selected as the experimental domain, and the resulting ontology is applied to document retrieval. To demonstrate the usefulness of the hierarchical relations in the ontology for document retrieval, we compared a typical keyword-based retrieval method with an ontology-based retrieval method, which uses related information in the ontology for relevance feedback. As a result, the latter improves precision and recall by $4.97\%$ and $0.78\%$, respectively.

Implementation of Ontology-based Service by Exploiting Massive Crime Investigation Records: Focusing on Intrusion Theft (대규모 범죄 수사기록을 활용한 온톨로지 기반 서비스 구현 - 침입 절도 범죄 분야를 중심으로 -)

  • Ko, Gun-Woo;Kim, Seon-Wu;Park, Sung-Jin;No, Yoon-Joo;Choi, Sung-Pil
    • Journal of the Korean Society for Library and Information Science / v.53 no.1 / pp.57-81 / 2019
  • An ontology is a complex structured dictionary that defines the terms related to specific knowledge in a particular field and the relationships between them. Various ontologies have been constructed in Korea and abroad, but there has been no case in which large-scale crime investigation records were built into an ontology and a service was implemented on top of it. This paper therefore describes the process of constructing an ontology from information extracted from crime investigation documents, which are unstructured data, in the field of intrusion theft, and of implementing an ontology-based search service and a crime spot recommendation service. To assess the performance of the search service, we measured Top-K accuracy, one of the accuracy measures for event search, and obtained a maximum accuracy of 93.52% on the experimental data set. In addition, we obtained a suitable combination of clue fields for the entire experimental data set, and we were able to calibrate the field location information in the database with an F1-measure of 76.19%.
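
For readers unfamiliar with the Top-K accuracy measure mentioned in this abstract, a minimal sketch under a common definition follows: a query counts as a hit if its target item appears among the first k ranked results. The function name, the case identifiers, and the toy rankings are hypothetical and are not taken from the paper.

```python
def top_k_accuracy(ranked_results, targets, k):
    """Fraction of queries whose target appears in the first k ranked results."""
    hits = sum(1 for results, target in zip(ranked_results, targets)
               if target in results[:k])
    return hits / len(targets)

# Toy data: three queries, each with a ranked list of retrieved case IDs.
ranked = [["c3", "c1", "c7"], ["c2", "c9", "c4"], ["c5", "c6", "c8"]]
targets = ["c1", "c4", "c0"]

acc_at_1 = top_k_accuracy(ranked, targets, 1)
acc_at_3 = top_k_accuracy(ranked, targets, 3)
```

Accuracy can only stay the same or rise as k grows, which is why results are usually reported for several values of k.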

Designing and Evaluating Digital Video Storyboard Surrogates (디지털 영상 초록의 설계와 평가에 관한 연구)

  • Kim, Hyun-Hee;Kim, Yong-Ho;Ko, Su-Hyun
    • Journal of Korean Library and Information Science Society / v.38 no.4 / pp.463-480 / 2007
  • This study examines the design and utilization of video storyboard surrogates in digital video libraries. To do this, we first constructed an arrangement model of key frames for storyboards based on the FRBR model, image communication theory, and PRECIS indexing theory, and evaluated the model using 6 sample videos and 26 participants. The results show that video storyboard surrogates based on the arrangement model achieve higher summary-extraction accuracy than the sequential video storyboard. Moreover, viewing both types of video storyboard one after the other, especially browsing the sequential video storyboard first and then the arrangement model-based one, produces a remarkable increase in summary-extraction accuracy. The study proposes two methods of utilizing video storyboard surrogates in digital video libraries: designing a video browsing interface where users see the sequential storyboard by default and then the arrangement model-based one for re-watching; and using the arrangement model-based storyboard as a structured match source for image-based queries.

A Study of automatic indexing based on the linguistic analysis for newspaper articles (언어학적 분석기법에 의한 신문기사 자동색인시스팀 설계에 관한 연구)

  • Seo, Gyeong-Ju;SaGong, Cheol
    • Journal of the Korean Society for Information Management / v.8 no.1 / pp.78-99 / 1991
  • So far, most newspaper indexing in Korea has been done manually using a thesaurus. In recent years, however, the need for automatic indexing systems has grown so that indexers can save time, effort, and money. Some newspapers have also started establishing their own databases while introducing electronic newspapers and CTS. This thesis concerns establishing an automatic indexing system for the full text of the Korea Economic Daily's articles, which have been accumulated in its database, KETEL. In this thesis, I suggest methods for creating a keyword file, a stopword list, an auxiliary word list, and an inflected word list by applying linguistic analysis methods to Hangul, taking advantage of the language's morphological peculiarities. Through these studies, I reached the following four conclusions. First, satisfactory keywords can be obtained by automatic indexing methods based on morphological analysis. Second, because syntactic and semantic analysis of Hangul is not yet complete, an indexer can improve the efficiency of indexing work by controlling the extracted vocabulary. Third, the keyword file in this system, which consists of about 20,000 of the most frequently used newspaper terms, can be used in the future for compiling a thesaurus. Finally, the suggested methods for preparing an auxiliary word list and an inflected word list can be applied to the design of other automatic systems.

Feature-selection algorithm based on genetic algorithms using unstructured data for attack mail identification (공격 메일 식별을 위한 비정형 데이터를 사용한 유전자 알고리즘 기반의 특징선택 알고리즘)

  • Hong, Sung-Sam;Kim, Dong-Wook;Han, Myung-Mook
    • Journal of Internet Computing and Services / v.20 no.1 / pp.1-10 / 2019
  • Because big-data text mining extracts many features and much data, clustering and classification can incur high computational complexity and yield analysis results of low reliability. In particular, a term-document matrix obtained through text mining represents term-document features but is sparse. We designed an advanced genetic algorithm (GA) to extract features in text mining for a detection model. Term frequency-inverse document frequency (TF-IDF) is used to reflect the document-term relationships in feature extraction. Through a repetitive process, a predetermined number of features is selected. We also used a sparsity score to improve the performance of the detection model. If a spam mail data set has high sparsity, the detection model performs poorly and the optimal detection model is difficult to find. In addition, we find a low-sparsity model that also has a high TF-IDF score by using s(F) as the numerator of the fitness function. We also verified the performance of the proposed algorithm by applying it to text classification. As a result, we found that our algorithm shows higher performance (speed and accuracy) in attack mail classification.
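
The two quantities this abstract combines, TF-IDF weights and the sparsity of a selected feature subset, can be illustrated with a small sketch. The abstract does not give the paper's s(F) or its fitness function, so the code below only assumes a common smoothed IDF variant and defines sparsity as the fraction of zero cells in the selected columns; all function names and the toy mails are hypothetical.

```python
import math

def tfidf_matrix(docs):
    """Build a document-by-term TF-IDF matrix from tokenized documents.

    Uses relative term frequency and a smoothed IDF, ln((1 + n) / (1 + df)) + 1,
    so a cell is zero only when the term is absent from the document.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n = len(docs)
    df = {t: sum(1 for doc in docs if t in doc) for t in vocab}
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = [[doc.count(t) / len(doc) * idf[t] for t in vocab] for doc in docs]
    return vocab, rows

def sparsity(rows, cols):
    """Fraction of zero cells among the selected feature columns (assumed definition)."""
    cells = [row[c] for row in rows for c in cols]
    return sum(1 for v in cells if v == 0) / len(cells)

# Toy corpus of three tokenized mails.
mails = [["free", "offer", "click"],
         ["meeting", "agenda"],
         ["free", "click", "click", "now"]]
vocab, rows = tfidf_matrix(mails)

all_features = list(range(len(vocab)))
selected = [vocab.index("click"), vocab.index("free")]
full_sparsity = sparsity(rows, all_features)
subset_sparsity = sparsity(rows, selected)
```

In this toy corpus the selected subset is less sparse than the full vocabulary, which is the kind of candidate a sparsity-aware fitness function would favor.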

A Study on Patent Literature Classification Using Distributed Representation of Technical Terms (기술용어 분산표현을 활용한 특허문헌 분류에 관한 연구)

  • Choi, Yunsoo;Choi, Sung-Pil
    • Journal of the Korean Society for Library and Information Science / v.53 no.2 / pp.179-199 / 2019
  • In this paper, we propose optimal methodologies for classifying patent literature by examining various feature extraction methods, machine learning models, and deep learning models, and we identify the best-performing configuration through experiments. We compared the traditional BoW method with a distributed representation method (word embedding vectors) for feature extraction, and compared morphological analysis with multi-gram as methods of constructing the document collection. In addition, classification performance was verified using a traditional machine learning model and a deep learning model. Experimental results show that the best performance is achieved when the deep learning model is applied with distributed representation and morphological analysis-based feature extraction. In the Section, Class, and Subclass classification experiments, we improved performance by 5.71%, 18.84%, and 21.53%, respectively, compared with traditional classification methods.

Analysis of the Earth Science Vocabularies Used in the 11th Grade Science Textbooks (지구과학 I 교과서 어휘 등급 분석 - 살아있는 지구 단원을 중심으로-)

  • Im, Young-Goo;Park, Hye-Jin;Lee, Hyonyong;Kim, Taesu;Oh, Heejin
    • Journal of Science Education / v.32 no.2 / pp.87-102 / 2008
  • The purposes of this study were to analyze the vocabulary used in the 'Living Earth' section of 11th-grade Earth science textbooks with the Science Word Analysis (SWA) program and to investigate the vocabulary that 11th-grade students selected as difficult. For this purpose, we extracted the Earth science vocabulary from six textbooks and classified it into scientific and non-scientific vocabulary with the SWA program, based on the standard Korean language dictionary. We also investigated the difficulty of each vocabulary item by administering a questionnaire to 360 students. The program's analysis showed that, in all textbooks, scientific vocabulary classified as out of the level was more frequent than that of any other level. The survey showed that most of the vocabulary students selected as difficult to understand was also classified as out of the level. These results suggest that students' cognitive level should be considered in developing science textbooks and that difficult vocabulary should be replaced with easier words as long as the meaning is unchanged.

Big Data Analysis for Strategic Use of Urban Brands: Case Study Seoul city brand "I SEOUL U" (도시 브랜드의 전략적 활용을 위한 빅데이터 분석 : 서울시 도시 브랜드 "I SEOUL U" 사례)

  • Lim, Haewen
    • The Journal of the Korea Contents Association / v.22 no.1 / pp.197-213 / 2022
  • In this study, text mining analysis was performed on online big data to examine the recognition and assessment of the urban brand I Seoul U. To this end, TEXTOM, a program for data acquisition and analysis, was used, and 'I SEOUL U' was selected as the analysis keyword. Keyword analysis shows the keywords associated with I Seoul U to be as follows. First, as business and marketing terms, the keywords include pop-up store, gallery, co-branding (festival, etc.), commodities, private companies, and online. Second, as event-related terms, the keywords include Han River, tree-planting day, tree planting, Hongdae, Christmas, Mapo, Jung-gu, Sejong University, and festival. Third, as promotional terms, the keywords include robotics engineer Dr. Dennis Hong, Government, Art, and Korea. In the N-gram analysis, I Seoul U, the city brand of Seoul created in the public interest, was found to contribute to the commercial activities of private companies. In the connection-oriented analysis, business and marketing, events, and promotions were derived as categories. In the matrix analysis, it was found that pop-up store products are the main products being developed and that products in the form of co-branding are also being developed. In the topic modeling, a total of 10 topics were extracted, most of which concerned needs for commercial utilization and information about events and festivals.