
An Automated Topic Specific Web Crawler Calculating Degree of Relevance (연관도를 계산하는 자동화된 주제 기반 웹 수집기)

  • Seo Hae-Sung;Choi Young-Soo;Choi Kyung-Hee;Jung Gi-Hyun;Noh Sang-Uk
    • Journal of Internet Computing and Services
    • /
    • v.7 no.3
    • /
    • pp.155-167
    • /
    • 2006
  • It is desirable for users surfing the Internet to find Web pages related to their interests as closely as possible. Toward this end, this paper presents a topic-specific Web crawler that computes the degree of relevance, collects a cluster of pages given a specific topic, and refines the preliminary set of related Web pages using term frequency/document frequency, entropy, and compiled rules. In the experiments, we tested our topic-specific crawler in terms of its classification accuracy, crawling efficiency, and crawling consistency. First, the classification accuracy using the set of rules compiled by CN2 was the best among those of the C4.5 and backpropagation learning algorithms. Second, we measured the crawling efficiency to determine the best threshold value affecting the degree of relevance. In the third experiment, the consistency of our topic-specific crawler was measured in terms of the number of resulting URLs that overlapped across different starting URLs. The experimental results imply that our topic-specific crawler was fairly consistent regardless of the randomly chosen starting URLs.
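The relevance computation described above can be sketched with a toy term frequency/document frequency score; this is an illustrative stand-in (the paper's actual pipeline also uses entropy and rules compiled by CN2, which are not reproduced here), and all names and corpus statistics below are hypothetical:

```python
import math
from collections import Counter

def tfidf_relevance(page_tokens, topic_terms, doc_freq, n_docs):
    """Sum tf-idf weights of topic terms appearing in the page."""
    tf = Counter(page_tokens)
    score = 0.0
    for term in topic_terms:
        if tf[term]:
            idf = math.log(n_docs / (1 + doc_freq.get(term, 0)))
            score += tf[term] * idf
    return score / max(len(page_tokens), 1)  # length-normalise

# Hypothetical corpus statistics and page text
doc_freq = {"crawler": 3, "topic": 10, "web": 8}
page = "a topic specific web crawler visits topic relevant pages".split()
score = tfidf_relevance(page, ["crawler", "topic", "web"], doc_freq, n_docs=100)
print(score > 0.0)  # pages scoring above a tuned threshold would be crawled
```

A crawler of this kind would follow outgoing links only from pages whose score exceeds the threshold, which is the quantity the paper's second experiment tunes.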


Hybrid Word-Character Neural Network Model for the Improvement of Document Classification (문서 분류의 개선을 위한 단어-문자 혼합 신경망 모델)

  • Hong, Daeyoung;Shim, Kyuseok
    • Journal of KIISE
    • /
    • v.44 no.12
    • /
    • pp.1290-1295
    • /
    • 2017
  • Document classification, the task of assigning a category to each document based on its text, is one of the fundamental areas of natural language processing. Document classification may be used in various fields such as topic classification and sentiment classification. Neural network models for document classification can be divided into two categories: word-level models and character-level models, which treat words and characters as basic units, respectively. In this study, we propose a neural network model that combines character-level and word-level models to improve the performance of document classification. The proposed model extracts the feature vector of each word by combining information obtained from a word embedding matrix with information encoded by a character-level neural network. Based on the feature vectors of words, the model classifies documents with a hierarchical structure wherein recurrent neural networks with attention mechanisms are used at both the word and the sentence levels. Experiments on real-life datasets demonstrate the effectiveness of our proposed model.
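The hybrid feature extraction can be sketched minimally as follows; a mean over character embeddings stands in for the paper's character-level neural network, and the hierarchical attention layers are omitted, so this shows only the word-character combination idea with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"good": 0, "movie": 1}
word_emb = rng.normal(size=(len(vocab), 4))  # word embedding matrix
char_emb = rng.normal(size=(128, 3))         # one row per ASCII code point

def word_feature(word):
    w = word_emb[vocab[word]]                       # word-level information
    c = char_emb[[ord(ch) for ch in word]].mean(0)  # character-level encoding
    return np.concatenate([w, c])                   # hybrid feature vector

doc = np.stack([word_feature(w) for w in ["good", "movie"]])
print(doc.shape)  # (2, 7): 4 word dims + 3 char dims per word
```

In the actual model, the per-word vectors would then feed the word-level and sentence-level recurrent networks with attention.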

Multi-document Summarization Based on Cluster using Term Co-occurrence (단어의 공기정보를 이용한 클러스터 기반 다중문서 요약)

  • Lee, Il-Joo;Kim, Min-Koo
    • Journal of KIISE:Software and Applications
    • /
    • v.33 no.2
    • /
    • pp.243-251
    • /
    • 2006
  • In multi-document summarization by means of salient-sentence extraction, it is important to remove redundant information. In the removal process, the similarities and differences among sentences are considered. In this paper, we propose a method for multi-document summarization that extracts salient sentences without redundancy, by way of a cohesive term clustering method that utilizes co-occurrence information. In the cohesive term clustering method, we assume that terms do not exist independently but rather are related to one another in meaning. To find the relations between terms, we cluster sentences according to topics and use the co-occurrence information of terms within the same topic. We conducted experimental tests with the DUC (Document Understanding Conferences) data. In the tests, our method shows better summarization performance than other summarization methods that use term co-occurrence information based on the term cohesion of document or sentence units, and simple statistical information.
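The co-occurrence statistic underlying the cohesive term clustering can be sketched like this; the sentences are invented, and only the pair-counting step is shown (the clustering and sentence-extraction stages of the method are not reproduced):

```python
from collections import defaultdict
from itertools import combinations

sentences = [
    ["summarization", "extracts", "salient", "sentences"],
    ["salient", "sentences", "avoid", "redundancy"],
    ["summarization", "avoids", "redundancy"],
]

# Count how often each term pair appears in the same sentence,
# a proxy for the "cohesion" between terms.
cooc = defaultdict(int)
for sent in sentences:
    for a, b in combinations(sorted(set(sent)), 2):
        cooc[(a, b)] += 1

# Term pairs with high co-occurrence counts are candidates for the same cluster.
print(cooc[("salient", "sentences")])  # → 2
```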

A Study on Ontology and Topic Modeling-based Multi-dimensional Knowledge Map Services (온톨로지와 토픽모델링 기반 다차원 연계 지식맵 서비스 연구)

  • Jeong, Hanjo
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.4
    • /
    • pp.79-92
    • /
    • 2015
  • Knowledge maps are widely used to represent knowledge in many domains. This paper presents a method of integrating the national R&D data and assisting users in navigating the integrated data through a knowledge map service. The knowledge map service is built using a lightweight ontology and a topic modeling method. The national R&D data are integrated with the research project at the center; i.e., the other R&D data such as research papers, patents, and reports are connected to the research project as its outputs. The lightweight ontology is used to represent the simple relationships between the integrated data, such as project-output relationships, document-author relationships, and document-topic relationships. The knowledge map enables us to infer further relationships such as co-author and co-topic relationships. To extract the relationships between the integrated data, a Relational Data-to-Triples transformer is implemented. Also, a topic modeling approach is introduced to extract the document-topic relationships. A triple store is used to manage and process the ontology data while preserving the network characteristics of the knowledge map service. Knowledge maps can be divided into two types: one is a knowledge map used in the area of knowledge management to store, manage, and process an organization's data as knowledge; the other is a knowledge map for analyzing and representing knowledge extracted from science & technology documents. This research focuses on the latter. In this research, a knowledge map service is introduced for integrating the national R&D data obtained from the National Digital Science Library (NDSL) and the National Science & Technology Information Service (NTIS), the two major repositories and services of national R&D data in Korea. A lightweight ontology is used to design and build the knowledge map. Using the lightweight ontology enables us to represent and process knowledge as a simple network, which fits the knowledge navigation and visualization characteristics of the knowledge map. The lightweight ontology is used to represent the entities and their relationships in the knowledge maps, and an ontology repository is created to store and process the ontology. In the ontologies, researchers are implicitly connected by the national R&D data through author and performer relationships. A knowledge map for displaying researchers' networks is created; the researchers' network is built from the co-authoring relationships of the national R&D documents and the co-participation relationships of the national R&D projects. To sum up, a knowledge map service system based on topic modeling and ontology is introduced for processing knowledge about the national R&D data, such as research projects, papers, patents, project reports, and Global Trends Briefing (GTB) data. The system's goals are 1) to integrate the national R&D data obtained from NDSL and NTIS, 2) to provide semantic and topic-based information search on the integrated data, and 3) to provide knowledge map services based on semantic analysis and knowledge processing. The S&T information such as research papers, research reports, patents, and GTB data is updated daily from NDSL, and the R&D project information, including participants and outputs, is updated from NTIS. The S&T information and the national R&D information are obtained and integrated into the integrated database. The knowledge base is constructed by transforming the relational data into triples referencing the R&D ontology. In addition, a topic modeling method is employed to extract the relationships between the S&T documents and the topic keywords representing the documents. The topic modeling approach enables us to extract the relationships and topic keywords based on semantics, not on simple keywords. Lastly, we show an experiment on the construction of the integrated knowledge base using the lightweight ontology and topic modeling, and the knowledge map services created based on the knowledge base are also introduced.
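The Relational Data-to-Triples transformation can be sketched as follows; the table rows, predicate names, and identifiers are illustrative assumptions, not the paper's actual schema:

```python
# Rows from a hypothetical integrated R&D table: each links a project
# to an output paper and that paper's author.
projects = [
    {"project_id": "P1", "paper_id": "D1", "author": "Kim"},
    {"project_id": "P1", "paper_id": "D2", "author": "Lee"},
]

def rows_to_triples(rows):
    """Turn relational rows into (subject, predicate, object) triples."""
    triples = []
    for r in rows:
        triples.append((r["project_id"], "hasOutput", r["paper_id"]))
        triples.append((r["paper_id"], "hasAuthor", r["author"]))
    return triples

triples = rows_to_triples(projects)
# D1 and D2 share project P1, so they are "co-outputs" -- the kind of
# further relationship the knowledge map can infer over the triples.
print(("P1", "hasOutput", "D1") in triples)  # → True
```

A triple store would index these tuples by subject, predicate, and object so that the network-style navigation the paper describes stays efficient.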

A Study on the Thesaurus Construction Using the Topic Map (토픽맵을 이용한 시소러스의 구조화 연구)

  • Nam, Young-Joon
    • Journal of the Korean Society for information Management
    • /
    • v.22 no.3 s.57
    • /
    • pp.37-53
    • /
    • 2005
  • Terminology management is absolutely necessary for maintaining the efficiency of a thesaurus. This is because the creation, differentiation, disappearance, and other life-cycle processes of descriptors occur dynamically, making effective management of a thesaurus a very difficult task. Therefore, a mechanism is required for constructing and maintaining the thesaurus. This study first proposes methods to structure thesaurus management using the basic elements of a topic map: topic, occurrence, and association. Second, the study proposes methods to represent basic and specific instances using a systematic mapping algorithm and a merging algorithm. Also, using a hub document as a standard, this study presents methods to expand and substitute descriptors using the topic type. A new method applying a fixed concept for the double-layer management of terms is developed as well. The purpose of this method is to fix the conceptual term that represents a concept independent of time and space, and to select descriptors freely according to the external information environment.
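The three topic-map primitives the study builds on (topic, occurrence, association) can be modelled as a minimal data structure; the field names, types, and sample values below are illustrative, not the study's actual design:

```python
from dataclasses import dataclass, field

@dataclass
class Topic:
    name: str
    topic_type: str                                   # e.g. "descriptor"
    occurrences: list = field(default_factory=list)   # resources / documents
    associations: list = field(default_factory=list)  # (relation, Topic) pairs

# Two descriptor topics linked by an association, with one occurrence
# pointing at a (hypothetical) hub document.
metadata = Topic("metadata", "descriptor")
cataloguing = Topic("cataloguing", "descriptor")
metadata.associations.append(("related-term", cataloguing))
metadata.occurrences.append("hub-document-42")

print(len(metadata.associations))  # → 1
```

Descriptor expansion and substitution would then amount to traversing and rewriting these association lists by topic type.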

Futures Price Prediction based on News Articles using LDA and LSTM (LDA와 LSTM를 응용한 뉴스 기사 기반 선물가격 예측)

  • Jin-Hyeon Joo;Keun-Deok Park
    • Journal of Industrial Convergence
    • /
    • v.21 no.1
    • /
    • pp.167-173
    • /
    • 2023
  • Research has been published on predicting future data using regression analysis or artificial intelligence as methods of analyzing economic indicators. In this study, we designed a system that predicts prospective futures prices using artificial intelligence, based on topic probability data obtained from past news articles through topic modeling. Topic probability distribution data for each news article were obtained using Latent Dirichlet Allocation (LDA), a method that can extract the topics of a document from past news articles via unsupervised learning. The topic probability distribution data were then used as input to a Long Short-Term Memory (LSTM) network, a derivative of Recurrent Neural Networks (RNNs), in order to predict prospective futures prices. The method proposed in this study was able to predict the trend of futures prices. In the future, this method may also be able to predict price trends for other derivative products such as options. However, because statistical errors occurred for certain data, further research is required to improve accuracy.
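The shape of the pipeline can be sketched with synthetic data; a least-squares linear model stands in for the paper's LSTM, and the topic vectors are drawn randomly rather than produced by LDA, so this only illustrates how windowed topic distributions become predictor inputs:

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 50, 4                                  # 50 days, 4 topics
topics = rng.dirichlet(np.ones(K), size=T)    # daily topic distributions
# Synthetic price driven by the topic mix plus small noise
price = topics @ np.array([1.0, -0.5, 0.3, 0.0]) + rng.normal(0, 0.01, T)

window = 5
# Each sample is a flattened window of the last `window` days of topics
X = np.stack([topics[t - window + 1:t + 1].ravel()
              for t in range(window - 1, T)])
y = price[window - 1:]

w, *_ = np.linalg.lstsq(X, y, rcond=None)     # fit the stand-in predictor
pred = X @ w
print(np.corrcoef(pred, y)[0, 1] > 0.5)       # captures the price trend
```

An LSTM would consume the same windowed topic sequences, but keep them as a time dimension instead of flattening them.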

Exploring trends in U.N. Peacekeeping Activities in Korea through Topic Modeling and Social Network Analysis (토픽모델링과 사회연결망 분석을 통한 우리나라 유엔 평화유지활동 동향 탐색)

  • Donghyeon Jung;Chansong Kim;Kangmin Lee;Soeun Bae;Yeon Seo;Hyeonju Seol
    • Journal of Korean Society of Industrial and Systems Engineering
    • /
    • v.46 no.4
    • /
    • pp.246-262
    • /
    • 2023
  • The purpose of this study is to identify the major peacekeeping activities that the Korean armed forces have performed from the past to the present. To do this, we collected 692 press releases from the National Defense Daily over the past 20 years and performed topic modeling and social network analysis. The topic modeling analysis derived 112 major keywords and 8 topics, and examining the Korean armed forces' peacekeeping activities based on these topics identified 6 major activities and 2 related matters. The six major activities were 'Northeast Asian defense cooperation', 'multinational force activities', 'civil operations', 'defense diplomacy', 'ceasefire monitoring group', and 'pro-Korean activities', along with 'general troop deployment' related to troop deployment in general. Next, social network analysis was performed to examine the relationships among keywords and the major keywords determining each topic, and the keywords 'overseas', 'dispatch', and 'high level' emerged as central keywords in the network. This study is meaningful in that it is the first to examine the topics of the Korean armed forces' peacekeeping activities over the past 20 years by applying big data techniques to the National Defense Daily, an unstructured document source. In addition, it is expected that the derived topics can be used as a basis for exploring the direction of development of Korea's peacekeeping activities in the future.
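The network-analysis step can be sketched as a keyword co-occurrence graph ranked by degree centrality; the mini-corpus below is invented, not the National Defense Daily data:

```python
from collections import defaultdict
from itertools import combinations

# Each document is its set of extracted keywords (illustrative only)
docs = [
    {"overseas", "dispatch", "unit"},
    {"overseas", "dispatch", "high level"},
    {"dispatch", "ceasefire"},
]

# Build weighted co-occurrence edges between keywords
edges = defaultdict(int)
for d in docs:
    for a, b in combinations(sorted(d), 2):
        edges[(a, b)] += 1

# Degree centrality: number of distinct neighbours per keyword
degree = defaultdict(int)
for (a, b), w in edges.items():
    degree[a] += 1
    degree[b] += 1

central = max(degree, key=degree.get)
print(central)  # 'dispatch' co-occurs with the most distinct keywords
```

The same construction at corpus scale is what surfaces keywords like 'overseas', 'dispatch', and 'high level' as central in the study.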

A Text Mining Study on Endangered Wildlife Complaints - Discovery of Key Issues through LDA Topic Modeling and Network Analysis - (멸종위기 야생생물 민원 텍스트 마이닝 연구 - LDA 토픽 모델링과 네트워크 분석을 통한 주요 이슈 발굴 -)

  • Kim, Na-Yeong;Nam, Hee-Jung;Park, Yong-Su
    • Journal of the Korean Society of Environmental Restoration Technology
    • /
    • v.26 no.6
    • /
    • pp.205-220
    • /
    • 2023
  • This study aimed to analyze the needs and interests of the public regarding endangered wildlife using complaint big data. We collected 1,203 complaints and their corresponding text data on endangered wildlife, pre-processed them, and constructed a document-term matrix for 1,739 text data. We performed LDA (Latent Dirichlet Allocation) topic modeling and network analysis. The results revealed that complaints on endangered wildlife peaked in June-August, and that interest shifted from insects to various endangered wildlife in residential areas, such as mammals, birds, and amphibians. In addition, the complaints could be categorized into 8 topics and 5 clusters, such as discovery reports, habitat protection and response requests, information inquiries, investigation and action requests, and consultation requests. The co-occurrence network analysis for each topic showed that keywords reflecting the call-center reporting procedure, such as photo, send, and take, had high centrality in common, while other keywords such as dung beetle, know, absence, and think played important roles in the network. Through this analysis, we identified the main keywords and their relationships within each topic and derived the main issues for each topic. This study confirmed the increasing and diversifying public interest and complaints regarding endangered wildlife and highlighted the need for professional responses. We also suggested developing and extending participatory conservation plans that align with the public's preferences and demands. This study demonstrated the feasibility of using complaint big data on endangered wildlife and its implications for policy decision-making and public promotion on endangered wildlife.
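The pre-processing step that precedes LDA can be sketched as a document-term matrix construction; the complaint texts below are invented placeholders for the collected complaints:

```python
complaints = [
    "found a dung beetle please send a photo",
    "habitat protection request for endangered birds",
    "photo report of endangered amphibian discovery",
]

# Build the vocabulary and a term -> column index mapping
vocab = sorted({w for c in complaints for w in c.split()})
index = {w: i for i, w in enumerate(vocab)}

# Document-term matrix: one row per complaint, one column per term
dtm = [[0] * len(vocab) for _ in complaints]
for row, c in enumerate(complaints):
    for w in c.split():
        dtm[row][index[w]] += 1

print(dtm[0][index["photo"]])  # → 1: count of 'photo' in complaint 0
```

LDA then takes exactly this kind of count matrix as input and infers per-document topic distributions from it.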

Analyzing the Effect of Characteristics of Dictionary on the Accuracy of Document Classifiers (용어 사전의 특성이 문서 분류 정확도에 미치는 영향 연구)

  • Jung, Haegang;Kim, Namgyu
    • Management & Information Systems Review
    • /
    • v.37 no.4
    • /
    • pp.41-62
    • /
    • 2018
  • As the volume of unstructured data increases through various social media, Internet news articles, and blogs, the importance of text analysis is growing and related studies are increasing. Since text analysis is mostly performed on a specific domain or topic, the importance of constructing and applying a domain-specific dictionary has increased. The quality of a dictionary has a direct impact on the results of unstructured data analysis, and it is all the more important because it presents a perspective for the analysis. In the literature, most studies on text analysis have emphasized the importance of dictionaries for acquiring clean and high-quality results. Unfortunately, however, the effects of dictionaries have not been rigorously verified, even though the dictionary is already known as one of the most essential factors of text analysis. In this paper, we generate three dictionaries in various ways from 39,800 news articles and analyze and verify the effect of each dictionary on the accuracy of document classification by defining the concept of the Intrinsic Rate: 1) a batch construction method, building a dictionary based on the frequency of terms in the entire document set; 2) a method of extracting terms by category and integrating them; 3) a method of extracting features according to each category and integrating them. We compared the accuracy of three artificial neural network-based document classifiers to evaluate the quality of the dictionaries. As a result of the experiment, accuracy tended to increase when the Intrinsic Rate was high, and we found it possible to improve the accuracy of document classification by increasing the intrinsic rate of the dictionary.
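Construction method 1 (batch construction from corpus-wide term frequency) can be sketched as follows; the corpus is invented, and the paper's Intrinsic Rate measure and methods 2 and 3 are not reproduced:

```python
from collections import Counter

# Illustrative mini-corpus standing in for the 39,800 news articles
corpus = [
    "stock market rises on export news",
    "new phone model released this market quarter",
    "export growth lifts stock outlook",
]

# Batch construction: count terms over the entire corpus at once,
# then keep the most frequent terms as the dictionary.
counts = Counter(w for doc in corpus for w in doc.split())
dictionary = [w for w, _ in counts.most_common(5)]  # top-5 terms

print("stock" in dictionary and "market" in dictionary)
```

Methods 2 and 3 would instead build per-category term or feature lists first and merge them, which is what makes the resulting dictionaries differ in the paper's comparison.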

DEVELOPMENTS IN ROBUST STOCHASTIC CONTROL: RISK-SENSITIVE AND MINIMAL COST VARIANCE CONTROL

  • Won, Chang-Hee
    • Institute of Control, Robotics and Systems (ICROS): Conference Proceedings
    • /
    • 1996.10a
    • /
    • pp.107-110
    • /
    • 1996
  • Continuing advances in the formulation and solution of risk-sensitive control problems have reached a point at which this topic is becoming one of the more intriguing modern paradigms of feedback thought. Despite a prevailing atmosphere of close scrutiny of theoretical studies, the risk-sensitive body of knowledge is growing. Moreover, from the point of view of applications, the detailed properties of risk-sensitive design are only now beginning to be worked out. Accordingly, the time seems to be right for a survey of the historical underpinnings of the subject. This paper addresses the beginnings and the evolution, over the first quarter-century or so, and points out the close relationship of the topic with the notion of optimal cost cumulates, in particular the cost variance. It is to be expected that, in due course, some duality will appear between these notions and those in estimation and filtering. The purpose of this document is to help to lay a framework for that eventuality.
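The relationship the survey draws between risk-sensitive control and cost cumulants, in particular the cost variance, can be made concrete with the standard small-parameter expansion of the exponential cost criterion; this is a textbook cumulant-generating-function expansion, not a formula taken from the abstract itself:

```latex
% For a running cost J and risk-sensitivity parameter theta, the
% risk-sensitive criterion is the scaled log-moment-generating function
% of J.  Expanding its cumulant series in theta gives
\[
  J_\theta \;=\; \frac{1}{\theta}\,\log \mathbb{E}\!\left[e^{\theta J}\right]
  \;=\; \mathbb{E}[J] \;+\; \frac{\theta}{2}\,\operatorname{Var}(J)
  \;+\; O(\theta^2),
\]
% so for small theta the criterion penalizes the mean cost plus a
% theta-weighted cost variance -- the link to minimal cost variance
% control that the survey points out.
```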
