• Title/Summary/Keyword: information extraction

An Efficient Damage Information Extraction from Government Disaster Reports

  • Shin, Sungho;Hong, Seungkyun;Song, Sa-Kwang
    • Journal of Internet Computing and Services, v.18 no.6, pp.55-63, 2017
  • One of the purposes of Information Technology (IT) is to support the human response to natural and social problems, such as natural disasters and the spread of disease, and to improve the quality of human life. Climate change is now occurring worldwide, natural disasters threaten the quality of life, and human safety is no longer guaranteed. IT must be able to support disaster-response tasks and, more importantly, should be used to predict and minimize future damage. In South Korea, damage data are first surveyed by each local government and then aggregated by the central government. These data appear in the disaster reports that the government discloses for each disaster case, but the raw damage data are difficult to obtain, even for research purposes. One way to obtain such data is to apply information extraction to the disaster reports. In the information extraction literature, most extraction targets are web documents, commercial reports, SNS text, and so on; there is little research on extracting information from government disaster reports. These reports are mostly text, but their sentence structure differs greatly from that of news articles and commercial reports, so their features must be carefully considered. This paper presents an information extraction method for South Korean government reports in word-processor format. The method is based on patterns and dictionaries, and it adds some ideas for tokenizing the damage expressions in the text. The experiment yields an F1 score of 80.2 on the test set, which is close to the state-of-the-art information extraction performance achieved before the adoption of recent deep learning algorithms.
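
As a rough illustration of the pattern-and-dictionary approach the paper describes, the sketch below matches damage expressions (a damage keyword followed by a number and a unit) against a small dictionary. The dictionary entries, pattern, and report fragment are illustrative assumptions, not the authors' actual resources.

```python
import re

# Illustrative dictionary mapping Korean damage keywords to damage types;
# the paper's real dictionary is built from government-report vocabulary.
DAMAGE_TYPES = {"이재민": "victims", "침수": "flooded area", "사망": "deaths"}

# Pattern: a damage keyword, then a number, then a unit (persons/ha/buildings).
PATTERN = re.compile(
    r"(?P<type>{})\s*(?P<value>[\d,.]+)\s*(?P<unit>명|ha|동)".format(
        "|".join(DAMAGE_TYPES)))

def extract_damage(text):
    """Return (damage type, value, unit) tuples found in a report sentence."""
    return [(DAMAGE_TYPES[m.group("type")],
             float(m.group("value").replace(",", "")),
             m.group("unit"))
            for m in PATTERN.finditer(text)]

# Hypothetical report fragment: "1,200 displaced persons, 34 ha flooded".
print(extract_damage("이재민 1,200명, 침수 34 ha"))
```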

Distributed Information Extraction in Wireless Sensor Networks using Multiple Software Agents with Dynamic Itineraries

  • Gupta, Govind P.;Misra, Manoj;Garg, Kumkum
    • KSII Transactions on Internet and Information Systems (TIIS), v.8 no.1, pp.123-144, 2014
  • Wireless sensor networks are generally deployed for specific applications to accomplish certain objectives over a period of time. To fulfill these objectives, it is crucial that the sensor network continue to function for a long time, even if some of its nodes become faulty. Energy efficiency and fault tolerance are undoubtedly the most crucial requirements in the design of an information extraction protocol for any sensor network application. However, most existing software-agent-based information extraction protocols are incapable of satisfying these requirements because of static agent itineraries and large agent sizes. This paper proposes an Information Extraction protocol based on Multiple software Agents with Dynamic Itineraries (IEMADI), where multiple software agents are dispatched in parallel to perform tasks based on the queries assigned to them. IEMADI decides the itinerary for an agent dynamically at each hop, using local information. Through mathematical analysis and simulation, we compare the performance of IEMADI with a well-known static-itinerary-based protocol with respect to energy consumption and response time. The results show that IEMADI performs better than the static-itinerary-based protocols.
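
The core of a dynamic itinerary is that each hop is chosen on the fly from local information rather than being fixed in advance. The sketch below illustrates this with a greedy next-hop rule that trades off distance against residual neighbor energy; the cost function and network data are illustrative assumptions, not the IEMADI protocol itself.

```python
import math

def next_hop(current, unvisited, energy, pos):
    """Pick the neighbor that balances residual energy and distance."""
    def cost(node):
        d = math.dist(pos[current], pos[node])
        return d / max(energy[node], 1e-9)   # prefer close, high-energy nodes
    return min(unvisited, key=cost)

def plan_itinerary(start, targets, energy, pos):
    """Visit all target nodes, deciding each hop dynamically."""
    route, current, remaining = [start], start, set(targets)
    while remaining:
        nxt = next_hop(current, remaining, energy, pos)
        remaining.discard(nxt)
        route.append(nxt)
        current = nxt
    return route

# Toy network: node positions and residual energy levels.
pos = {0: (0, 0), 1: (1, 0), 2: (2, 1), 3: (0, 2)}
energy = {0: 1.0, 1: 0.9, 2: 0.4, 3: 0.8}
print(plan_itinerary(0, [1, 2, 3], energy, pos))
```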

Survey of Temporal Information Extraction

  • Lim, Chae-Gyun;Jeong, Young-Seob;Choi, Ho-Jin
    • Journal of Information Processing Systems, v.15 no.4, pp.931-956, 2019
  • Documents contain information that can be used in various applications, such as question answering (QA), information retrieval (IR), and recommendation systems. To use this information, methods are needed for extracting it from documents written in natural language. There are several kinds of information (e.g., temporal, spatial, and semantic role information), and each kind is extracted with different methods. In this paper, existing studies on methods of extracting temporal information are surveyed, and several related issues are discussed: the task boundary of temporal information extraction, the history of annotation languages and shared tasks, open research issues, applications that use temporal information, and evaluation metrics. Although the history of temporal information extraction tasks is not long, many studies have tried various methods. This paper indicates which approaches are known to be better at extracting particular parts of the temporal information, and also provides a future research direction.
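
As a concrete taste of the rule-based end of this literature, the sketch below tags a few temporal expressions with TimeML-style <TIMEX3> markup. The patterns are illustrative toys, far simpler than those of any surveyed system.

```python
import re

# A toy temporal-expression tagger: ISO dates, "14 July 2019"-style dates,
# and a few relative expressions. Real systems use far richer grammars.
TIMEX = re.compile(
    r"\b(\d{4}-\d{2}-\d{2}"
    r"|\d{1,2}\s+(?:January|February|March|April|May|June|July|August"
    r"|September|October|November|December)\s+\d{4}"
    r"|yesterday|today|tomorrow|next\s+week|last\s+month)\b",
    re.IGNORECASE)

def tag_timex(sentence):
    """Wrap each temporal expression in <TIMEX3> tags (TimeML style)."""
    return TIMEX.sub(lambda m: f"<TIMEX3>{m.group(0)}</TIMEX3>", sentence)

print(tag_timex("The meeting moved from 2019-07-14 to next week."))
```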

Development of Information Extraction System from Multi Source Unstructured Documents for Knowledge Base Expansion (지식베이스 확장을 위한 멀티소스 비정형 문서에서의 정보 추출 시스템의 개발)

  • Choi, Hyunseung;Kim, Mintae;Kim, Wooju;Shin, Dongwook;Lee, Yong Hun
    • Journal of Intelligence and Information Systems, v.24 no.4, pp.111-136, 2018
  • In this paper, we propose a methodology for extracting answer information for queries from various types of unstructured documents collected from multiple web sources, in order to expand a knowledge base. The proposed methodology consists of the following steps: 1) collect documents relevant to a "subject-predicate" query from Wikipedia, Naver encyclopedia, and Naver news, and classify the suitable documents; 2) determine whether each sentence is suitable for information extraction and derive a confidence score; 3) based on predicate features, extract the information from suitable sentences and derive the overall confidence of the extraction result. To evaluate the performance of the information extraction system, we selected 400 queries posed to SK Telecom's artificial intelligence speaker; the proposed model shows higher performance indices than the baseline model.

    The contribution of this study is a sequence tagging model based on a bi-directional LSTM-CRF that uses the predicate features of the query, yielding a robust model that maintains high recall across the various types of unstructured documents collected from multiple sources. Information extraction for knowledge base expansion must account for the heterogeneous characteristics of source-specific document types, and the proposed methodology proved to extract information effectively from such documents compared to the baseline model; previous research suffered poor performance when extracting from document types different from the training data. In addition, by predicting the suitability of documents and sentences before the extraction step, this study prevents unnecessary extraction attempts on documents that do not contain the answer, so precision can be maintained even in a real web environment. Because the target is unstructured documents on the real web, there is no guarantee that a document contains the correct answer; previous machine reading comprehension studies show low precision in this setting because they frequently attempt to extract an answer even from documents without one. The policy of predicting document and sentence suitability therefore contributes to maintaining extraction performance on the real web.

    The limitations of this study and future research directions are as follows. First, data preprocessing: the unit of knowledge extraction is determined through morphological analysis based on the open-source KoNLPy Python package, and extraction can go wrong when the morphological analysis is incorrect; a more advanced morphological analyzer is needed to improve the extraction results. Second, entity ambiguity: the system cannot distinguish entities that share the same name but refer to different things, so if several people with the same name appear in the news, it may not extract information about the intended query; future research needs measures for identifying the intended person among namesakes. Third, evaluation query data: we selected 400 user queries collected from SK Telecom's interactive artificial intelligence speaker and built an evaluation dataset of 2,800 documents (400 queries × 7 articles per query: 1 from Wikipedia, 3 from Naver encyclopedia, 3 from Naver news), manually judging whether each document contains the correct answer. To ensure the external validity of the study, it is desirable to evaluate the system on more queries, but this is a costly manual activity; future research should evaluate the system on more queries, and a Korean benchmark dataset for information extraction over multi-source web documents is also needed so that results can be evaluated more objectively.
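
A minimal sketch of the three-step design described above, with stand-in scoring functions: in the paper these steps are learned models, and the extractor is a bi-directional LSTM-CRF over predicate features, so everything below is an illustrative assumption.

```python
import re

# Stand-in for steps 1-2: attempt extraction only when the predicate's
# lexical cue appears, to skip documents unlikely to contain the answer.
def document_suitable(doc, predicate):
    return predicate in doc

# Stand-in for step 3: take the first number or capitalized word after the
# predicate as the answer span; the fixed 0.9 mimics a model confidence.
def extract_answer(sentence, subject, predicate):
    if subject not in sentence or predicate not in sentence:
        return None, 0.0
    tail = sentence.split(predicate, 1)[1]
    m = re.search(r"\b(\d{4}|[A-Z][a-z]+)\b", tail)
    return (m.group(0), 0.9) if m else (None, 0.0)

docs = ["BTS debuted in 2013 under Big Hit.", "An unrelated article."]
for doc in docs:
    if document_suitable(doc, "debuted"):      # unsuitable docs are skipped
        print(extract_answer(doc, "BTS", "debuted"))
```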

Automatic Generation of Information Extraction Rules Through User-interface Agents (사용자 인터페이스 에이전트를 통한 정보추출 규칙의 자동 생성)

  • Kim, Yong-Gi;Yang, Jae-Young;Choi, Joong-Min
    • Journal of KIISE: Software and Applications, v.31 no.4, pp.447-456, 2004
  • Information extraction is the process of recognizing and fetching particular information fragments from a document. In order to extract information uniformly from many heterogeneous information sources, it is necessary to produce information extraction rules, called a wrapper, for each source. Previous methods of information extraction can be categorized into manual and automatic wrapper generation. In the manual method, the wrapper is generated by a human expert who analyzes documents and writes rules, so the precision of the wrapper is very high, but the approach has problems with scalability and efficiency. In the automatic method, an agent program analyzes a set of example documents and produces a wrapper through learning. Although this is very scalable, it has difficulty generating correct rules, and the generated rules are sometimes unreliable. This paper combines the manual and automatic methods by proposing a new way of learning information extraction rules. We adopt a supervised learning scheme in which a user-interface agent is designed to obtain from the user what should be extracted from a document, and XML-based information extraction rules are then generated through learning from these inputs. The interface agent is used not only to generate new extraction rules but also to modify and extend existing ones, enhancing the precision and recall of the extraction system. A series of experiments to test the system produced very promising results. We hope that our system can be applied to practical systems such as information-mediator agents.
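
In the spirit of the supervised scheme above, where the interface agent collects user-marked fragments and rules are learned from them, the sketch below induces a simple left/right-delimiter wrapper from labeled examples. It is a minimal sketch under stated assumptions; real systems, including the one described, emit richer XML-based rules.

```python
def learn_lr_rule(examples):
    """examples: list of (page, target) pairs, the target being the fragment
    the user marked. Returns the longest left/right context strings shared
    by every labeled occurrence."""
    def contexts(page, target):
        i = page.index(target)
        return page[:i], page[i + len(target):]
    lefts, rights = zip(*(contexts(p, t) for p, t in examples))
    left = lefts[0]                 # longest common suffix of left contexts
    while left and not all(l.endswith(left) for l in lefts):
        left = left[1:]
    right = rights[0]               # longest common prefix of right contexts
    while right and not all(r.startswith(right) for r in rights):
        right = right[:-1]
    return left, right

def apply_rule(page, rule):
    """Extract the fragment between the learned delimiters."""
    left, right = rule
    start = page.index(left) + len(left)
    return page[start:page.index(right, start)]

pages = [("<b>Price:</b> 12.50 <i>USD</i>", "12.50"),
         ("<b>Price:</b> 7.00 <i>USD</i>", "7.00")]
rule = learn_lr_rule(pages)
print(rule, apply_rule("<b>Price:</b> 99.99 <i>USD</i>", rule))
```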

A Distance Approach for Open Information Extraction Based on Word Vector

  • Liu, Peiqian;Wang, Xiaojie
    • KSII Transactions on Internet and Information Systems (TIIS), v.12 no.6, pp.2470-2491, 2018
  • Web-scale open information extraction (Open IE) plays an important role in NLP tasks such as acquiring common-sense knowledge, learning selectional preferences, and automatic text understanding. A large number of Open IE approaches have been proposed in the last decade, the majority based on supervised learning or dependency parsing. In this paper, we present a novel method for web-scale open information extraction that employs cosine distance over Google word vectors as the confidence score of an extraction. The proposed method is a purely unsupervised learning algorithm requiring neither hand-labeled training data nor dependency parse features. We also present a mathematically rigorous justification of the new method using Bayesian inference and artificial neural network theory: the proposed algorithm turns out to be equivalent to maximum likelihood estimation of the joint probability distribution over the elements of a candidate extraction. The proof itself also suggests a typical way to use word vectors in other NLP tasks. Experiments show that the distance-based method improves further on recently presented Open IE systems on three benchmark datasets, in terms of both effectiveness and efficiency.
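
A sketch of the distance-based confidence idea: score a candidate (arg1, relation, arg2) extraction by cosine similarity among its elements' word vectors. The embeddings below are tiny random stand-ins for pretrained vectors such as word2vec ("Google word vectors"), and the averaging scheme is an illustrative assumption rather than the paper's exact formula.

```python
import numpy as np

# Random 50-dimensional stand-ins for pretrained word embeddings.
rng = np.random.default_rng(0)
EMB = {w: rng.normal(size=50) for w in
       ["einstein", "developed", "relativity", "banana"]}

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def confidence(arg1, rel, arg2):
    """Average pairwise similarity among the triple's elements."""
    a, r, b = EMB[arg1], EMB[rel], EMB[arg2]
    return (cos(a, r) + cos(r, b) + cos(a, b)) / 3

print(confidence("einstein", "developed", "relativity"))
print(confidence("einstein", "developed", "banana"))
```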

Visualization for Digesting a High Volume of the Biomedical Literature

  • Lee, Chang-Su;Park, Jin-Ah;Park, Jong-C.
    • Bioinformatics and Biosystems, v.1 no.1, pp.51-60, 2006
  • The paradigm in biology is currently shifting from conducting hypothesis-driven individual experiments to utilizing the results of massive data analyses with appropriate computational tools. We present LayMap, an implemented visualization system that helps the user deal with a high volume of biomedical literature, such as MEDLINE, through layered maps constructed on the results of an information extraction system. LayMap also uses filtering and granularity for an enhanced view of the results. Since a biomedical information extraction system slices up the data space in a focused and effective way, the combined use of LayMap with such a system helps the user navigate the data space in a speedy and guided manner. As a case study, we applied the system to datasets of journal abstracts on 'MAPK pathway' and 'bufalin' from MEDLINE. With the proposed visualization, we successfully rediscovered pathway maps of reasonable quality for ERK, p38, and JNK. Furthermore, for bufalin, we identified a potentially interesting relation between the Chinese medicine Chan su and apoptosis at a high level of detail.

An Ontology-based Knowledge Management System - Integrated System of Web Information Extraction and Structuring Knowledge -

  • Mima, Hideki;Matsushima, Katsumori
    • Proceedings of the CALSEC Conference, 2005.03a, pp.55-61, 2005
  • We introduce a new web-based knowledge management system under development, in which XML-based web information extraction and our structuring-knowledge technologies are combined using ontology-based natural language processing. Our aim is to provide efficient access to heterogeneous information on the web, enabling users to draw on a wide range of textual and non-textual resources, such as newspapers and databases, to accelerate knowledge acquisition from such sources. To achieve efficient knowledge management, we first propose an XML-based web information extraction component that contains a sophisticated control language for extracting data from web pages. By using standard XML technologies, our approach makes information extraction easier because of a) the detachment of rules from processing, b) the restriction of the processing target, and c) interactive operations for developing extraction rules. We then propose a structuring-knowledge system that includes 1) automatic term recognition, 2) domain-oriented automatic term clustering, 3) similarity-based document retrieval, 4) real-time document clustering, and 5) visualization. The system supports integrating different types of databases (textual and non-textual) and retrieving different types of information simultaneously. Through further explanation of the system's specification and implementation, we demonstrate how it can accelerate knowledge acquisition on the web, even for novice users of the field.
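
A sketch of what "detaching rules from processing" can look like: extraction rules live in an XML document, and a generic engine applies them to a page, so rules can change without touching code. The rule format below is an illustrative assumption, not the paper's actual control language.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule file: each rule names a field and gives a path to it.
RULES = ET.fromstring("""
<rules>
  <rule name="headline" path=".//article/title"/>
  <rule name="date"     path=".//article/date"/>
</rules>""")

# Hypothetical target page, already parsed as XML/XHTML.
PAGE = ET.fromstring("""
<html><article><title>Ontology-based KM</title>
<date>2005-03-01</date></article></html>""")

def run_rules(page, rules):
    """Apply every declared rule; editing the rule file changes behavior."""
    return {r.get("name"): [e.text for e in page.findall(r.get("path"))]
            for r in rules.findall("rule")}

print(run_rules(PAGE, RULES))
```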

Korean Spatial Information Extraction using Bi-LSTM-CRF Ensemble Model (Bi-LSTM-CRF 앙상블 모델을 이용한 한국어 공간 정보 추출)

  • Min, Tae Hong;Shin, Hyeong Jin;Lee, Jae Sung
    • The Journal of the Korea Contents Association, v.19 no.11, pp.278-287, 2019
  • Spatial information extraction retrieves the static and dynamic spatial aspects of natural language text by explicitly marking spatial elements and their relational words. This paper proposes a deep learning approach to Korean spatial information extraction using a two-step bidirectional LSTM-CRF ensemble model, together with an integrated model for spatial element extraction and spatial relation attribute extraction. An experiment on the Korean SpaceBank corpus demonstrates that the proposed deep learning model is more effective than the previous CRF model, and that the proposed ensemble model outperforms the single model.
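
A minimal sketch of the ensemble step: several taggers label the same token sequence, and the labels are merged by majority vote. The label sequences below are hard-coded stand-ins for trained Bi-LSTM-CRF outputs, and the voting rule is an illustrative assumption about how the ensemble combines them.

```python
from collections import Counter

def vote(predictions):
    """predictions: list of label sequences, one per model.
    Returns the per-token majority label."""
    return [Counter(labels).most_common(1)[0][0]
            for labels in zip(*predictions)]

model_outputs = [
    ["B-PLACE", "O", "B-PATH", "O"],   # model 1
    ["B-PLACE", "O", "O",      "O"],   # model 2
    ["B-PLACE", "O", "B-PATH", "O"],   # model 3
]
print(vote(model_outputs))  # -> ['B-PLACE', 'O', 'B-PATH', 'O']
```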

Extraction of Geometric Information on Highway Using Terrestrial Laser Scanning Technology (지상 레이저 스캐닝 기술을 이용한 도로 기하정보 추출)

  • Lee, Jong-Chool;Lee, Byung-Gul;Kim, Jin-Soo
    • Proceedings of the Korean Society of Surveying, Geodesy, Photogrammetry, and Cartography Conference, 2007.04a, pp.379-382, 2007
  • Laser scanning technology, with its high positional accuracy and highly automated, dense data capture, is being widely applied in a vast range of fields, including geomatics. In particular, the development of laser scanning capable of long-range information extraction is increasing its use in civil engineering. The purpose of this study is to extract accurate highway geometric information by taking advantage of scanning technology. To this end, three-dimensional data of the target highway were obtained through terrestrial laser scanning. The results of extracting the target highway's geometric information from these data show that laser scanning achieves faster speed and better accuracy at reduced cost compared to traditional methods.
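
As one example of deriving a geometric parameter from scan data, the sketch below fits a circle to synthetic centerline points to estimate a horizontal curve radius. A real workflow would first extract the centerline from the terrestrial laser scan; this least-squares fit is an illustrative assumption, not the study's method.

```python
import numpy as np

def fit_circle(x, y):
    """Kasa algebraic fit: solve x^2 + y^2 = 2ax + 2by + c in least squares,
    where the circle is (x-a)^2 + (y-b)^2 = r^2 and c = r^2 - a^2 - b^2."""
    A = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
    b = x**2 + y**2
    (cx, cy, c), *_ = np.linalg.lstsq(A, b, rcond=None)
    radius = np.sqrt(c + cx**2 + cy**2)
    return (cx, cy), radius

# Synthetic noisy points along a 300 m-radius arc of highway centerline.
theta = np.linspace(0.1, 0.6, 50)
x = 300 * np.cos(theta) + np.random.normal(0, 0.02, 50)
y = 300 * np.sin(theta) + np.random.normal(0, 0.02, 50)
print(fit_circle(x, y))   # estimated radius should be close to 300 m
```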
