• Title/Summary/Keyword: Automatic Data Extraction

Export Control System based on Case Based Reasoning: Design and Evaluation (사례 기반 지능형 수출통제 시스템 : 설계와 평가)

  • Hong, Woneui;Kim, Uihyun;Cho, Sinhee;Kim, Sansung;Yi, Mun Yong;Shin, Donghoon
    • Journal of Intelligence and Information Systems
    • /
    • v.20 no.3
    • /
    • pp.109-131
    • /
    • 2014
  • As the demand for nuclear power plant equipment continues to grow worldwide, the importance of handling nuclear strategic materials is also increasing. While the number of cases submitted for the export of nuclear-power commodities and technology is increasing dramatically, preadjudication (or, put simply, prescreening) of strategic materials has so far been done by experts with long experience and extensive field knowledge. However, there is a severe shortage of experts in this domain, and it takes a long time to develop one. Because human experts must manually evaluate all the documents submitted for export permission, the current practice of nuclear material export is neither time-efficient nor cost-effective. To alleviate the problem of relying solely on costly human experts, our research proposes a new system designed to help field experts make their decisions more effectively and efficiently. The proposed system is built upon case-based reasoning, which in essence extracts key features from existing cases, compares them with the features of a new case, and derives a solution for the new case by referencing similar cases and their solutions. Our research proposes a framework for a case-based reasoning system, designs a case-based reasoning system for the control of nuclear material exports, and evaluates the performance of alternative keyword extraction methods (fully automatic, fully manual, and semi-automatic). A keyword extraction method is an essential component of the case-based reasoning system because it is used to extract the key features of the cases. The fully automatic method was conducted using TF-IDF, a widely used de facto standard method for representative keyword extraction in text mining. 
TF (Term Frequency) is based on the frequency count of a term within a document and shows how important the term is within that document, while IDF (Inverse Document Frequency) is based on the infrequency of the term within a document set and shows how uniquely the term represents the document. The results show that the semi-automatic approach, which is based on the collaboration of machine and human, is the most effective solution regardless of whether the human is a field expert or a student majoring in nuclear engineering. Moreover, we propose a new approach to computing nuclear document similarity along with a new framework for document analysis. The proposed algorithm for nuclear document similarity considers both document-to-document similarity (${\alpha}$) and document-to-nuclear system similarity (${\beta}$) in order to derive the final score (${\gamma}$) for deciding whether the presented case involves strategic material or not. The final score (${\gamma}$) represents the document similarity between the past cases and the new case. The score is derived not only from conventional TF-IDF but also from a nuclear system similarity score, which takes the context of the nuclear system domain into account. Finally, the system retrieves the top-3 documents stored in the case base that are considered most similar to the new case and presents them with their degree of credibility. With this final score and the credibility score, it becomes easier for a user to see which documents in the case base are more worth looking up, so that the user can make a proper decision at relatively lower cost. The system was evaluated by developing a prototype and testing it with field data. The system workflows and outcomes have been verified by field experts. 
This research is expected to contribute to the growth of the knowledge service industry by proposing a new system that can effectively reduce the burden of relying on costly human experts for the export control of nuclear materials and that can be considered a meaningful example of a knowledge service application.
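
The TF-IDF weighting and top-3 retrieval described in this abstract can be illustrated with a minimal sketch. It covers only a plain document-to-document similarity (cosine similarity over TF-IDF vectors, which is an assumption; the abstract does not name the similarity measure). The nuclear-system similarity and the way the two scores are combined into the final score are not specified in the abstract and are not modeled here. The function names and the sample case base are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    TF is the relative frequency of the term in the document;
    IDF is log(N / number of documents containing the term).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_similar(case_base, new_case, k=3):
    """Return the k (score, index) pairs of past cases most similar to new_case."""
    vectors = tfidf_vectors(case_base + [new_case])
    query = vectors[-1]
    scored = [(cosine(query, vec), i) for i, vec in enumerate(vectors[:-1])]
    return sorted(scored, reverse=True)[:k]


# Hypothetical case base: each past case is a list of extracted keywords.
case_base = [
    "reactor coolant pump seal".split(),
    "zirconium alloy fuel cladding tube".split(),
    "office desk chair".split(),
    "reactor pressure vessel head".split(),
]
new_case = "reactor coolant pump motor".split()

for score, index in top_k_similar(case_base, new_case):
    print(index, round(score, 3), " ".join(case_base[index]))
```

Terms that occur in every document receive an IDF of zero under this formula, so they do not contribute to the similarity; a smoothed IDF variant would be a reasonable alternative.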

Accelerated Learning of Latent Topic Models by Incremental EM Algorithm (점진적 EM 알고리즘에 의한 잠재토픽모델의 학습 속도 향상)

  • Chang, Jeong-Ho;Lee, Jong-Woo;Eom, Jae-Hong
    • Journal of KIISE:Software and Applications
    • /
    • v.34 no.12
    • /
    • pp.1045-1055
    • /
    • 2007
  • Latent topic models are statistical models that automatically capture, in a probabilistic way, salient patterns or correlations among features underlying a data collection. They are gaining popularity as an effective tool for automatic semantic feature extraction from text corpora, multimedia data analysis including image data, and bioinformatics. One of the important issues in applying latent topic models to massive data sets is efficient learning of the model. The paper proposes an accelerated learning technique for the PLSA model, one of the popular latent topic models, that uses an incremental EM algorithm instead of the conventional EM algorithm. The incremental EM algorithm is characterized by a series of partial E-steps performed on corresponding subsets of the entire data collection, unlike the conventional EM algorithm, where one batch E-step is done for the whole data set. By replacing a single batch E-M step with a series of partial E-steps and M-steps, the inference result for the previous data subset is directly reflected in the next inference process, which can enhance the learning speed for the entire data set. The algorithm is also advantageous in that it is guaranteed to converge to a local maximum solution and can be implemented with only slight modification of an existing algorithm based on conventional EM. We present the basic application of the incremental EM algorithm to the learning of PLSA and empirically evaluate the acceleration performance with several possible data partitioning methods for practical application. The experimental results on a real-world news data set show that the proposed approach can achieve a meaningful enhancement of the convergence rate in the learning of the latent topic model. 
Additionally, we present an interesting result which supports a possible synergistic effect of the combination of incremental EM algorithm with parallel computing.
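
The idea of replacing one batch E-step with partial E-steps over data subsets can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the model is PLSA with P(w|d) = sum over z of P(z|d) P(w|z), the data are a small dense document-by-word count matrix, and the partition into blocks is supplied by the caller. All function and variable names are hypothetical.

```python
import math
import random


def normalize(values):
    """Scale non-negative values to sum to 1 (uniform if they are all zero)."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def log_likelihood(counts, theta, phi):
    """Sum of n(d,w) * log P(w|d), with P(w|d) = sum_z P(z|d) * P(w|z)."""
    total = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n:
                p = sum(t * phi_z[w] for t, phi_z in zip(theta[d], phi))
                total += n * math.log(p)
    return total


def incremental_em_plsa(counts, n_topics, blocks, sweeps=10, seed=0):
    """Fit PLSA with a partial E-step per block, each followed by an M-step."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    theta = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]            # theta[d][z] = P(z|d)
    phi = [normalize([rng.random() + 0.1 for _ in range(n_words)])
           for _ in range(n_topics)]            # phi[z][w] = P(w|z)
    # contrib[d][z][w]: expected count n(d,w) * P(z|d,w) currently credited to doc d.
    contrib = [[[0.0] * n_words for _ in range(n_topics)] for _ in range(n_docs)]
    # stats[z][w]: sum of contrib over all documents (sufficient statistics for phi).
    stats = [[0.0] * n_words for _ in range(n_topics)]

    def partial_e_step(d):
        for w, n in enumerate(counts[d]):
            if n == 0:
                continue
            post = normalize([theta[d][z] * phi[z][w] for z in range(n_topics)])
            for z in range(n_topics):
                new = n * post[z]
                # Swap this document's old contribution for the new one;
                # the clamp only guards against floating-point rounding.
                stats[z][w] = max(0.0, stats[z][w] + new - contrib[d][z][w])
                contrib[d][z][w] = new
        # P(z|d) depends only on document d, so it is refreshed immediately.
        theta[d] = normalize([sum(contrib[d][z]) for z in range(n_topics)])

    def m_step():
        for z in range(n_topics):
            phi[z] = normalize(stats[z])

    history = [log_likelihood(counts, theta, phi)]
    for d in range(n_docs):      # one initial pass so stats cover every document
        partial_e_step(d)
    m_step()
    for _ in range(sweeps):
        for block in blocks:     # partial E-step on one subset, then an M-step
            for d in block:
                partial_e_step(d)
            m_step()
        history.append(log_likelihood(counts, theta, phi))
    return theta, phi, history


# Hypothetical 6-document, 6-word count matrix split into two blocks.
counts = [
    [4, 3, 2, 0, 0, 1],
    [3, 4, 1, 1, 0, 0],
    [5, 2, 3, 0, 1, 0],
    [0, 1, 0, 4, 3, 3],
    [1, 0, 0, 3, 5, 2],
    [0, 0, 1, 2, 4, 4],
]
theta, phi, history = incremental_em_plsa(counts, 2, [[0, 1, 2], [3, 4, 5]])
print([round(v, 3) for v in history])
```

Because the topic-word distribution is refreshed after every block, the second block already sees parameters updated from the first, which is the mechanism the abstract credits for faster learning. The dense `contrib` array is only practical for toy data; a real implementation would store per-document statistics sparsely.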

A Study on the Integration of Airborne LiDAR and UAV Data for High-resolution Topographic Information Construction of Tidal Flat (갯벌지역 고해상도 지형정보 구축을 위한 항공 라이다와 UAV 데이터 통합 활용에 관한 연구)

  • Kim, Hye Jin;Lee, Jae Bin;Kim, Yong Il
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography
    • /
    • v.38 no.4
    • /
    • pp.345-352
    • /
    • 2020
  • To preserve and restore tidal flats and to prevent safety accidents, it is necessary to construct tidal flat topographic information that includes the exact location and shape of tidal creeks. In tidal flats, where field surveying is difficult to apply, airborne LiDAR surveying can provide accurate terrain data for a wide area. UAV (Unmanned Aerial Vehicle) surveying, on the other hand, can economically provide relatively high-resolution data. In this study, we propose a methodology to effectively generate high-resolution topographic information of tidal flats by integrating airborne LiDAR and UAV point clouds. For this purpose, automatic ICP (Iterative Closest Point) registration between the two datasets was conducted, and tidal creeks were extracted by applying the CSF (Cloth Simulation Filtering) algorithm. Then, we integrated high-density UAV data for tidal creeks with airborne LiDAR data for flat ground. A DEM (Digital Elevation Model) and the tidal flat area and depth were generated from the integrated data to construct high-resolution topographic information for large-scale tidal flat map creation. As a result, the UAV data was registered without GCPs (Ground Control Points), and integrated data containing detailed topographic information of tidal creeks was generated with a relatively small data size.

A PageRank based Data Indexing Method for Designing Natural Language Interface to CRM Databases (분석 CRM 실무자의 자연어 질의 처리를 위한 기업 데이터베이스 구성요소 인덱싱 방법론)

  • Park, Sung-Hyuk;Hwang, Kyeong-Seo;Lee, Dong-Won
    • CRM연구
    • /
    • v.2 no.2
    • /
    • pp.53-70
    • /
    • 2009
  • Understanding consumer behavior through the analysis of customer data is an essential part of analytic CRM. This requires users to have analytic skills for data extraction and data processing. Because a user has various kinds of questions for consumer data analysis, the user must use a database language such as SQL. However, generating SQL statements is not easy for a firm's users, because the accuracy of the query result is heavily influenced by knowledge of work-site operations and of the firm's database. This paper proposes a natural-language-based database search framework that finds relevant database elements. Specifically, we describe how our TableRank method can understand the user's natural-language query and provide the proper relations and attributes of data records to the user. Several experiments support that TableRank provides accurate database elements related to the user's natural-language query. We also show that a close distance among relations in the database represents high data connectivity, which guarantees matching with a user's search query.
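
The abstract does not describe how TableRank is computed, so the following is only a sketch of the underlying PageRank idea applied to a database schema: tables are nodes, foreign-key links are edges, and well-connected tables receive higher scores. The schema, the table names, and the choice to treat links as bidirectional are all assumptions made for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over {node: [linked nodes]}."""
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            # A node without outgoing links spreads its rank over all nodes.
            targets = graph[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical CRM schema; each foreign-key link is listed in both directions.
schema = {
    "customer": ["orders", "segment"],
    "orders": ["customer", "order_item"],
    "order_item": ["orders", "product"],
    "product": ["order_item"],
    "segment": ["customer"],
}

for table, score in sorted(pagerank(schema).items(), key=lambda kv: -kv[1]):
    print(f"{table:12s} {score:.3f}")
```

In a natural-language interface, scores like these could serve as a prior when several tables match a query term, with the query match itself handled separately.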

Automatic Extraction of Initial Training Data Using National Land Cover Map and Unsupervised Classification and Updating Land Cover Map (국가토지피복도와 무감독분류를 이용한 초기 훈련자료 자동추출과 토지피복지도 갱신)

  • Soungki, Lee;Seok Keun, Choi;Sintaek, Noh;Noyeol, Lim;Juweon, Choi
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography
    • /
    • v.33 no.4
    • /
    • pp.267-275
    • /
    • 2015
  • Land cover maps have been widely used in various fields, such as environmental studies, military strategy, and decision-making. This study proposes a method to automatically extract training data and classify land cover using single satellite images and the national land cover maps provided by the Ministry of Environment. For this purpose, three sources were used for the initial training data: the unsupervised classification, ISODATA, and the existing land cover maps. Classes were classified and named automatically using the class information in the existing land cover maps, to overcome the difficulty of selecting and naming classes in unsupervised classification and to ease the difficulty of selecting training data in supervised classification. The extracted initial training data were used as the training data of MLC for the land cover classification of target satellite images, which increases the accuracy over unsupervised classification. Finally, the land cover maps could be extracted from training data updated through an iterative method. In addition, to reduce the salt-and-pepper noise that occurs in pixel-based classification, the MRF was applied in each iteration to enhance the classification accuracy. It was verified quantitatively and visually that the proposed method could effectively generate the land cover maps.

RPC Correction of KOMPSAT-3A Satellite Image through Automatic Matching Point Extraction Using Unmanned Aerial Vehicle Imagery (무인항공기 영상 활용 자동 정합점 추출을 통한 KOMPSAT-3A 위성영상의 RPC 보정)

  • Park, Jueon;Kim, Taeheon;Lee, Changhui;Han, Youkyung
    • Korean Journal of Remote Sensing
    • /
    • v.37 no.5_1
    • /
    • pp.1135-1147
    • /
    • 2021
  • In order to geometrically correct high-resolution satellite imagery, a sensor modeling process that restores the geometric relationship between the satellite sensor and the ground surface at the image acquisition time is required. In general, high-resolution satellites provide RPC (Rational Polynomial Coefficient) information, but the vendor-provided RPC includes geometric distortion caused by the position and orientation of the satellite sensor. GCPs (Ground Control Points) are generally used to correct the RPC errors. The representative method of acquiring GCPs is a field survey to obtain accurate ground coordinates. However, it is difficult to find the GCPs in the satellite image due to the quality of the image, land cover change, relief displacement, etc. By using image maps acquired from various sensors as reference data, it is possible to automate the collection of GCPs through an image matching algorithm. In this study, the RPC of a KOMPSAT-3A satellite image was corrected using matching points extracted from UAV (Unmanned Aerial Vehicle) imagery. We propose a pre-processing method for the extraction of matching points between the UAV imagery and the KOMPSAT-3A satellite image. To this end, we compared the characteristics of matching points extracted by independently applying SURF (Speeded-Up Robust Features) and phase correlation, which are representative feature-based and area-based matching methods, respectively. The RPC adjustment parameters were calculated using the matching points extracted through each algorithm. In order to verify the performance and usability of the proposed method, it was compared with the GCP-based RPC correction result. The GCP-based method showed an improvement in correction accuracy of 2.14 pixels for the sample and 5.43 pixels for the line compared to the vendor-provided RPC. 
In the proposed method using the SURF and phase correlation methods, the accuracy of the sample was improved by 0.83 pixels and 1.49 pixels, and that of the line was improved by 4.81 pixels and 5.19 pixels, respectively, compared to the vendor-provided RPC. The experimental results indicate that the proposed method using UAV imagery is a possible alternative to the GCP-based method for RPC correction.

Speech Recognition Using Linear Discriminant Analysis and Common Vector Extraction (선형 판별분석과 공통벡터 추출방법을 이용한 음성인식)

  • 남명우;노승용
    • The Journal of the Acoustical Society of Korea
    • /
    • v.20 no.4
    • /
    • pp.35-41
    • /
    • 2001
  • This paper describes Linear Discriminant Analysis and common vector extraction for speech recognition. A voice signal contains psychological and physiological properties of the speaker as well as dialect differences, acoustical environment effects, and phase differences. For these reasons, the same word spoken by different speakers can sound very different. This property of the speech signal makes it very difficult to extract common properties within the same speech class (word or phoneme). Linear algebra methods such as the KLT (Karhunen-Loeve Transformation) are generally used for extracting common properties from speech signals, but this paper uses the common vector extraction suggested by M. Bilginer et al. The method of M. Bilginer et al. extracts the optimized common vector from the speech signals used for training, and it achieves 100% recognition accuracy on the training data used for common vector extraction. Despite these characteristics, the method has some drawbacks: a large number of speech signals cannot be used for training, and the discriminant information among common vectors is not defined. This paper suggests an advanced method that can reduce the error rate by maximizing the discriminant information among common vectors. A novel method to normalize the size of the common vector is also added. The results show improved algorithm performance and recognition accuracy 2% higher than the conventional method.

An Automatic Extraction of English-Korean Bilingual Terms by Using Word-level Presumptive Alignment (단어 단위의 추정 정렬을 통한 영-한 대역어의 자동 추출)

  • Lee, Kong Joo
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.2 no.6
    • /
    • pp.433-442
    • /
    • 2013
  • A set of bilingual terms is one of the most important factors in building language-related applications such as a machine translation system and a cross-lingual information system. In this paper, we introduce a new approach that automatically extracts candidate English-Korean bilingual terms by using a bilingual parallel corpus and a basic English-Korean lexicon. This approach can be useful even when the parallel corpus is small. Sentence alignment is performed first on the document-level parallel corpus. Words between a pair of aligned sentences can then be aligned by referencing a basic bilingual lexicon. For words that remain unaligned between a pair of aligned sentences, several assumptions are applied in order to align bilingual term candidates of the two languages. Examples of the assumptions are the location of a sentence, the relation between words, and linguistic information between the two languages. An experimental result shows approximately 71.7% accuracy for the English-Korean bilingual term candidates automatically extracted from a bilingual parallel corpus of size 1,000.

Automatic Face Extraction with Unification of Brightness Distribution in Candidate Region and Triangle Structure among Facial Features (후보영역의 밝기 분산과 얼굴특징의 삼각형 배치구조를 결합한 얼굴의 자동 검출)

  • 이칠우;최정주
    • Journal of Korea Multimedia Society
    • /
    • v.3 no.1
    • /
    • pp.23-33
    • /
    • 2000
  • In this paper, we describe an algorithm that can extract human faces in natural poses from complex backgrounds. The method is based on the observation that a facial region has nearly the same gray level for all pixels within appropriately scaled blocks. Based on this idea, we develop a hierarchical process: first, block image data with a pyramid structure is generated from the input image; next, candidate facial regions in the block image are quickly determined; finally, the detailed facial features (organs) are located. To find the features easily, we introduce a local gray level transform that emphasizes small, dark regions, and we estimate the geometrical triangle constraints among the facial features. The merit of our method is that it avoids the parameter assignment problem, since the algorithm uses a simple brightness computation; consequently, robust systems that do not depend on specific parameter values can be easily constructed.

Audio Segmentation and Classification Using Support Vector Machine and Fuzzy C-Means Clustering Techniques (서포트 벡터 머신과 퍼지 클러스터링 기법을 이용한 오디오 분할 및 분류)

  • Nguyen, Ngoc;Kang, Myeong-Su;Kim, Cheol-Hong;Kim, Jong-Myon
    • The KIPS Transactions:PartB
    • /
    • v.19B no.1
    • /
    • pp.19-26
    • /
    • 2012
  • The rapid increase of information imposes new demands on content management. The purpose of automatic audio segmentation and classification is to meet the rising need for efficient content management. For this reason, this paper proposes a high-accuracy algorithm that segments audio signals and classifies them into different classes such as speech, music, silence, and environmental sounds. The proposed algorithm uses a support vector machine (SVM) to detect audio-cuts, which are boundaries between different kinds of sounds, using the parameter sequence. We then extract feature vectors composed of statistical data and use them as input to a fuzzy c-means (FCM) classifier that partitions audio segments into different classes. To evaluate the segmentation and classification performance of the proposed SVM-FCM based algorithm, we consider precision and recall rates for segmentation and classification accuracy for classification. Furthermore, we compare the proposed algorithm with other methods, including binary and FCM classifiers, in terms of segmentation performance. Experimental results show that the proposed algorithm outperforms the other methods in both precision and recall rates.
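
The fuzzy c-means step mentioned in this abstract can be sketched with the standard FCM update rules. The abstract does not list the statistical features used, so the two-dimensional points below are hypothetical stand-ins for per-segment feature vectors, and the initial centers and the fuzzifier m = 2 are assumptions. The SVM-based audio-cut detection is not covered.

```python
import math


def membership(point, centers, m):
    """Fuzzy membership of one point in each cluster (values sum to 1)."""
    dists = [math.dist(point, center) for center in centers]
    if 0.0 in dists:  # the point sits exactly on one or more centers
        hits = dists.count(0.0)
        return [1.0 / hits if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]


def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Alternate membership and center updates for a fixed number of rounds."""
    centers = [tuple(center) for center in centers]
    dims = len(points[0])
    for _ in range(iterations):
        memberships = [membership(p, centers, m) for p in points]
        centers = [
            tuple(
                sum(u[i] ** m * p[k] for u, p in zip(memberships, points))
                / sum(u[i] ** m for u in memberships)
                for k in range(dims)
            )
            for i in range(len(centers))
        ]
    return centers, [membership(p, centers, m) for p in points]


# Hypothetical per-segment features, e.g. two normalized statistics.
segments = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
            (0.9, 1.0), (1.0, 0.8), (0.8, 0.9)]
centers, memberships = fuzzy_c_means(segments, [(0.0, 0.0), (1.0, 1.0)])

for segment, u in zip(segments, memberships):
    label = max(range(len(u)), key=u.__getitem__)  # hard label from fuzzy memberships
    print(segment, label, [round(v, 3) for v in u])
```

Taking the cluster with the highest membership gives a hard class label per segment; mapping clusters to names such as speech or music would need labeled examples, which this sketch does not include.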