• Title/Summary/Keyword: Automatic Information Extraction


Korean Word Sense Disambiguation using Dictionary and Corpus (사전과 말뭉치를 이용한 한국어 단어 중의성 해소)

  • Jeong, Hanjo;Park, Byeonghwa
    • Journal of Intelligence and Information Systems / v.21 no.1 / pp.1-13 / 2015
  • As opinion mining in big data applications has been highlighted, a great deal of research on unstructured data has been conducted. Social media services on the Internet generate unstructured or semi-structured data every second, mostly in the natural languages we use in daily life. Many words in human languages have multiple meanings, or senses, and as a result it is very difficult for computers to extract useful information from these datasets. Traditional web search engines are usually based on keyword search, which often yields incorrect results far from users' intentions. Even though much progress has been made over the years in enhancing search engines so that they provide users with appropriate results, there is still considerable room for improvement. Word sense disambiguation plays a very important role in natural language processing and is considered one of the most difficult problems in this area. Major approaches to word sense disambiguation can be classified as knowledge-based, supervised corpus-based, and unsupervised corpus-based. This paper presents a method that automatically generates a corpus for word sense disambiguation by taking advantage of the examples in existing dictionaries, avoiding expensive sense-tagging processes. We evaluate the effectiveness of the method with a Naïve Bayes model, a supervised learning algorithm, using the Korean standard unabridged dictionary and the Sejong Corpus. The Korean standard unabridged dictionary contains approximately 57,000 sentences; the Sejong Corpus contains about 790,000 sentences tagged with both part-of-speech and sense information. For the experiments of this study, the Korean standard unabridged dictionary and the Sejong Corpus were used both combined and separately, with cross validation. Only nouns, the target subjects of word sense disambiguation, were selected: 93,522 word senses among 265,655 nouns, plus 56,914 sentences from related proverbs and examples, were combined into the corpus. The Sejong Corpus was easily merged with the Korean standard unabridged dictionary because it is tagged with the sense indices defined by that dictionary. Sense vectors were formed after the merged corpus was created. Terms used in creating the sense vectors were added to the named entity dictionary of the Korean morphological analyzer. Using the extended named entity dictionary, term vectors were extracted from the input sentences. Given an extracted term vector and the sense vector model built during the pre-processing stage, the sense tags for terms were determined by vector space model based word sense disambiguation. In addition, this study shows the effectiveness of the corpus merged from the examples in the Korean standard unabridged dictionary and the Sejong Corpus: the experiments show that better precision and recall are obtained with the merged corpus. This suggests the method can practically enhance the performance of Internet search engines and help capture the more accurate meaning of a sentence in natural language processing tasks pertinent to search engines, opinion mining, and text mining. The Naïve Bayes classifier used in this study is a supervised learning algorithm based on Bayes' theorem, under the assumption that all senses are independent. Even though this assumption is not realistic and ignores the correlations between attributes, the Naïve Bayes classifier is widely used because of its simplicity, and in practice it is known to be very effective in many applications such as text classification and medical diagnosis. However, further research is needed to consider all possible combinations and/or partial combinations of the senses in a sentence. The effectiveness of word sense disambiguation may also be improved if rhetorical structures or morphological dependencies between words are analyzed through syntactic analysis.
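
The vector-space disambiguation step lends itself to a compact illustration. Below is a minimal sketch, assuming a toy sense-tagged example set (not the actual dictionary or Sejong Corpus data): sense vectors are aggregated term counts over each sense's example sentences, and an input sentence is assigned the sense whose vector has the highest cosine similarity to the sentence's term vector.

```python
# Minimal sketch of vector-space word sense disambiguation.
# The toy corpus and sense labels are hypothetical illustrations.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical sense-tagged example sentences for the ambiguous noun "bank".
tagged_examples = {
    "bank#1 (finance)": ["deposit money at the bank", "the bank raised interest rates"],
    "bank#2 (river)":   ["fishing on the river bank", "the bank of the stream eroded"],
}

# Build one aggregated sense vector per sense from its example sentences.
sense_vectors = {
    sense: Counter(t for s in examples for t in s.split())
    for sense, examples in tagged_examples.items()
}

def disambiguate(sentence: str) -> str:
    term_vector = Counter(sentence.split())
    return max(sense_vectors, key=lambda s: cosine(term_vector, sense_vectors[s]))

print(disambiguate("she opened an account at the bank"))  # bank#1 (finance)
print(disambiguate("reeds grew along the river bank"))    # bank#2 (river)
```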

A Study on the Integration of Airborne LiDAR and UAV Data for High-resolution Topographic Information Construction of Tidal Flat (갯벌지역 고해상도 지형정보 구축을 위한 항공 라이다와 UAV 데이터 통합 활용에 관한 연구)

  • Kim, Hye Jin;Lee, Jae Bin;Kim, Yong Il
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography / v.38 no.4 / pp.345-352 / 2020
  • To preserve and restore tidal flats and prevent safety accidents, it is necessary to construct tidal flat topographic information, including the exact location and shape of tidal creeks. In tidal flats, where field surveying is difficult to apply, airborne LiDAR surveying can provide accurate terrain data over a wide area, while UAV (Unmanned Aerial Vehicle) surveying can economically provide relatively high-resolution data. In this study, we proposed a methodology to effectively generate high-resolution topographic information of tidal flats by integrating airborne LiDAR and UAV point clouds. For this purpose, automatic ICP (Iterative Closest Points) registration between the two datasets was conducted, and tidal creeks were extracted by applying the CSF (Cloth Simulation Filtering) algorithm. We then integrated the high-density UAV data for tidal creeks with the airborne LiDAR data for flat ground. A DEM (Digital Elevation Model) and tidal flat area and depth were generated from the integrated data to construct high-resolution topographic information for large-scale tidal flat map creation. As a result, the UAV data was registered without GCPs (Ground Control Points), and integrated data including detailed topographic information of tidal creeks was generated with a relatively small data size.
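
As a rough illustration of the registration step, the sketch below uses the open-source Open3D library (an assumed implementation choice; the paper does not name its software) to run point-to-plane ICP between hypothetical airborne LiDAR and UAV point cloud files.

```python
# Sketch: register a UAV point cloud onto airborne LiDAR with ICP (no GCPs).
import open3d as o3d
import numpy as np

# Load the two datasets (hypothetical file names).
airborne = o3d.io.read_point_cloud("airborne_lidar.ply")  # target: wide-area terrain
uav = o3d.io.read_point_cloud("uav_points.ply")           # source: high-density UAV data

# Normals are needed for point-to-plane ICP on the flat tidal-flat terrain.
for pc in (airborne, uav):
    pc.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(radius=2.0, max_nn=30))

# ICP iteratively minimizes distances between closest point pairs.
result = o3d.pipelines.registration.registration_icp(
    uav, airborne,
    max_correspondence_distance=1.0,  # metres; tune to the expected offset
    init=np.eye(4),                   # initial guess: identity
    estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPlane(),
)
uav.transform(result.transformation)
print("fitness:", result.fitness, "inlier RMSE:", result.inlier_rmse)
# Tidal creeks would next be separated with a CSF ground filter and the two
# clouds merged: UAV points for creeks, airborne LiDAR for flat ground.
```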

Semantic Topic Selection Method of Document for Classification (문서분류를 위한 의미적 주제선정방법)

  • Ko, Kwang-Sup;Kim, Pan-Koo;Lee, Chang-Hoon;Hwang, Myung-Gwon
    • Journal of the Korea Institute of Information and Communication Engineering / v.11 no.1 / pp.163-172 / 2007
  • As a global network, the web includes text documents, video, sound, and other media, and connects distributed information through links. As the web has developed, it has accumulated abundant information, most of it in text-based documents. Most users use the web to retrieve the information they want, so numerous studies have explored ways to retrieve text documents using methods such as probability, statistics, vector similarity, and Bayesian classification. These studies, however, consider neither the subject nor the semantics of documents, so users have to search again by hand. Finding Korean documents is especially hard because research on Korean document classification is insufficient. To overcome these problems, we propose a Korean document classification method for semantic retrieval. The method first extracts the TF and RV values of the concepts included in a document, and then maps them into U-WIN, a Korean vocabulary dictionary, to select the topic of the document. The method makes it possible to classify documents semantically, and its efficiency is shown through experiments.
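
A minimal sketch of the topic-selection idea follows. The TF computation is standard; the RV values and the U-WIN lookup are mocked with a hypothetical relatedness table, since the abstract does not give the exact formulas.

```python
# Sketch: pick document topics by combining term frequency (TF) with a
# lexical-network relatedness value (RV). The mock_rv table stands in for
# an actual U-WIN lookup and is purely illustrative.
from collections import Counter

def select_topics(document: str, relatedness: dict, top_k: int = 2):
    terms = document.lower().split()
    tf = Counter(terms)
    # Hypothetical combined weight: TF x RV, where RV approximates how strongly
    # a term is connected to the document's other terms in a lexical network.
    weights = {t: tf[t] * relatedness.get(t, 0.1) for t in tf}
    return sorted(weights, key=weights.get, reverse=True)[:top_k]

doc = "the network routes packets and the router forwards packets to the network"
mock_rv = {"network": 0.9, "packets": 0.8, "router": 0.7}  # stand-in for U-WIN
print(select_topics(doc, mock_rv))  # ['network', 'packets']
```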

Automatic Quality Evaluation with Completeness and Succinctness for Text Summarization (완전성과 간결성을 고려한 텍스트 요약 품질의 자동 평가 기법)

  • Ko, Eunjung;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.24 no.2 / pp.125-148 / 2018
  • Recently, as the demand for big data analysis has increased, cases of analyzing unstructured data and using the results are also increasing. Among the various types of unstructured data, text is used as a means of communicating information in almost all fields. In addition, many analysts are interested in text because the amount of data is very large and it is relatively easy to collect compared to other unstructured and structured data. Among the various text analysis applications, document classification, which assigns documents to predetermined categories; topic modeling, which extracts major topics from a large number of documents; sentiment analysis or opinion mining, which identifies emotions or opinions contained in texts; and text summarization, which condenses the main contents of one or several documents, have been actively studied. In particular, text summarization is actively applied in business through news summary services, privacy policy summary services, etc. Much research has also been done in academia on both the extraction approach, which selectively presents the main elements of a document, and the abstraction approach, which extracts elements of a document and composes new sentences by combining them. However, techniques for evaluating the quality of automatically summarized documents have not progressed as much as automatic text summarization itself. Most existing studies on the quality evaluation of summarization manually summarize documents, use them as reference documents, and measure the similarity between the automatic summary and the reference document. Specifically, an automatic summary is produced from the full text through various techniques, and its quality is measured by comparison with the reference document, which serves as an ideal summary. Reference documents are provided in two major ways; the most common is manual summarization, in which a person creates an ideal summary by hand. Since this method requires human intervention in preparing the summary, it takes a great deal of time and cost, and the evaluation result may differ depending on who writes the summary. To overcome these limitations, attempts have been made to measure the quality of summary documents without human intervention. As a representative attempt, a method has recently been devised that reduces the size of the full text and measures the similarity between the reduced full text and the automatic summary. In this method, the more often the frequent terms of the full text appear in the summary, the better the quality of the summary is judged to be. However, since summarization essentially means condensing a large amount of content while minimizing omissions, a "good summary" based only on term frequency does not always mean a "good summary" in the essential sense. To overcome the limitations of previous studies on summarization evaluation, this study proposes an automatic quality evaluation method for text summarization based on the essential meaning of summarization. Specifically, succinctness is defined as an element indicating how little duplicated content there is among the sentences of the summary, and completeness is defined as an element indicating how little of the original content is missing from the summary. In this paper, we propose a method for automatic quality evaluation of text summarization based on these concepts of succinctness and completeness. To evaluate the practical applicability of the proposed methodology, 29,671 sentences were extracted from TripAdvisor's hotel reviews, the reviews were summarized by hotel, and the results of experiments evaluating the quality of the summaries according to the proposed methodology are presented. We also provide a way to integrate completeness and succinctness, which are in a trade-off relationship, into an F-score, and propose a method to perform optimal summarization by changing the threshold of sentence similarity.
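
The completeness/succinctness F-score can be illustrated with a small sketch. The sentence-similarity measure (bag-of-words cosine), the coverage threshold, and the exact counting rules below are illustrative assumptions rather than the paper's definitions.

```python
# Sketch: completeness counts source sentences covered by some summary sentence;
# succinctness penalizes summary sentences that duplicate each other; the two
# are combined as an F-score (harmonic mean). Formulas are assumptions.
from collections import Counter
import math

def cos(a: str, b: str) -> float:
    a, b = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def evaluate(source, summary, threshold=0.5):
    # Completeness: fraction of source sentences covered by the summary.
    covered = sum(any(cos(s, m) >= threshold for m in summary) for s in source)
    completeness = covered / len(source)
    # Succinctness: fraction of summary sentence pairs that do NOT duplicate each other.
    pairs = [(i, j) for i in range(len(summary)) for j in range(i + 1, len(summary))]
    dup = sum(cos(summary[i], summary[j]) >= threshold for i, j in pairs)
    succinctness = 1 - dup / len(pairs) if pairs else 1.0
    f = (2 * completeness * succinctness / (completeness + succinctness)
         if completeness + succinctness else 0.0)
    return completeness, succinctness, f

src = ["the room was clean", "staff were friendly", "breakfast was great"]
summ = ["the room was clean", "staff were friendly and kind"]
print(evaluate(src, summ))  # completeness ~0.67, succinctness 1.0, F ~0.8
```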

An Effective Feature Extraction Method for Fault Diagnosis of Induction Motors (유도전동기의 고장 진단을 위한 효과적인 특징 추출 방법)

  • Nguyen, Hung N.;Kim, Jong-Myon
    • Journal of the Korea Society of Computer and Information / v.18 no.7 / pp.23-35 / 2013
  • This paper proposes an effective technique to automatically extract feature vectors from vibration signals for fault classification systems. Conventional mel-frequency cepstral coefficients (MFCCs) are sensitive to the noise of vibration signals, degrading classification accuracy. To solve this problem, this paper proposes spectral envelope cepstral coefficients (SECC) analysis, in which a filter bank is built from the spectral envelopes of the vibration signals in four steps: (1) a linear predictive coding (LPC) algorithm is used to estimate the spectral envelopes of all faulty vibration signals; (2) all envelopes are averaged to obtain a general spectral shape; (3) a gradient descent method is used to find the extremes of the average envelope and their frequencies; (4) a non-overlapping filter bank is constructed whose band centers are calculated from the distances between the valley frequencies of the envelope. This 4-step filter bank is then used in the cepstral coefficient computation to extract feature vectors. Finally, a multi-layer support vector machine (MLSVM) with various sigma values uses these special parameters to identify the fault types of induction motors. Experimental results indicate that the proposed extraction method outperforms other feature extraction algorithms, yielding a classification accuracy of about 99.65%.
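
Step (1) of the filter-bank construction, the LPC spectral envelope, can be sketched as below, using librosa and SciPy as assumed libraries and a synthetic signal standing in for real motor vibration data; peak and valley frequencies are located with a peak finder rather than the paper's gradient descent.

```python
# Sketch: estimate a spectral envelope with LPC and locate its extremes.
import numpy as np
import librosa
import scipy.signal as sig

fs = 8000
t = np.arange(fs) / fs
# Hypothetical "vibration" signal: two tones plus noise.
y = np.sin(2*np.pi*120*t) + 0.5*np.sin(2*np.pi*1800*t) + 0.1*np.random.randn(fs)

a = librosa.lpc(y.astype(np.float64), order=12)  # LPC coefficients [1, a1..a12]
w, h = sig.freqz(1.0, a, worN=512, fs=fs)        # envelope = |1 / A(e^jw)|
envelope_db = 20 * np.log10(np.abs(h) + 1e-12)

# The filter-bank design needs the extremes of the (averaged) envelope;
# here scipy's peak finder stands in for the paper's gradient descent.
peaks, _ = sig.find_peaks(envelope_db)
valleys, _ = sig.find_peaks(-envelope_db)
print("peak freqs (Hz):", w[peaks].round(1), "valley freqs (Hz):", w[valleys].round(1))
```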

A STUDY ON THE ANALYSIS OF DIGITAL AERIAL PHOTO USING IMAGE SEGMENTATION (영상분할기법을 이용한 수치항공영상 해석에 관한 연구)

  • Kwon, Hyun;Lee, Hyun-Jik;Park, Hyo-Keun
    • Journal of Korean Society for Geospatial Information Science / v.2 no.2 s.4 / pp.131-142 / 1994
  • Generally, there are two methods for generating the base map of a Geo-Spatial Information System (GSIS): one is digitizing existing maps, and the other is analytical plotting, which edits data acquired by sensors using computers. Both methods are technically complex and have disadvantages in cost and time. The study region (the Kwangyang province) was photographed from aircraft at a scale of 1/6,000, and the area was divided into two specific regions: a residential area and an agricultural area. In this study, we developed an algorithm that generates the base map of the GSIS database from aerial photos. The algorithm is as follows. First, digital aerial images were generated from the aerial photos. Second, the digital images were enhanced by histogram equalization. Third, objects in the enhanced images were extracted by the thresholding and edge detection techniques of image segmentation. Finally, these images could be used to update the base map of the GSIS database. The results of this study showed that the method used here was more efficient than existing methods in cost and time.
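
The enhancement and segmentation steps map directly onto standard image-processing calls. The sketch below uses OpenCV as an assumed implementation, with a placeholder input file.

```python
# Sketch: histogram equalization, then thresholding and edge detection.
import cv2

img = cv2.imread("aerial_photo.png", cv2.IMREAD_GRAYSCALE)

# Step 2: enhance contrast with histogram equalization.
enhanced = cv2.equalizeHist(img)

# Step 3a: thresholding (Otsu picks the threshold automatically).
_, binary = cv2.threshold(enhanced, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Step 3b: edge detection to delineate object boundaries.
edges = cv2.Canny(enhanced, 50, 150)

cv2.imwrite("segmented_objects.png", binary)
cv2.imwrite("object_edges.png", edges)
```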


Touching Pigs Segmentation and Tracking Verification Using Motion Information (움직임 정보를 이용한 근접 돼지 분리와 추적 검증)

  • Park, Changhyun;Sa, Jaewon;Kim, Heegon;Chung, Yongwha;Park, Daihee;Kim, Hakjae
    • KIPS Transactions on Software and Data Engineering / v.7 no.4 / pp.135-144 / 2018
  • The domestic pigsty environment is highly vulnerable to the spread of respiratory diseases such as foot-and-mouth disease because of its small space. To manage this issue, a variety of studies have been conducted to automatically analyze the behavior of individual pigs in a pig pen through a camera-based video surveillance system. Although touching pigs must be correctly segmented to track each pig in complex situations such as aggressive behavior, detecting the correct boundaries among touching pigs using the Kinect's lower-accuracy depth information is a challenging issue. In this paper, we propose a segmentation method that uses the motion information of the touching pigs. In addition, our proposed method can be applied to detect tracking errors when tracking individual pigs in this complex environment. In the experimental results, we confirmed that touching pigs in a pig farm were separated with an accuracy of 86%, and that the tracking errors were detected accurately.
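
One way to realize motion-based separation is dense optical flow followed by clustering of motion directions; the sketch below takes this route with OpenCV. It is an illustrative reconstruction under the assumption that two touching pigs move in different directions, not the paper's exact algorithm, and the frame file names are placeholders.

```python
# Sketch: split touching objects by clustering per-pixel motion directions.
import cv2
import numpy as np

prev = cv2.imread("frame_t0.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)

# Dense optical flow (Farnebäck) between two consecutive frames.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])

# Keep only clearly moving pixels, then split them into two groups by motion
# direction with k-means; each group is taken as one pig of the touching pair.
moving = magnitude > 1.0
samples = angle[moving].astype(np.float32).reshape(-1, 1)
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
_, labels, _ = cv2.kmeans(samples, 2, None, criteria, 5, cv2.KMEANS_PP_CENTERS)

segmented = np.zeros(prev.shape, np.uint8)
segmented[moving] = (labels.ravel() + 1) * 100  # two gray levels: 100 and 200
cv2.imwrite("touching_pigs_split.png", segmented)
```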

The Automatic Extraction System of Application Update Information in Android Smart Device (안드로이드 스마트 기기 내의 애플리케이션 업데이트 정보 자동 추출 시스템)

  • Kim, Hyounghwan;Kim, Dohyun;Park, Jungheum;Lee, Sangjin
    • Journal of the Korea Institute of Information Security & Cryptology / v.24 no.2 / pp.345-352 / 2014
  • As the utilization rate of smart devices increases, various applications for smart devices have been developed. Since these applications can contain important data related to user behavior from a digital forensic perspective, their analysis should be conducted in advance. However, many applications adopt new data formats or types when they are updated. Therefore, whether each application has been updated must be checked one by one, and if so, whether its data has changed must also be analyzed. Observing application data repeatedly is a time-consuming task, which is why an effective method for dealing with this problem is needed. This paper suggests an automatic system that collects application information, obtains update information, and checks for changed data.
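
The update-detection loop can be sketched as a snapshot-and-diff over application version codes pulled from a connected device via adb. The watched package list and snapshot file below are placeholders, and the paper's system also inspects the changed data itself, which is omitted here.

```python
# Sketch: flag apps whose versionCode changed since the last snapshot,
# so their data formats can be re-examined.
import json
import re
import subprocess

def get_version(package: str) -> str | None:
    # `dumpsys package <pkg>` includes a "versionCode=..." field.
    out = subprocess.run(["adb", "shell", "dumpsys", "package", package],
                         capture_output=True, text=True).stdout
    m = re.search(r"versionCode=(\d+)", out)
    return m.group(1) if m else None

watched = ["com.kakao.talk", "com.android.chrome"]  # example packages
current = {pkg: get_version(pkg) for pkg in watched}

try:
    with open("app_versions.json") as f:
        previous = json.load(f)
except FileNotFoundError:
    previous = {}

updated = [p for p in watched if current[p] != previous.get(p)]
print("apps updated since last snapshot:", updated)  # re-examine their data

with open("app_versions.json", "w") as f:
    json.dump(current, f)
```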

A Method for Reconstructing Original Images for Caption Areas in Videos Using Block Matching Algorithm (블록 정합을 이용한 비디오 자막 영역의 원 영상 복원 방법)

  • 전병태;이재연;배영래
    • Journal of Broadcast Engineering / v.5 no.1 / pp.113-122 / 2000
  • It is sometimes necessary to remove the captions from video images already broadcast and recover the original images. When the number of images requiring such recovery is small, manual processing is possible, but as the number grows it becomes very difficult to do manually. Therefore, a method for recovering the original image in caption areas is needed. Traditional research on image restoration has focused on restoring blurred images to sharp images using frequency filtering, or on video coding for transferring video images. This paper proposes a method for automatically recovering the original image using a BMA (Block Matching Algorithm). We extract information on caption regions and scene changes, which is used as prior knowledge for recovering the original image. From the result of caption detection, we know the start and end frames of the captions in the video and the character areas within the caption regions. The direction of recovery is decided using the scene change information and the caption region information (the start and end frames of the captions). According to this direction, we recover the original image by performing block matching for the character components in the extracted caption region. Experimental results show that stationary images with little camera or object motion are recovered well, and that images with motion against a complex background are also recovered.
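
A toy version of the block-matching recovery follows: motion is estimated from a caption-free "context" block next to the caption, and the correspondingly displaced block is copied from a reference frame taken before the caption's start frame. The frames are synthetic placeholders and the scene is assumed nearly stationary, so this is a sketch of the BMA idea rather than the paper's full method.

```python
# Sketch: recover a caption-covered block via SAD block matching.
import numpy as np

def block_motion(ref, block, cy, cx, search=8):
    """Find (dy, dx) minimizing SAD between `block` and blocks of `ref` near (cy, cx)."""
    h, w = block.shape
    best, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = cy + dy, cx + dx
            if 0 <= y <= ref.shape[0] - h and 0 <= x <= ref.shape[1] - w:
                sad = np.abs(ref[y:y+h, x:x+w].astype(int) - block.astype(int)).sum()
                if sad < best_sad:
                    best, best_sad = (dy, dx), sad
    return best

rng = np.random.default_rng(0)
reference = rng.integers(0, 256, (64, 64), dtype=np.uint8)  # frame before caption start
current = reference.copy()
current[40:48, 24:32] = 255                                 # caption block to recover

# Estimate motion from the caption-free context block directly above the caption.
dy, dx = block_motion(reference, current[32:40, 24:32], 32, 24)
# Copy the correspondingly displaced block from the reference frame over the caption.
current[40:48, 24:32] = reference[40+dy:48+dy, 24+dx:32+dx]
print("recovered exactly:", np.array_equal(current, reference))  # True (static scene)
```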


Individual Ortho-rectification of Coast Guard Aerial Images for Oil Spill Monitoring (유출유 모니터링을 위한 해경 항공 영상의 개별정사보정)

  • Oh, Youngon;Bui, An Ngoc;Choi, Kyoungah;Lee, Impyeong
    • Korean Journal of Remote Sensing / v.38 no.6_1 / pp.1479-1488 / 2022
  • Oil spill accidents occur intermittently in the ocean due to ship collisions and sinkings. To prepare prompt countermeasures when such an accident occurs, it is necessary to accurately identify the current status of the spilled oil. To this end, the Coast Guard patrols the target area with a fixed-wing airplane or helicopter and checks it with the naked eye or on video, but it has been difficult to determine the area contaminated by the spilled oil and its exact location on the map. Accordingly, this study develops a technology for direct ortho-rectification that automatically geo-references aerial images collected by the Coast Guard, without individual ground control points, to identify the current status of spilled oil. First, the meta information required for georeferencing is extracted by optical character recognition (OCR) from the sensor-information display visualized on the video. Based on the extracted information, the exterior orientation parameters of each image are determined, and the images are individually orthorectified using these parameters. The accuracy of the individual orthoimages generated through this method was evaluated at about tens of meters, up to 100 m. This level of accuracy is reasonably acceptable considering the inherent errors of the position and attitude sensors, the inaccuracies in interior orientation parameters such as camera focal length, and the fact that no ground control points are used; it is judged to be appropriate for identifying the current status of oil-contaminated areas at sea. In the future, if real-time transmission of images captured during flight becomes possible, individual orthoimages can be generated in real time through the proposed individual orthorectification technology. Based on this, it can be effectively used to quickly identify the current status of spilled-oil contamination and establish countermeasures.
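
The OCR-based metadata extraction step might look like the sketch below, which uses pytesseract as an assumed OCR engine; the overlay layout, and therefore the regular expressions, are hypothetical and would have to match the actual sensor display burned into the Coast Guard video.

```python
# Sketch: OCR the sensor-information overlay in a video frame and parse the
# position/attitude values that feed the exterior orientation computation.
import re
import cv2
import pytesseract

frame = cv2.imread("coastguard_frame.png", cv2.IMREAD_GRAYSCALE)
# Binarize to make the burned-in text easier for the OCR engine to read.
_, overlay = cv2.threshold(frame, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
text = pytesseract.image_to_string(overlay)

# Hypothetical overlay fields: "LAT 35.1234 LON 129.5678 ALT 450 HDG 087"
fields = {
    "lat": r"LAT\s+([\d.]+)", "lon": r"LON\s+([\d.]+)",
    "alt": r"ALT\s+([\d.]+)", "heading": r"HDG\s+(\d+)",
}
eo = {k: float(m.group(1)) for k, p in fields.items()
      if (m := re.search(p, text))}
print("exterior orientation inputs:", eo)
```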