• Title/Summary/Keyword: Document research


Clustering-based Statistical Machine Translation Using Syntactic Structure and Word Similarity (문장구조 유사도와 단어 유사도를 이용한 클러스터링 기반의 통계기계번역)

  • Kim, Han-Kyong;Na, Hwi-Dong;Li, Jin-Ji;Lee, Jong-Hyeok
    • Journal of KIISE:Software and Applications / v.37 no.4 / pp.297-304 / 2010
  • Clustering based on sentence type or document genre is a technique for improving the translation quality of SMT (statistical machine translation) through domain-specific translation, but no previous research has used sentence type and document genre information simultaneously. In this paper, we propose an integrated clustering method that classifies sentence type by syntactic structure similarity and document genre by word similarity. We interpolate domain-specific models built from the clusters with general models to improve the translation quality of the SMT system. A kernel function and the cosine measure are applied to compute structural similarity and word similarity, and these similarities are fed to a K-means-like machine learning algorithm for clustering. On a Japanese-English patent translation corpus we obtained a relative improvement of 2.5 percentage points in translation quality in the best case.
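
The paper does not include source code; purely as a hedged illustration of the word-similarity clustering step described above, the sketch below clusters documents with cosine-normalized TF-IDF vectors and a K-means algorithm. The syntactic kernel and the interpolation of domain-specific and general translation models are not shown, and all data and parameters are placeholders.

```python
# Hedged sketch: clustering documents by word similarity with a K-means-style
# algorithm, loosely following the idea of grouping a corpus into domains.
# This is NOT the authors' implementation; all data and parameters are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "a patent sentence about optical lenses",
    "claims describing a semiconductor process",
    "an abstract about wireless communication",
]

# L2-normalized TF-IDF vectors make Euclidean K-means behave like
# clustering by cosine similarity.
vectors = TfidfVectorizer(norm="l2").fit_transform(documents)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(vectors)
print(labels)  # cluster id per document; each cluster would train a domain-specific model
```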

Automatic Generation of Training Character Samples for OCR Systems

  • Le, Ha;Kim, Soo-Hyung;Na, In-Seop;Do, Yen;Park, Sang-Cheol;Jeong, Sun-Hwa
    • International Journal of Contents / v.8 no.3 / pp.83-93 / 2012
  • In this paper, we propose a novel method that automatically generates real character images to familiarize existing OCR systems with new fonts. First, we generate synthetic character images using a simple degradation model. The synthetic data is used to train an OCR engine, and the trained OCR is used to recognize and label real character images segmented from ideal document images. Since the OCR engine cannot accurately recognize all real character images, a substring matching method is employed to fix wrongly labeled characters by comparing two strings: the string formed by the recognized characters in an ideal document image, and the ordered string of characters that we intend to train and recognize. Based on this method, we build a system that automatically generates the 2,350 most common Korean characters and 117 alphanumeric characters from new fonts. The ideal document images used in the system are postal envelope images with characters printed in ascending order of their codes. The proposed system achieved a labeling accuracy of 99%. We therefore believe the system is effective in generating large numbers of character samples to improve the recognition rate of existing OCR systems on fonts they have never been trained on.
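
The abstract does not name the exact substring matching algorithm; assuming a standard sequence alignment, the following sketch illustrates how recognized labels could be corrected against the known ordered string using Python's difflib. The strings and the correction rule are illustrative only.

```python
# Hedged sketch of the label-correction idea: align the OCR output against the
# known, ordered ground-truth string and take the ground-truth character
# wherever the two strings can be matched. The real system's matching method
# is not specified in the abstract; difflib is used here only as an example.
from difflib import SequenceMatcher

ground_truth = "가각간갇갈감갑값"      # characters printed in ascending code order
ocr_output   = "가깍간갇갈감감값"      # labels produced by the imperfect OCR engine

corrected = list(ocr_output)
matcher = SequenceMatcher(None, ocr_output, ground_truth)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == "replace" and (i2 - i1) == (j2 - j1):
        # Same-length mismatch: trust the ordered ground-truth labels.
        corrected[i1:i2] = ground_truth[j1:j2]

print("".join(corrected))
```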

Cloud storage-based intelligent archiving system applying automatic document summarization (문서 자동요약 기술을 적용한 클라우드 스토리지 기반 지능적 아카이빙 시스템)

  • Yoo, Kee-Dong
    • Journal of Korea Society of Industrial Information Systems / v.17 no.3 / pp.59-68 / 2012
  • Zero-client cloud storage technology is attracting much interest as a tool for the centralized management of organizational documents. Beyond the well-known drawbacks of cloud storage, such as security and privacy protection, users of zero-client cloud storage point out the difficulty of browsing and selecting a storage category because of the categories' diversity and complexity. To resolve this problem, this study proposes a method of intelligent document archiving that applies algorithm-based automatic topic identification. Without the user directly specifying a category in which to store the working document, the proposed methodology and prototype archive working documents automatically into predefined categories according to the extracted topic. Based on the proposed ideas, more effective and efficient centralized management of electronic documents can be achieved.
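
As a rough illustration of routing a document into a predefined category by its extracted topic (not the paper's actual algorithm or prototype), the sketch below compares a document's TF-IDF vector with keyword profiles of hypothetical categories.

```python
# Hedged sketch (not the paper's implementation): route a document into the
# closest predefined category by comparing its TF-IDF vector with a keyword
# profile of each category. Category names and keywords are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

categories = {
    "finance": "budget invoice expense payment account",
    "hr":      "recruiting vacation payroll employee benefits",
    "rnd":     "experiment prototype patent research evaluation",
}

document = "quarterly expense report and payment schedule for the project account"

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(list(categories.values()) + [document])
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()

best = list(categories)[scores.argmax()]
print(best)  # the archive folder the document would be filed under
```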

BERT-based Classification Model for Korean Documents (한국어 기술문서 분석을 위한 BERT 기반의 분류모델)

  • Hwang, Sangheum;Kim, Dohyun
    • The Journal of Society for e-Business Studies / v.25 no.1 / pp.203-214 / 2020
  • Technical documents such as patents and R&D project reports need to be classified in order to understand trends in technology convergence, interdisciplinary joint research, technology development, and so on. Text mining techniques have mainly been used to classify such technical documents. However, classifying technical documents with conventional text mining algorithms has the disadvantage that the features representing the documents must be extracted explicitly beforehand. In this study, we propose a BERT-based document classification model that automatically extracts document features from the text of national R&D projects and classifies the documents. We then verify the applicability and performance of the proposed model for document classification.
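
A minimal sketch of a BERT-based document classifier, assuming the Hugging Face Transformers API; the paper's actual checkpoint, label set, and fine-tuning procedure are not stated in the abstract, so the multilingual checkpoint and five-label head below are placeholders.

```python
# Hedged sketch of a BERT-based document classifier using Hugging Face
# Transformers. The paper's real checkpoint, labels, and training setup are
# not given; "bert-base-multilingual-cased" and num_labels=5 are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=5)

texts = ["인공지능 기반 영상 인식 기술 개발에 관한 연구개발 과제 보고서"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# The [CLS] representation feeds a classification head; fine-tuning on labeled
# R&D documents (not shown) is what makes the predictions meaningful.
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1))  # predicted category index per document
```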

A Study on the 16th Century Food Culture of Chosun Dynasty Nobility in "Miam's Diary" (『미암일기(眉巖日記)』분석을 통한 16세기 사대부가(士大夫家) 음식문화 연구 - 정묘년(丁卯年)(1567년(年)) 10월(月)~무진년(戊辰年)(1568년(年)) 9월(月) -)

  • Kim, Mi-Hye
    • Journal of the Korean Society of Food Culture / v.28 no.5 / pp.425-437 / 2013
  • The aim of this study was to establish the identity of Korean traditional food based on the food preferences recorded during the Chosun Dynasty. Our primary source was the invaluable historical document known as "Miam's Diary," which records such food preferences from October 1567 to September 1568. By analyzing the income and expenditure recorded in the diary, we can vividly describe the traditional food preferences of the period. A detailed analysis of the diary summarizes the characteristics of a gentry family in the 16th century. First, it records that expenditure on food relied mainly on stipends and gifts received. The foods were diverse in kind, including rice, beans, chicken, pheasant, and seafood, and many were dried or pickled to keep them from spoiling. Second, it shows that food was paid out mainly as wages for servants; the income from selling such food items was used to purchase goods and land, and was also donated for funerals and weddings. Third, it records that day-to-day groceries were mostly obtained as gifts from people close to the family, such as neighbors, colleagues, relatives, or students, and that such gifts included small groceries, food items, and clothes. Fourth, the diary suggests that gentry families placed emphasis on customary family formalities as early as the late 16th century. Finally, it records that noblemen of the Chosun Dynasty believed they had to show warmth and affection by presenting generous gifts to their guests at home; noblemen of the period were very particular about welcoming guests, believing that this alone would testify to their status as noblemen.

Design of Log Management System based on Document Database for Big Data Management (빅데이터 관리를 위한 문서형 DB 기반 로그관리 시스템 설계)

  • Ryu, Chang-ju;Han, Myeong-ho;Han, Seung-jo
    • Journal of the Korea Institute of Information and Communication Engineering / v.19 no.11 / pp.2629-2636 / 2015
  • Recently, interest in Big Data management has increased rapidly in the IT field, and much research is being conducted to solve the problem of processing Big Data in real time. Storing data in real time over the network requires substantial resources, yet introducing an analysis system is difficult because of its high cost. The need to redesign such systems for low cost and high efficiency has therefore been growing. In this paper, a document-oriented database, MongoDB, which is well suited to managing big data, is used to design a log management system. The performance evaluation shows that the proposed log management system is more efficient than other methods in log collection and processing and is robust against data forgery.
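
As a hedged sketch of storing and querying logs in a document database (the paper's actual schema and deployment are not described in the abstract), the example below writes log records to MongoDB via pymongo and indexes them for time-range queries.

```python
# Hedged sketch of a document-database log store with MongoDB/pymongo.
# The paper's actual schema, collection names, and deployment are not given;
# everything below is an illustrative assumption.
from datetime import datetime, timezone
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
logs = client["logdb"]["logs"]

# Schemaless documents let heterogeneous log records share one collection.
logs.insert_one({
    "ts": datetime.now(timezone.utc),
    "host": "web-01",
    "level": "ERROR",
    "message": "connection timeout",
})

# An index on timestamp and level keeps time-range queries fast as volume grows.
logs.create_index([("ts", ASCENDING), ("level", ASCENDING)])
recent_errors = logs.find({"level": "ERROR"}).sort("ts", -1).limit(10)
for doc in recent_errors:
    print(doc["ts"], doc["message"])
```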

xPlaneb: 3-Dimensional Bitmap Index for XML Document Retrieval (xPlaneb: XML문서 검색을 위한 3차원 비트맵 인덱스)

  • 이재민;황병연
    • Journal of KIISE:Databases / v.31 no.3 / pp.331-339 / 2004
  • XML has become a new standard for data representation and exchange thanks to its many strengths, and it lies at the core of many new studies and emerging technologies. However, its self-describing nature, one of those strengths, has led to the spread of XML documents with different structures, which has raised the need for research on effective XML document retrieval. This paper analyzes the problems of BitCube, a bitmap indexing scheme whose fast retrieval yields high performance. To resolve those problems, we design and implement xPlaneb (XML Plane Web), a new 3-dimensional bitmap index built from linked lists. We propose an effective information retrieval technique by replacing BitCube operations with new ones and reconstructing BitCube's 3-dimensional array index with effective nodes. The performance evaluation shows that, as the number of documents increases, the proposed technique outperforms BitCube in memory consumption and operation speed.
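
The abstract does not detail xPlaneb's linked-list structure; the sketch below only illustrates the general three-dimensional bitmap-index idea of answering path/word queries with bit operations, using a NumPy boolean cube as a stand-in. Dimensions and data are invented.

```python
# Hedged sketch of the general 3-dimensional bitmap-index idea
# (documents x paths x words), not of xPlaneb's linked-list structure itself.
import numpy as np

docs  = ["d1", "d2", "d3"]
paths = ["/book/title", "/book/author", "/book/price"]
words = ["xml", "kim", "10000"]

# cube[d, p, w] == True  <=>  document d contains word w under path p
cube = np.zeros((len(docs), len(paths), len(words)), dtype=bool)
cube[0, 0, 0] = True   # d1: /book/title contains "xml"
cube[1, 1, 1] = True   # d2: /book/author contains "kim"
cube[2, 0, 0] = True   # d3: /book/title contains "xml"

# Query: which documents have the word "xml" under /book/title?
p, w = paths.index("/book/title"), words.index("xml")
hits = [docs[d] for d in np.nonzero(cube[:, p, w])[0]]
print(hits)  # ['d1', 'd3'] -- answered with bit operations instead of tree traversal
```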

Analysis of the National Police Agency business trends using text mining (텍스트 마이닝 기법을 이용한 경찰청 업무 트렌드 분석)

  • Sun, Hyunseok;Lim, Changwon
    • The Korean Journal of Applied Statistics / v.32 no.2 / pp.301-317 / 2019
  • There has been significant research on discovering insights from text data using statistical techniques. In this study we analyzed text data produced by the Korean National Police Agency to identify trends in its work by year and to compare work characteristics among local authorities by identifying distinctive keywords in the documents each authority produced. Preprocessing appropriate to each dataset was conducted, and the word frequency for each document was calculated in order to draw meaningful conclusions. Because the simple term frequencies in a document hardly capture the characteristics of its keywords, the frequency of each term was recalculated using term frequency-inverse document frequency (TF-IDF) weights, and L2-norm normalization was used to compare word frequencies across documents. The analysis can serve as basic data for future police work improvement policies and as a method to improve the efficiency of the police service, for example by identifying demand for improvements in indoor work.
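
A minimal sketch of the weighting step described above, assuming scikit-learn: TF-IDF with L2 normalization, followed by reading off each document's top-weighted terms as its distinctive keywords. The police corpus and Korean-specific preprocessing are not reproduced.

```python
# Hedged sketch of the weighting step only: TF-IDF with L2 normalization, then
# the top-weighted terms per document as its "distinctive keywords".
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "traffic enforcement drunk driving crackdown traffic accident",
    "cybercrime phishing investigation online fraud",
    "community patrol school zone safety patrol",
]

vectorizer = TfidfVectorizer(norm="l2")          # L2-normalized TF-IDF rows
tfidf = vectorizer.fit_transform(corpus).toarray()
terms = np.array(vectorizer.get_feature_names_out())

for i, row in enumerate(tfidf):
    top = terms[row.argsort()[::-1][:3]]
    print(f"document {i}: {', '.join(top)}")
```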

A Study on the Effect of Introduction of Smart Bills of Lading in International Commerce Transactions (무역거래에서 스마트 선하증권 도입의 필요성과 효과에 관한 연구)

  • Yang-Kee Lee;Ki-Young Lee;Jong-Seon Kim
    • Korea Trade Review / v.46 no.6 / pp.93-107 / 2021
  • The bill of lading links imports and exports. It is the last document issued in the export process and, as the first document an importer needs in order to take over the goods, the most important document on the import side; the right to the goods being transported can be transferred to another party through endorsement. The role and importance of the bill of lading have already been shown in many previous studies, and trading partners are fully aware of them. In addition, countries and international organizations have recognized this importance and enacted various laws and systems to address possible legal problems, which have become customary in practice. However, trade fraud that exploits the characteristics of the bill of lading can still occur. To solve this problem, various attempts at transferring the rights in a bill of lading electronically began, and many institutions and companies are still trying to develop new systems. As a result of these efforts, electronic bills of lading such as Bolero appeared, which can transfer rights in an electronic way. Therefore, this study examines the current status of electronic bills of lading and of blockchain-based bills of lading, and then presents the feasibility of introducing smart bills of lading.

Improving the Accuracy of Document Classification by Learning Heterogeneity (이질성 학습을 통한 문서 분류의 정확성 향상 기법)

  • Wong, William Xiu Shun;Hyun, Yoonjin;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.24 no.3 / pp.21-44 / 2018
  • In recent years, the rapid development of internet technology and the popularization of smart devices have produced massive amounts of text data, distributed through media platforms such as the World Wide Web, Internet news feeds, microblogs, and social media. This enormous amount of easily obtained information, however, lacks organization, which has drawn the interest of many researchers seeking to manage it and has created demand for professionals capable of classifying relevant information; hence text classification was introduced. Text classification, a challenging task in modern data analysis, assigns a text document to one or more predefined categories or classes. Various techniques are available for it, such as K-Nearest Neighbor, the Naïve Bayes algorithm, Support Vector Machines, Decision Trees, and Artificial Neural Networks. When dealing with huge amounts of text data, however, model performance and accuracy become a challenge: depending on the type of words used in the corpus and the type of features created for classification, the performance of a text classification model can vary. Most previous attempts propose a new algorithm or modify an existing one, and this line of research can be said to have reached certain limits. In this study, instead of proposing or modifying an algorithm, we focus on modifying how the data are used. It is widely known that classifier performance is influenced by the quality of the training data on which the classifier is built. Real-world datasets usually contain noise, and such noisy data can affect the decisions made by classifiers built from them. In this study, we consider that data from different domains, that is, heterogeneous data, may carry noise-like characteristics that can be exploited in the classification process. Machine learning algorithms build a classifier under the assumption that the characteristics of the training data and the target data are the same or very similar. For unstructured data such as text, however, the features are determined by the vocabulary of the documents, so if the viewpoints of the training data and target data differ, their features may also differ. We therefore attempt to improve classification accuracy by strengthening the robustness of the document classifier through artificially injecting noise into the process of constructing it. Because data from various sources are likely to be formatted differently, traditional machine learning algorithms struggle to handle different data representations at once within a single generalization. To utilize heterogeneous data in the learning process of the document classifier, we apply semi-supervised learning. However, unlabeled data may degrade the performance of the document classifier. We therefore further propose a method called the Rule Selection-Based Ensemble Semi-Supervised Learning Algorithm (RSESLA) to select only the documents that contribute to improving the classifier's accuracy. RSESLA creates multiple views by manipulating the features with different types of classification models and different types of heterogeneous data, and the most confident classification rules are selected and applied for the final decision. In this paper, three types of real-world data sources were used: news, Twitter, and blogs.
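
RSESLA itself is not specified in enough detail here to reproduce; as a hedged sketch of only the underlying semi-supervised idea, the example below runs a self-training loop that adds unlabeled documents to the training set when the current classifier's prediction confidence exceeds an arbitrary threshold.

```python
# Hedged sketch of the underlying semi-supervised idea only: a self-training
# loop that adds unlabeled documents to the training set when the current
# classifier is confident about them. RSESLA's multi-view, rule-selection
# ensemble is more elaborate and is not reproduced here.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled_texts = ["stock market rally", "midfield goal highlights",
                 "interest rate decision", "league title race"]
labels = np.array([0, 1, 0, 1])                  # 0 = finance, 1 = sports
unlabeled_texts = ["bond yields climb", "striker scores twice", "quarterly earnings beat"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(labeled_texts + unlabeled_texts)
X_lab, X_unlab = X[: len(labeled_texts)], X[len(labeled_texts):]

clf = LogisticRegression().fit(X_lab, labels)
proba = clf.predict_proba(X_unlab)

# Keep only documents the classifier is confident about (threshold is arbitrary).
confident = proba.max(axis=1) >= 0.6
X_aug = np.vstack([X_lab.toarray(), X_unlab.toarray()[confident]])
y_aug = np.concatenate([labels, proba.argmax(axis=1)[confident]])

clf = LogisticRegression().fit(X_aug, y_aug)     # retrain on the augmented set
print(clf.predict(vectorizer.transform(["penalty shootout drama"])))
```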