Search | Korea Science

Document classification using a deep neural network in text mining (텍스트 마이닝에서 심층 신경망을 이용한 문서 분류)

Lee, Bo-Hui;Lee, Su-Jin;Choi, Yong-Seok
- The Korean Journal of Applied Statistics / v.33 no.5 / pp.615-625 / 2020
In text mining, a document-term frequency matrix is built from terms extracted from documents for which group information exists. In this study, we generated a document-term frequency matrix for classifying documents by research field. We applied the traditional term weighting function, term frequency-inverse document frequency (TF-IDF), to the generated matrix, and we also applied term frequency-inverse gravity moment (TF-IGM). To improve classification accuracy, we additionally generated a document-keyword weighted matrix by extracting keywords. Based on the extracted keyword matrix, we classified documents using a deep neural network. To find the optimal model, we verified the classification accuracy while varying the number of hidden layers and hidden nodes. The model with eight hidden layers showed the highest accuracy, and TF-IGM gave higher classification accuracy than TF-IDF under every parameter setting. The deep neural network was also more accurate than the support vector machine. We therefore propose applying TF-IGM and a deep neural network to document classification.
https://doi.org/10.5351/KJAS.2020.33.5.615
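The abstract names the two weighting schemes without giving formulas. The sketch below assumes the usual textbook forms: TF-IDF as raw count times log(N / document frequency), and TF-IGM as count times (1 + lambda * igm), where igm is the largest class-specific frequency of a term divided by the rank-weighted sum of its class frequencies. The toy corpus, the function names, and lambda = 7 are assumptions for illustration, not values taken from the paper.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (class label, tokens). Not data from the paper.
DOCS = [
    ("stats", ["regression", "model", "data"]),
    ("stats", ["regression", "variance", "data"]),
    ("cs", ["network", "model", "data"]),
    ("cs", ["network", "layer", "data"]),
]


def tf_idf(docs):
    """Raw term count times log(N / document frequency), per document."""
    n_docs = len(docs)
    doc_freq = Counter(term for _, tokens in docs for term in set(tokens))
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(tokens).items()}
        for _, tokens in docs
    ]


def igm(docs):
    """Inverse gravity moment per term (assumed standard definition)."""
    # Class-specific frequency: number of documents of a class containing the term.
    class_freq = {}
    for label, tokens in docs:
        for term in set(tokens):
            class_freq.setdefault(term, Counter())[label] += 1
    result = {}
    for term, counts in class_freq.items():
        # Sort class frequencies in descending order; rank r starts at 1.
        ranked = sorted(counts.values(), reverse=True)
        result[term] = ranked[0] / sum(f * r for r, f in enumerate(ranked, start=1))
    return result


def tf_igm(docs, lam=7.0):
    """Raw term count times (1 + lam * igm); lam is an assumed coefficient."""
    g = igm(docs)
    return [
        {term: count * (1 + lam * g[term])
         for term, count in Counter(tokens).items()}
        for _, tokens in docs
    ]


tfidf_weights = tf_idf(DOCS)
tfigm_weights = tf_igm(DOCS)
```

In this toy corpus, "data" occurs in every document, so its TF-IDF weight is zero, while "regression" occurs in only one class and receives the largest IGM value. That class-aware contrast is the property the abstract credits for TF-IGM's higher accuracy.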

A Ranking Technique of XML Documents using Path Similarity for Expanded Query Processing (확장된 질의 처리를 위해 경로간 의미적 유사도를 고려한 XML 문서 순위화 기법)

Kim, Hyun-Joo;Park, So-Mi;Park, Seog
- Journal of KIISE:Databases / v.37 no.2 / pp.113-120 / 2010
XML is widely used for storing and processing data. XML specifies the structure of a document, and users can query it with XPath when they need information from a data document. An XPath query can be processed only when the terms and structure of the query match those of the document. However, many data documents today are written with different terminology and structures, so users cannot know the exact form of the target data. In practice, a target document often contains the information the user is looking for, or something similar to it. User queries should therefore be processed even when their term usage or structure differs slightly from that of the data document. To this end, we propose an XML document ranking method based on path similarity. The method measures the semantic similarity between a user query and a data document in three steps, using position, node, and relaxation factors.
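The exact-match limitation described above can be reproduced with a minimal sketch using Python's standard `xml.etree.ElementTree`. The sample document, the `SYNONYMS` table, and `relaxed_find` are hypothetical; the synonym lookup is only a stand-in for the idea of relaxing a query and does not implement the paper's position, node, and relaxation factors.

```python
import xml.etree.ElementTree as ET

# Hypothetical data document in which two records use different terminology.
XML_TEXT = """
<library>
  <book><title>XML Basics</title><author>Kim</author></book>
  <book><title>Query Processing</title><writer>Park</writer></book>
</library>
"""
root = ET.fromstring(XML_TEXT)

# An exact path query only matches records that use the same tag name.
exact = [e.text for e in root.findall("./book/author")]

# Hypothetical hand-made synonym table used to relax the last path step.
SYNONYMS = {"author": {"author", "writer"}}


def relaxed_find(tree_root, parent_tag, child_tag):
    """Return child texts whose tag equals child_tag or one of its synonyms."""
    accepted = SYNONYMS.get(child_tag, {child_tag})
    return [
        child.text
        for parent in tree_root.iter(parent_tag)
        for child in parent
        if child.tag in accepted
    ]


relaxed = relaxed_find(root, "book", "author")
```

The exact query misses the second record because it uses `writer` instead of `author`, while the relaxed lookup returns both. A ranking method would then need to score the relaxed matches lower than the exact ones.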

Research on Function and Policy for e-Government System using Semantic Technology (전자정부내 의미기반 기술 도입에 따른 기능 및 정책 연구)

Jang, Young-Cheol
- Journal of Korea Society of Industrial Information Systems / v.13 no.5 / pp.22-28 / 2008
This paper offers a solution based on semantic document classification to improve the utilization and efficiency of e-Government for people who use their own information retrieval systems and linguistic expressions. In general, semantic document classification classifies documents based on the diverse relationships between keywords in a document, without fully describing the hierarchical concepts between keywords. Our approach considers the deeper meaning within the context of the document and radically enhances information retrieval performance. We propose, test, and evaluate the Concept Weight Document Classification (CoWDC) method, which goes beyond existing keyword and simple thesaurus/ontology methods by fully considering the concept hierarchy of various concepts. The CoWDC test results were used to verify the superiority of semantic retrieval technology. To integrate it efficiently into e-Government, we find that creation of a thesaurus, management of the operating system, expansion of the knowledge base, and improvements in search service and accuracy at the national level are needed.

A Study on the Feasibility of Full-Text Information Retrieval System Based on Document Content Structure (문헌의 내용단위구조에 의한 전문검색시스템의 타당성 고찰)

Lee Byeong-Ki
- Journal of the Korean Society for Library and Information Science / v.32 no.1 / pp.129-154 / 1998
Online full-text databases are increasing, but conventional full-text information retrieval (IR) systems have been shown to have a high recall ratio and a low precision ratio. One disadvantage of full-text IR systems is that they are not designed to reflect the user's information need, because they are designed around the physical and logical structure of documents without considering document content. This study therefore examined the feasibility of using document content structure in a full-text IR system to resolve these disadvantages of conventional systems. We analyzed 180 journal articles to find a common structure of document content and developed a general model of the structure of journal articles. The results show that the user's cognitive schema structure, the user's information need, and the content structure of documents are related. We conclude that full-text IR systems need to be designed using document content structure in order to meet the user's information need more effectively.

Study on History Tracking Technique of the Document File through RSID Analysis in MS Word (MS 워드의 RSID 분석을 통한 문서파일 이력 추적 기법 연구)

Joun, Jihun;Han, Jaehyeok;Jung, Doowon;Lee, Sangjin
- Journal of the Korea Institute of Information Security & Cryptology / v.28 no.6 / pp.1439-1448 / 2018
Many electronic document files, including Microsoft Office Word (MS Word), have become a major issue in various legal disputes such as privacy, contract forgery, and trade secret leakage. The internal metadata of the OOXML (Office Open XML) format, which has been used since MS Word 2007, stores the unique Revision Identifier (RSID). The RSID is a distinct value assigned to a corresponding word, sentence, or paragraph that has been created, modified, or deleted after a document is saved. Document history, such as the addition, correction, or deletion of contents or the order of creation, can also be tracked using the RSID. In this paper, we propose a methodology that uses the changes in the RSID caused by user behavior to distinguish an original document from a copy and to investigate possible document file leakage.
https://doi.org/10.13089/JKIISC.2018.28.6.1439
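As a rough illustration of where RSIDs live, the sketch below counts `w:rsid*` attribute values in WordprocessingML markup. The inline sample XML, its RSID values, and the helper name `collect_rsids` are made up for the example; the paper's actual tracking methodology is not described in the abstract and is not reproduced here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML main namespace used by OOXML documents.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical fragment shaped like word/document.xml; RSID values are invented.
SAMPLE_DOCUMENT_XML = f"""
<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p w:rsidR="00A11111" w:rsidRDefault="00A11111">
      <w:r><w:t>Original sentence.</w:t></w:r>
      <w:r w:rsidR="00B22222"><w:t> Added later.</w:t></w:r>
    </w:p>
  </w:body>
</w:document>
"""


def collect_rsids(xml_text):
    """Count how often each RSID value appears in any w:rsid* attribute."""
    counts = Counter()
    for elem in ET.fromstring(xml_text).iter():
        for name, value in elem.attrib.items():
            # ElementTree expands prefixed attribute names to "{namespace}local".
            if name.startswith("{" + W_NS + "}rsid"):
                counts[value] += 1
    return counts


rsids = collect_rsids(SAMPLE_DOCUMENT_XML)
```

For a real `.docx` file, which is a ZIP archive, the bytes of `word/document.xml` read with the standard `zipfile` module could be passed to `collect_rsids` in the same way. Two files that share most RSID values are candidates for being an original and an edited copy, which is the kind of comparison the abstract describes.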

A Study on the Pattern Segmentation and Classification in Specially Documentated Imaged (제한된 문서 영상에서 패턴 분절과 구분 처리에 관한 연구)

옥철호;허도근;진용옥
- The Journal of Korean Institute of Communications and Information Sciences / v.14 no.6 / pp.663-674 / 1989
To design an automatic processing system for document images, we present methods for pattern segmentation and classification of document images. We implement contour extraction using a first-order differential operator of Gaussian distribution functions, image segmentation using the chain code, and pattern classification using second-order moments and a two-dimensional Rf distance (in the transform domain). Applied to restricted document images, the methods classify characters, fingerprints, seals, and similar patterns well, which verifies the utility of the algorithms used.
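Second-order moments, one of the classification features mentioned above, can be sketched as follows. The code computes the standard second-order central moments of a binary pattern. The function name and the toy patterns are assumptions, and the paper's exact feature set and its Rf distance are not specified in the abstract, so they are not modeled.

```python
def second_order_moments(pixels):
    """Return (mu20, mu02, mu11) for a binary image given as rows of 0/1."""
    points = [
        (x, y)
        for y, row in enumerate(pixels)
        for x, value in enumerate(row)
        if value
    ]
    n = len(points)
    # Centroid of the foreground pixels.
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Spread along x, spread along y, and their covariance-like cross term.
    mu20 = sum((x - cx) ** 2 for x, _ in points)
    mu02 = sum((y - cy) ** 2 for _, y in points)
    mu11 = sum((x - cx) * (y - cy) for x, y in points)
    return mu20, mu02, mu11


# Hypothetical toy patterns: a horizontal and a vertical stroke of 5 pixels.
horizontal_bar = [[1, 1, 1, 1, 1]]
vertical_bar = [[1], [1], [1], [1], [1]]

h_moments = second_order_moments(horizontal_bar)
v_moments = second_order_moments(vertical_bar)
```

The horizontal stroke has all of its spread in `mu20` and the vertical stroke has all of it in `mu02`, so even this small feature vector separates the two shapes.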

A Methodology for Automatic Multi-Categorization of Single-Categorized Documents (단일 카테고리 문서의 다중 카테고리 자동확장 방법론)

Hong, Jin-Sung;Kim, Namgyu;Lee, Sangwon
- Journal of Intelligence and Information Systems / v.20 no.3 / pp.77-92 / 2014
Recently, numerous documents including unstructured data and text have been created due to the rapid increase in the usage of social media and the Internet. Each document is usually provided with a specific category for the convenience of the users. In the past, the categorization was performed manually. However, manual categorization cannot guarantee accuracy, and it also requires a large amount of time and cost. Many studies have been conducted towards the automatic creation of categories to solve the limitations of manual categorization. Unfortunately, most of these methods cannot be applied to categorizing complex documents with multiple topics because the methods work by assuming that one document can be categorized into one category only. In order to overcome this limitation, some studies have attempted to categorize each document into multiple categories. However, they are also limited in that their learning process involves training using a multi-categorized document set. These methods therefore cannot be applied to multi-categorization of most documents unless multi-categorized training sets are provided. To overcome the limitation of the requirement of a multi-categorized training set by traditional multi-categorization algorithms, we propose a new methodology that can extend a category of a single-categorized document to multiple categories by analyzing relationships among categories, topics, and documents. First, we attempt to find the relationship between documents and topics by using the result of topic analysis for single-categorized documents. Second, we construct a correspondence table between topics and categories by investigating the relationship between them. Finally, we calculate the matching scores for each document to multiple categories.
The results imply that a document can be classified into a certain category if and only if the matching score is higher than the predefined threshold. For example, we can classify a certain document into three categories that have larger matching scores than the predefined threshold. The main contribution of our study is that our methodology can improve the applicability of traditional multi-category classifiers by generating multi-categorized documents from single-categorized documents. Additionally, we propose a module for verifying the accuracy of the proposed methodology. For performance evaluation, we performed intensive experiments with news articles. News articles are clearly categorized based on the theme, and they contain less vulgar language and slang than other ordinary text documents. We collected news articles from July 2012 to June 2013. The number of articles varies greatly across categories. This is because readers have different levels of interest in each category. Additionally, the result is also attributed to the differences in the frequency of the events in each category. In order to minimize the distortion of the result from the number of articles in different categories, we extracted 3,000 articles equally from each of the eight categories. Therefore, the total number of articles used in our experiments was 24,000. The eight categories were "IT Science," "Economy," "Society," "Life and Culture," "World," "Sports," "Entertainment," and "Politics." By using the news articles that we collected, we calculated the document/category correspondence scores by utilizing topic/category and document/topics correspondence scores. The document/category correspondence score can be said to indicate the degree of correspondence of each document to a certain category. As a result, we could present two additional categories for each of the 23,089 documents.
Precision, recall, and F-score were revealed to be 0.605, 0.629, and 0.617 respectively when only the top 1 predicted category was evaluated, whereas they were revealed to be 0.838, 0.290, and 0.431 when the top 1 - 3 predicted categories were considered. It was very interesting to find a large variation between the scores of the eight categories on precision, recall, and F-score.
https://doi.org/10.13088/jiis.2014.20.3.077
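The scoring and evaluation steps in this abstract can be sketched in a few lines. The F-score function is the standard harmonic mean of precision and recall, and it reproduces the reported values. The document/category score is assumed here to be a weighted sum of topic/category scores, which the abstract suggests but does not state; the weights, the threshold, and all names are hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)


# Check against the figures reported in the abstract.
top1_f = round(f_score(0.605, 0.629), 3)
top3_f = round(f_score(0.838, 0.290), 3)

# Hypothetical scores: one document over three topics, topics over two categories.
CATEGORIES = ["Economy", "Politics"]
doc_topic = [0.7, 0.3, 0.0]
topic_category = [
    [0.9, 0.1],  # topic 0
    [0.2, 0.8],  # topic 1
    [0.5, 0.5],  # topic 2
]


def category_scores(doc_row, topic_cat):
    """Assumed form: sum over topics of doc/topic weight times topic/category weight."""
    n_categories = len(topic_cat[0])
    return [
        sum(weight * row[c] for weight, row in zip(doc_row, topic_cat))
        for c in range(n_categories)
    ]


def assign(scores, labels, threshold):
    """Keep every category whose matching score exceeds the threshold."""
    return [label for label, score in zip(labels, scores) if score > threshold]


scores = category_scores(doc_topic, topic_category)
assigned = assign(scores, CATEGORIES, threshold=0.25)
```

With these toy numbers the document scores about 0.69 for "Economy" and 0.31 for "Politics", so a threshold of 0.25 assigns both categories, while a threshold of 0.5 would keep only the first.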

King's Status Reflected in The Joseon Dynasty's Document transmission System (조선 문서행이체제에 반영된 국왕의 위상)

Lee, Hyeongjung
- The Korean Journal of Archival Studies / no.66 / pp.203-227 / 2020
This article explores the influence of the king on the Joseon dynasty's document transmission system, focusing on some exceptional cases. Under Joseon law, the form of an official document depended on the difference in rank between sender and receiver. However, some institutions did not follow this general principle, namely Byungjo(兵曹), Seungjeongwon(承政院), and Kyujanggak(奎章閣). Byungjo was the ministry in charge of military administration. Seungjeongwon was a royal secretariat, in existence since early Joseon, that assisted the king and delivered his orders. Kyujanggak was a royal library and an institution assisting the king, established in the JeongJo(正祖) era. Byungjo was treated as a relatively high-ranking institution when it sent and received military-related documents. Seungjeongwon and Kyujanggak could send Kwanmoon(關文) to higher-ranking institutions, although Kwanmoon was the document form used for institutions of the same or lower rank. Conversely, higher-ranking institutions sent them Cheobjeong(牒呈), which the law stipulated as the document form for addressing a higher-ranking institution. They held these privileges in the document transmission system because Joseon had an administrative system centered on the king. Byungjo was entrusted with military power by the king, and Seungjeongwon and Kyujanggak were responsible for assisting the king and delivering his orders, so they could follow a different system for sending and receiving documents from other institutions. In conclusion, the Joseon dynasty made exceptions in document administration based on the existence of the king, which means that Joseon's document transmission system basically operated under a Confucian bureaucracy with the king at its peak.
https://doi.org/10.20923/kjas.2020.66.203

Development of Quality Document Management System Using Hypertext (하이퍼텍스트를 이용한 품질문서 관리시스템 구축 사례)

정현석;남호수;박동준;김호균
- Journal of Korean Society for Quality Management / v.28 no.3 / pp.104-113 / 2000
In this paper, we present a useful system for managing quality documents using the hypertext concept in the HANGUEL word processor. To develop this system, we classify all manuals, procedures, and forms into files. We construct a relationship chart of these files and hyperlink the files according to this chart. We apply this hypertext-based quality document management system to a small precision manufacturing firm by analyzing all of its quality documents. We confirm that the system effectively reduces the handling time of quality documents and supports consistent revision of quality documents.

The Study of Ancient Chinese and arrange SHI-JI document (고한어(古漢語) 연구와 <사기(史記)>문헌(文獻) 정리)

SEO, Weon Nam
- Cross-Cultural Studies / v.35 / pp.269-291 / 2014
China has countless ancient books that contain thousands of years of continuously recorded history across every generation. It is essential to organize this ancient cultural literature in order to communicate it better. The study of ancient Chinese literature has recently become a subject of priority. Shi-Ji, one of the representative documents, is used for record keeping because of its experience with ancient Chinese historic materials and artifacts. This draft of Shi-Ji is based on ancient Chinese research methods, with the purpose of exploring the character, phonology, syntax, exegesis, and collation of historical value.

Search Result 777, Processing Time 0.033 seconds
