• Title/Abstract/Keyword: Text data

Search Results: 2,953

Privacy-Preserving Language Model Fine-Tuning Using Offsite Tuning (프라이버시 보호를 위한 오프사이트 튜닝 기반 언어모델 미세 조정 방법론)

  • Jinmyung Jeong;Namgyu Kim
    • Journal of Intelligence and Information Systems / v.29 no.4 / pp.165-184 / 2023
  • Recently, deep learning analysis of unstructured text data with language models such as Google's BERT and OpenAI's GPT has shown remarkable results in various applications. Most language models learn generalized linguistic information from pre-training data and then update their weights for downstream tasks through fine-tuning. However, privacy concerns have been raised about this process: data privacy may be violated when the data owner hands large amounts of data to the model owner for fine-tuning, and conversely, if the model owner discloses the entire model to the data owner, the model's structure and weights are exposed, violating the privacy of the model. Offsite tuning was recently proposed to fine-tune language models while protecting privacy in such situations, but that study does not provide a concrete way to apply the methodology to text classification models. In this study, we propose a concrete method for applying offsite tuning with an additional classifier so that both model and data privacy are protected during multi-class fine-tuning on Korean documents. To evaluate the proposed methodology, we conducted experiments on about 200,000 Korean documents from five major fields (ICT, electrical, electronic, mechanical, and medical) provided by AIHub, and found that the proposed plug-in model outperforms both the zero-shot model and the offsite model in classification accuracy. A minimal sketch of this plug-in setup appears below.
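
The sketch below illustrates the offsite-tuning idea the abstract describes, in PyTorch: the data owner trains only two adapter blocks and an added classifier against a frozen, layer-dropped emulator of the private middle layers, and the trained parts are then plugged back into the full model. All dimensions, layer counts, and the simple layer-drop compression are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch of offsite tuning with an additional classifier.
# Assumption: the emulator is a frozen, layer-dropped copy of the private
# middle layers; only the adapters and the classifier are trained offsite.
import copy
import torch
import torch.nn as nn

D, N_CLASSES = 64, 5

def make_block():
    return nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)

# Model owner's full (private) backbone: adapters sandwich the middle layers.
bottom_adapter = make_block()                                   # sent to data owner
middle_full = nn.Sequential(*[make_block() for _ in range(8)])  # stays private
top_adapter = make_block()                                      # sent to data owner

# Emulator: a lossy stand-in for the middle stack (here: keep two of eight layers).
emulator = nn.Sequential(copy.deepcopy(middle_full[0]), copy.deepcopy(middle_full[4]))
for p in emulator.parameters():
    p.requires_grad_(False)                  # frozen on the data-owner side

classifier = nn.Linear(D, N_CLASSES)         # the paper's additional classifier

# Data owner: train adapters + classifier locally against the frozen emulator.
params = (list(bottom_adapter.parameters()) + list(top_adapter.parameters())
          + list(classifier.parameters()))
opt = torch.optim.AdamW(params, lr=1e-4)
x = torch.randn(8, 16, D)                    # stand-in for embedded Korean text
y = torch.randint(0, N_CLASSES, (8,))
for _ in range(3):
    h = top_adapter(emulator(bottom_adapter(x)))
    loss = nn.functional.cross_entropy(classifier(h.mean(dim=1)), y)
    opt.zero_grad(); loss.backward(); opt.step()

# Plug-in: the trained adapters + classifier return to the full private model.
with torch.no_grad():
    h = top_adapter(middle_full(bottom_adapter(x)))
    print(classifier(h.mean(dim=1)).argmax(dim=1))
```

The point of the split is that the data owner never sees the full middle stack, while the model owner never sees the raw documents, matching the privacy goal stated in the abstract.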

Decision Method of Importance of E-Mail based on User Profiles (사용자 프로파일에 기반한 전자 메일의 중요도 결정)

  • Lee, Samuel Sang-Kon
    • The KIPS Transactions:PartB / v.15B no.5 / pp.493-500 / 2008
  • Although modern users gather large amounts of data from the network, they want only the information they actually need, that is, the ability to extract the data that satisfy a query. Because previous studies rely on a single attribute of a document, such as word frequency, they cannot be considered effective clustering methods. What is needed is a clustering technique that can process electronic documents such as e-mail or XML, which carry tags in various formats. This paper studies extracting information for a user query based on multiple attributes. It proposes a method that extracts data such as the sender, the text type, deadline expressions in the body, and the title from an e-mail, and uses them for filtering. An experiment verifies that this multi-attribute clustering method is more accurate than existing methods based on word frequency alone. A small feature-extraction sketch follows.
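
Below is a hypothetical sketch, using Python's standard email module, of the multi-attribute extraction the abstract names (sender, text type, deadline expression in the body, title); the regular expression, the user profile, and the importance weights are invented for illustration, not the paper's values.

```python
# Multi-attribute e-mail feature extraction and a toy importance score.
import re
from email import message_from_string

RAW = """From: boss@example.com
Subject: Quarterly report
Content-Type: text/plain

Please send the final figures by 2008-11-30. Thanks."""

VIP_SENDERS = {"boss@example.com"}                    # assumed user profile
DEADLINE_RE = re.compile(r"\bby\s+(\d{4}-\d{2}-\d{2})\b")

def extract_features(raw: str) -> dict:
    msg = message_from_string(raw)
    body = msg.get_payload()
    m = DEADLINE_RE.search(body)
    return {
        "sender": msg["From"],
        "title": msg["Subject"],
        "text_type": msg.get_content_type(),
        "deadline": m.group(1) if m else None,        # time-limit syntax in body
    }

def importance(f: dict) -> float:
    score = 0.0
    score += 2.0 if f["sender"] in VIP_SENDERS else 0.0   # profile-based weight
    score += 1.5 if f["deadline"] else 0.0                # deadline found
    score += 0.5 if f["text_type"] == "text/plain" else 0.0
    return score

feats = extract_features(RAW)
print(feats, importance(feats))
```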

A Semi-supervised Dimension Reduction Method Using Ensemble Approach (앙상블 접근법을 이용한 반감독 차원 감소 방법)

  • Park, Cheong-Hee
    • The KIPS Transactions:PartD / v.19D no.2 / pp.147-150 / 2012
  • While LDA is a supervised dimension reduction method that finds projective directions maximizing separability between classes, its performance degrades severely when the number of labeled data points is small. Semi-supervised dimension reduction methods have recently been proposed that exploit abundant unlabeled data to overcome the shortage of labeled data. However, the matrix computations used in statistical dimension reduction make it difficult to utilize a large number of unlabeled samples, and beyond a point the extra information from unlabeled data may not justify the increase in processing time. To address these problems, we propose an ensemble approach to semi-supervised dimension reduction. Extensive experiments on text classification demonstrate the effectiveness of the proposed method. An illustrative ensemble sketch follows.
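
The abstract does not spell out the ensemble construction, so the sketch below shows one plausible reading with scikit-learn: each member fits LDA on the labeled set plus a small random slice of pseudo-labeled unlabeled data (keeping each member's matrix computation small), and the members' projections are concatenated. The kNN pseudo-labeling step and all sizes are assumptions.

```python
# One possible ensemble of semi-supervised LDA reductions (illustrative only).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(30, 20)); y_lab = rng.integers(0, 3, 30)  # scarce labels
X_unlab = rng.normal(size=(2000, 20))                # abundant unlabeled pool

# Assumed pseudo-labeling step: propagate labels with a simple kNN classifier.
pseudo = KNeighborsClassifier(3).fit(X_lab, y_lab).predict(X_unlab)

members, k = [], 5
for _ in range(k):
    idx = rng.choice(len(X_unlab), size=200, replace=False)  # small slice only
    X = np.vstack([X_lab, X_unlab[idx]])
    y = np.concatenate([y_lab, pseudo[idx]])
    members.append(LinearDiscriminantAnalysis(n_components=2).fit(X, y))

def reduce(X):
    # Ensemble projection: concatenate each member's low-dimensional output.
    return np.hstack([m.transform(X) for m in members])

print(reduce(X_lab).shape)   # (30, 10): k members x 2 components each
```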

Design and Implementation of Data Replication Web Agent between Heterogeneous DBMSs based on XML (XML 기반의 이기종 DBMS간 데이터 복제 웹 에이전트 설계 및 구현)

  • Yu, Sun-Young;Yim, Jae-Hong
    • Journal of Navigation and Port Research / v.26 no.4 / pp.427-433 / 2002
  • Since the HTML currently used on the Internet offers only a restricted set of tags, it is not easy to store information in, or extract data from, HTML documents. XML allows new tags to be defined, which makes storing information and extracting data straightforward, so XML handles information more easily than HTML and suits enterprises' need for data exchange between heterogeneous databases. This paper proposes a web agent for data replication between heterogeneous DBMSs (Database Management Systems). The web agent system manages databases on the web and exchanges data between heterogeneous databases using XML. We designed and implemented this web agent for data replication between heterogeneous DBMSs. A small replication sketch follows.
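
A minimal sketch of XML-mediated replication, with two in-memory SQLite databases standing in for the heterogeneous DBMSs; the table name and schema are invented for illustration.

```python
# Dump rows from a source DB to XML, then load them into a target DB.
import sqlite3
import xml.etree.ElementTree as ET

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE port (id INTEGER, name TEXT)")
src.executemany("INSERT INTO port VALUES (?, ?)", [(1, "Busan"), (2, "Incheon")])

# Source side: serialize the table to XML (the agent's exchange format).
root = ET.Element("table", name="port")
for rid, name in src.execute("SELECT id, name FROM port"):
    row = ET.SubElement(root, "row")
    ET.SubElement(row, "id").text = str(rid)
    ET.SubElement(row, "name").text = name
xml_doc = ET.tostring(root, encoding="unicode")

# Target side: parse the XML and replicate the rows into the other DBMS.
dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE port (id INTEGER, name TEXT)")
for row in ET.fromstring(xml_doc).iter("row"):
    dst.execute("INSERT INTO port VALUES (?, ?)",
                (int(row.findtext("id")), row.findtext("name")))

print(dst.execute("SELECT * FROM port").fetchall())  # [(1, 'Busan'), (2, 'Incheon')]
```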

Big data-based information recommendation system (빅데이터 기반 정보 추천 시스템)

  • Lee, Jong-Chan;Lee, Moon-Ho
    • Journal of the Korea Institute of Information and Communication Engineering / v.22 no.3 / pp.443-450 / 2018
  • With improving quality of life, health care has become a main concern of modern people, and demand for healthcare systems is naturally increasing. However, it is difficult to provide customized wellness information to a specific user, because medical information on the Internet is diverse and its reliability is hard to judge. In this study, we propose a user-centered service that goes beyond simple search by classifying big data with text mining and providing personalized medical information. We built a big data system and measured data processing time while increasing the number of Hadoop slave nodes, confirming that the big data system is more efficient than the existing system. An illustrative scaling sketch follows.
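
Hadoop itself is heavyweight to demonstrate here, so the sketch below imitates the paper's scaling measurement with Python multiprocessing: a map-reduce word count is timed while the worker count (standing in for Hadoop slave nodes) grows. The corpus, sizes, and timings are synthetic.

```python
# Time a map-reduce word count at increasing worker counts.
import time
from collections import Counter
from multiprocessing import Pool

DOCS = ["diet exercise sleep", "sleep stress diet", "exercise heart health"] * 20000

def map_count(doc: str) -> Counter:
    return Counter(doc.split())             # map step: per-document term counts

if __name__ == "__main__":
    for workers in (1, 2, 4):               # analogous to adding slave nodes
        t0 = time.perf_counter()
        with Pool(workers) as pool:
            partials = pool.map(map_count, DOCS, chunksize=2000)
        total = sum(partials, Counter())    # reduce step: merge partial counts
        print(workers, "workers:", round(time.perf_counter() - t0, 3), "s")
```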

An Analysis for the Student's Needs of non-face-to-face based Software Lecture in General Education using Text Mining (텍스트 마이닝을 이용한 비대면 소프트웨어 교양과목의 요구사항 분석)

  • Jeong, Hwa-Young
    • The Journal of the Korea Contents Association / v.22 no.3 / pp.105-111 / 2022
  • Students' needs for online classes have mostly been gathered with multiple-choice surveys. Analyzing their exact needs, however, requires analysis of unstructured data: the answers to essay questions. Big data techniques are applied in many fields precisely because they make such unstructured-data analysis possible. This study investigates which subjects and topics students want in a general-education software course taught with non-face-to-face online methods. As the experimental method, we gave students an open-ended questionnaire and performed keyword analysis and association analysis on the unstructured answers. The results reveal the keywords students want for a software course, providing important data for planning and designing general-education software courses in the future. A minimal keyword and association sketch follows.
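
The sketch below reproduces the two analyses the abstract names, keyword frequency and pairwise association (support and confidence), on invented English stand-ins for students' open-ended answers.

```python
# Keyword analysis, then pairwise association over toy survey answers.
from collections import Counter
from itertools import combinations

answers = [
    "python data analysis", "web development python",
    "game development", "python machine learning", "data visualization",
]
tokens = [set(a.split()) for a in answers]   # one term set per answer

# Keyword analysis: in how many answers does each term appear?
freq = Counter(t for ans in tokens for t in ans)
print(freq.most_common(3))

# Association analysis: support(A,B) and confidence(A->B) over term pairs.
n = len(tokens)
pair_count = Counter(p for ans in tokens for p in combinations(sorted(ans), 2))
for (a, b), c in pair_count.most_common(3):
    support = c / n                          # share of answers with both terms
    confidence = c / freq[a]                 # P(B | A) over answers
    print(f"{a} -> {b}: support={support:.2f} confidence={confidence:.2f}")
```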

Keyword Analysis of Arboretums and Botanical Gardens Using Social Big Data

  • Shin, Hyun-Tak;Kim, Sang-Jun;Sung, Jung-Won
    • Journal of People, Plants, and Environment / v.23 no.2 / pp.233-243 / 2020
  • This study collects social big data produced over the past nine years and describes the patterns of major keywords for arboretums and botanical gardens, as basic data for establishing their future operational strategies. A total of 6,245,278 items were collected: 4,250,583 from blogs (68.1%), 1,843,677 from online cafes (29.5%), and 151,018 from knowledge search engines (2.4%); after refining, 1,223,162 valid items were selected for analysis. Using the big data program Textom, we derived keywords for arboretums and botanical gardens through text mining, identifying terms such as 'travel', 'picnic', 'children', 'festival', 'experience', 'Garden of Morning Calm', 'program', 'recreation forest', 'healing', and 'museum'; among these, 'healing', 'tree', 'experience', 'garden', and 'Garden of Morning Calm' drew the greatest public interest. We also ran a word cloud analysis on the high-frequency keywords extracted from the 6,245,278 social media titles. The results show that arboretums and botanical gardens are perceived as spaces for relaxation and leisure ('travel', 'picnic', 'recreation') and that interest in their educational role is high ('experience', 'field trip'). Since demand for rest and leisure space, education, and things to see and enjoy has grown compared with the past, future operational strategies need differentiation and specialization in plant collection, exhibition planning, and programs. A small word-cloud sketch follows.
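
Textom is a commercial Korean service, so the word-cloud step is sketched below with the third-party wordcloud and matplotlib packages (an assumption: `pip install wordcloud matplotlib`); the frequency table is invented, not the study's data.

```python
# Render a word cloud from a keyword-frequency table.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

freqs = {"travel": 120, "picnic": 95, "healing": 80, "experience": 74,
         "garden": 60, "festival": 41, "museum": 30}
cloud = WordCloud(width=400, height=200, background_color="white")
cloud.generate_from_frequencies(freqs)       # size each word by its frequency

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```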

Tourism Information Contents and Text Networking (Focused on Formal Website of Jeju and Chinese Personal Blogs) (온라인 관광정보의 내용 및 텍스트 네트워크 (제주 공식 웹사이트와 중국 개인블로그를 중심으로))

  • Zhang, Lin;Yun, Hee Jeong
    • The Journal of the Korea Contents Association / v.18 no.1 / pp.19-30 / 2018
  • The main purpose of this study is to analyze the content and text networks of online tourism information. Jeju Island, one of the representative tourist destinations in South Korea, is the study site. The study collects content from the official Jeju tourism website and from personal blogs on Sina Weibo, one of the most popular social network services in China, and analyzes the text with the ROST Content Mining System, a Chinese big data mining tool. The content analysis shows that the official Jeju website mainly uses nouns for natural, geographical, and physical resources, verbs asserting the existence of those resources, and adjectives about their beauty, cleanness, and convenience, whereas the personal blogs mainly use nouns about the Korean Wave, food, local products, other destinations, and shopping, verbs about activities and feelings in Jeju, and adjectives about the bloggers' own experiences and impressions. Finally, the text network analysis finds strong centrality and a dense network in the official website's information but only weak relationships in the personal blogs. These results may contribute to demand-based marketing strategies for tourist destinations. A small co-occurrence network sketch follows.
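
ROST CM is a GUI tool, so the sketch below approximates the text-network step with networkx: build a word co-occurrence graph per source and compare how densely connected the two sources are. The sentences are toy English stand-ins for the Korean and Chinese texts.

```python
# Word co-occurrence graphs and density comparison for two text sources.
from itertools import combinations
import networkx as nx

website = ["beach clean beautiful", "mountain trail beautiful", "beach mountain"]
blogs = ["shopping food", "kpop shopping", "food hotel"]

def cooccurrence_graph(sentences):
    g = nx.Graph()
    for s in sentences:
        for a, b in combinations(sorted(set(s.split())), 2):
            # Increment the edge weight for each sentence-level co-occurrence.
            w = g.edges[a, b]["weight"] + 1 if g.has_edge(a, b) else 1
            g.add_edge(a, b, weight=w)
    return g

for name, sents in [("website", website), ("blogs", blogs)]:
    g = cooccurrence_graph(sents)
    print(name, "density:", round(nx.density(g), 2))
```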

A Trend Analysis and Policy proposal for the Work Permit System through Text Mining: Focusing on Text Mining and Social Network analysis (텍스트마이닝을 통한 고용허가제 트렌드 분석과 정책 제안 : 텍스트마이닝과 소셜네트워크 분석을 중심으로)

  • Ha, Jae-Been;Lee, Do-Eun
    • Journal of Convergence for Information Technology / v.11 no.9 / pp.17-27 / 2021
  • The aim of this research was to identify the issues surrounding the work permit system and public awareness of it, and to suggest ideas for government policy. To that end, the research applied text mining to social data. Using Textom, we collected 1,453,272 texts from 6,217 online documents containing 'work permit system' posted between January and December 2020, then performed text mining and social network analysis. From top-level keyword frequency and degree centrality analyses we extracted the 100 most frequently mentioned keywords and grouped the major ones into job problems, the importance of the policy process, industrial competitiveness, and improvement of foreign workers' living conditions. Semantic network analysis further revealed the central theme of 'employment policy' along with peripheral themes such as 'international cooperation', 'workers' human rights', 'law', 'recruitment of foreigners', 'corporate competitiveness', 'immigrant culture', and 'foreign workforce management'. Finally, the research suggests ideas worth considering when establishing government policies on the work permit system and conducting related studies. A minimal degree-centrality sketch follows.
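
A minimal degree-centrality computation with networkx: rank terms by how many other terms they co-occur with. The keyword edges are a tiny invented sample, not the study's network.

```python
# Degree centrality over a toy keyword co-occurrence network.
import networkx as nx

edges = [("employment policy", "law"), ("employment policy", "human rights"),
         ("employment policy", "recruitment"), ("law", "human rights"),
         ("recruitment", "workforce management")]
g = nx.Graph(edges)

centrality = nx.degree_centrality(g)         # degree / (n - 1) per node
for term, c in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {c:.2f}")                # 'employment policy' ranks first
```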

Analyzing the Effect of Characteristics of Dictionary on the Accuracy of Document Classifiers (용어 사전의 특성이 문서 분류 정확도에 미치는 영향 연구)

  • Jung, Haegang;Kim, Namgyu
    • Management & Information Systems Review / v.37 no.4 / pp.41-62 / 2018
  • As the volume of unstructured data grows through social media, Internet news articles, and blogs, text analysis and the studies around it are gaining importance. Since text analysis is mostly performed on a specific domain or topic, constructing and applying a domain-specific dictionary has become correspondingly important. The quality of the dictionary directly affects the results of unstructured-data analysis, all the more because the dictionary embodies a perspective on the analysis. Most studies of text analysis emphasize the importance of dictionaries for obtaining clean, high-quality results, yet the effect of the dictionary has not been rigorously verified, even though it is known to be an essential factor. In this paper, we build three dictionaries in different ways from 39,800 news articles and, after defining the concept of the intrinsic rate, analyze and verify each dictionary's effect on document classification accuracy: 1) batch construction, building one dictionary from term frequencies over the entire document set; 2) extracting terms per category and integrating them; 3) extracting features per category and integrating them. To evaluate dictionary quality, we compared the accuracy of three artificial-neural-network-based document classifiers. The experiments show that accuracy tends to rise when the intrinsic rate is high, suggesting that classification accuracy can be improved by raising the dictionary's intrinsic rate. A toy sketch of the three construction schemes follows.
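
A toy illustration of the three dictionary-construction schemes the abstract enumerates. The documents are invented, the score used for "features" in scheme 3 is an assumed distinctiveness ratio, and the paper's intrinsic-rate formula is not reproduced because the abstract does not define it.

```python
# Three ways to build a term dictionary from categorized documents.
from collections import Counter

docs = {
    "ICT":     ["network data cloud", "cloud server data"],
    "medical": ["patient data clinic", "clinic drug patient"],
}
TOP = 3

# 1) Batch: top terms over the whole corpus, ignoring categories.
all_terms = Counter(t for ds in docs.values() for d in ds for t in d.split())
batch_dict = {t for t, _ in all_terms.most_common(TOP)}

# 2) Top terms per category, then union.
per_cat = {c: Counter(t for d in ds for t in d.split()) for c, ds in docs.items()}
term_dict = {t for cnt in per_cat.values() for t, _ in cnt.most_common(TOP)}

# 3) Distinctive features per category (frequent inside, rare outside), then union.
feat_dict = set()
for c, cnt in per_cat.items():
    others = Counter(t for o, ds in docs.items() if o != c
                     for d in ds for t in d.split())
    scores = {t: f / (1 + others[t]) for t, f in cnt.items()}
    feat_dict |= {t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:TOP]}

print(batch_dict, term_dict, feat_dict, sep="\n")
```

Scheme 3 tends to exclude terms like "data" that appear in every category, which is one way a dictionary's category-specificity, and plausibly its intrinsic rate, can rise.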