• Title/Summary/Keyword: 텍스트 기반 검색

Korean Abbreviation Generation using Sequence to Sequence Learning (Sequence-to-sequence 학습을 이용한 한국어 약어 생성)

  • Choi, Su Jeong;Park, Seong-Bae;Kim, Kweon-Yang
    • KIISE Transactions on Computing Practices
    • /
    • v.23 no.3
    • /
    • pp.183-187
    • /
    • 2017
  • Smartphone users prefer fast reading and texting, so they frequently use abbreviated forms of words and phrases. Abbreviations are now widely used, from chat terms to technical terms. Gathering abbreviations would therefore be helpful to many services, including information retrieval and recommendation systems. However, gathering abbreviations manually requires too much effort and cost, because new abbreviations are continuously generated whenever new material such as a TV program or a phenomenon appears. Abbreviations therefore need to be generated automatically. Existing methods for generating Korean abbreviations use a rule-based approach, which has two limitations: it cannot generate irregular abbreviations, and it must decide which of the candidate abbreviations generated by the rules is correct. To address these limitations, we propose a method of generating Korean abbreviations automatically using sequence-to-sequence learning. Sequence-to-sequence learning can generate irregular abbreviations and avoids the problem of choosing the correct abbreviation among candidates, so it is suitable for generating Korean abbreviations. We evaluate the proposed method on two types of datasets. The experimental results show that our method is effective for irregular abbreviations.
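
The candidate-selection problem mentioned in the abstract can be illustrated with a minimal sketch. The rule used here (pick one syllable from each word, in order) is an assumption chosen for illustration; the abstract does not state which rules the existing methods use, and `candidate_abbreviations` is a hypothetical name.

```python
from itertools import product
from typing import List


def candidate_abbreviations(phrase: str) -> List[str]:
    """Build candidates by taking one syllable from each word, in order.

    Assumes the input is NFC-normalized Hangul, so that each syllable
    is a single character.
    """
    words = phrase.split()
    return ["".join(choice) for choice in product(*words)]


# "무한 도전" is commonly shortened to "무도", but the rule alone
# cannot tell which of the four candidates is the one people use.
print(candidate_abbreviations("무한 도전"))
# ['무도', '무전', '한도', '한전']
```

Even for a two-word phrase the rule yields several candidates, and it can never produce an irregular abbreviation that uses characters outside the phrase. These are the two limitations the abstract attributes to rule-based approaches.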

Character-based Subtitle Generation by Learning of Multimodal Concept Hierarchy from Cartoon Videos (멀티모달 개념계층모델을 이용한 만화비디오 컨텐츠 학습을 통한 등장인물 기반 비디오 자막 생성)

  • Kim, Kyung-Min;Ha, Jung-Woo;Lee, Beom-Jin;Zhang, Byoung-Tak
    • Journal of KIISE
    • /
    • v.42 no.4
    • /
    • pp.451-458
    • /
    • 2015
  • Previous multimodal learning methods focus on problem-solving aspects, such as image and video search and tagging, rather than on knowledge acquisition via content modeling. In this paper, we propose the Multimodal Concept Hierarchy (MuCH), a content modeling method that uses a cartoon video dataset, together with a character-based subtitle generation method based on the learned model. The MuCH model has a multimodal hypernetwork layer, in which the patterns of the words and image patches are represented, and a concept layer, in which each concept variable is represented by a probability distribution over the words and the image patches. The model can learn the characteristics of the characters as concepts from the video subtitles and scene images by using a Bayesian learning method, and can also generate character-based subtitles from the learned model if text queries are provided. As an experiment, the MuCH model learned concepts from 'Pororo' cartoon videos totaling 268 minutes in length and generated character-based subtitles. Finally, we compare the results with those of other multimodal learning models. The experimental results indicate that, given the same text query, our model generates more accurate and more character-specific subtitles than other models.

A Study on the Operating Conditions of Lecture Contents in Contactless Online Classes for University Students (대학생 대상 비대면 온라인 수업에서의 강의 콘텐츠 운영 실태 연구)

  • Lee, Jongmoon
    • Journal of the Korean BIBLIA Society for library and Information Science
    • /
    • v.32 no.4
    • /
    • pp.5-24
    • /
    • 2021
  • The purpose of this study was to investigate and analyze the operating conditions of lecture contents in contactless online classes for university students. First, an analysis of the responses of 93 respondents showed that 93.3% of the respondents took real-time online lectures (47.7%) or recorded video lectures (45.6%). Second, an analysis of the contents used as textbooks showed that e-books (materials) and paper books (materials) were used together (36.6%), or e-books or electronic materials (36.6% and 37.6% respectively) were used in both liberal arts (47.3%) and major subjects (39.8%). In addition to textbooks, both major subjects and liberal arts made heavy use of web materials (47.6% and 40.5% respectively) and YouTube materials (33.3% and 48.0% respectively) as external materials. Third, both liberal arts and major subjects used 'electronic files in the form of PPT or text organized and written by instructors' (62.9% and 58.1% respectively), 'internet materials' (16.7% and 19% respectively) and 'paper books or materials' (10.4% and 12.3% respectively) to share lecture contents. With the lecture contents displayed on screen, 93.5% of the respondents were satisfied in major subjects, and 90.2% were satisfied in liberal arts. These results suggest developing multimedia-based lecture contents and an evaluation solution capable of real-time exam supervision, developing a task management system capable of AI-based plagiarism search, task guidance, and task evaluation, and institutionalizing a solution to the copyright problems of digitizing lecture materials so that lectures can be given in a ubiquitous environment.

Water leakage accident analysis of water supply networks using big data analysis technique (R기반 빅데이터 분석기법을 활용한 상수도시스템 누수사고 분석)

  • Hong, Sung-Jin;Yoo, Do-Guen
    • Journal of Korea Water Resources Association
    • /
    • v.55 no.spc1
    • /
    • pp.1261-1270
    • /
    • 2022
  • The purpose of this study is to collect and analyze information on water leaks, which is not easily accessible, by using news search results that people can easily access. We applied a web crawling technique to extract big data from news on water leakage accidents in the water supply system and presented a step-by-step algorithm for obtaining accurate leak accident news. In addition, a data analysis technique suitable for water leakage accident information was developed so that additional information, such as the date and time of occurrence, cause of occurrence, location of occurrence, damaged facilities, and damage effect, could be extracted. The primary goal of the value extraction through big data-based leak analysis proposed in this study is to extract meaningful value through comparison with existing waterworks statistics. The proposed method can also be used to respond effectively to consumers or to determine the service level of water supply networks. In other words, these analysis results suggest the need to inform the public more fully about accidents, and they can be used in preparing an information dissemination and response system that can react quickly in case of an accident.

Exploring the Trend of Korean Creative Dance by Analyzing Research Topics : Application of Text Mining (연구주제 분석을 통한 한국창작무용 경향 탐색 : 텍스트 마이닝의 적용)

  • Yoo, Ji-Young;Kim, Woo-Kyung
    • Journal of Korea Entertainment Industry Association
    • /
    • v.14 no.6
    • /
    • pp.53-60
    • /
    • 2020
  • The study is based on the assumption that trends in a phenomenon and trends in research on it are contextually consistent. The purpose of this study is therefore to explore the trend of dance through a subject analysis of Korean creative dance studies using text mining. To this end, 1,291 words were analyzed from 616 journal article titles retrieved from a paper search website. The collection, refining, and analysis of the data were all performed with R 3.6.0. First, keywords representing the times were frequently used before the 2000s, but types of Korean creative dance research related to education and physical training were also found. Second, the frequency of keywords related to dance troupe performances was high after the 2000s, but it was confirmed that Choi Seung-hee still held an important position in the study of Korean creative dance. Third, an analysis of the overall research subjects of Korean creative dance studies showed that research on 'Art of Choi Seung-hee in the modern era' accounted for the highest proportion. Fourth, the Hot Topics, which have been rising since 2000, were 'the performance activities of the National Dance Company' and 'the choreography expression and utilization of traditional dance'. Moreover, since the recent performances of the National Dance Company advocate 'modernization based on tradition', it was confirmed that the trend of Korean creative dance since the 2000s has focused on the use of traditional dance motifs. Fifth, the Cold Topic, which has been falling since 2000, was the study of 'dancing expressions by age'. Interest in this research was judged to have decreased because of the tendency to mix various dance styles after the genre of Korean creative dance was established.
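
The before/after-2000 keyword comparison described above can be sketched as a simple frequency count. The study itself used R 3.6.0; this sketch swaps in Python, and the function name, the whitespace tokenization, the sample titles, and the choice to place the year 2000 in the later period are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def keyword_counts_by_period(
    records: Iterable[Tuple[int, str]], split_year: int = 2000
) -> Dict[str, Counter]:
    """Count title words separately for years before and from split_year on."""
    counts = {"before": Counter(), "after": Counter()}
    for year, title in records:
        period = "before" if year < split_year else "after"
        # Naive whitespace tokenization; a real study would use a
        # morphological analyzer and a stopword list.
        counts[period].update(title.lower().split())
    return counts


# Hypothetical titles, not taken from the study's data.
sample = [
    (1995, "modern dance education"),
    (2005, "traditional dance performance"),
    (2010, "dance performance"),
]
counts = keyword_counts_by_period(sample)
print(counts["after"].most_common(2))
```

Comparing the two counters shows which keywords rise or fall across the split, which is the kind of evidence behind the Hot Topic and Cold Topic labels in the abstract.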

Semantic Dependency Link Topic Model for Biomedical Acronym Disambiguation (의미적 의존 링크 토픽 모델을 이용한 생물학 약어 중의성 해소)

  • Kim, Seonho;Yoon, Juntae;Seo, Jungyun
    • Journal of KIISE
    • /
    • v.41 no.9
    • /
    • pp.652-665
    • /
    • 2014
  • Many important terms in biomedical text are expressed as abbreviations or acronyms. We propose a semantic link topic model, based on the concepts of topic and dependency link, to disambiguate biomedical abbreviations and to cluster long-form variants of abbreviations that refer to the same sense. This model is a generative model inspired by the latent Dirichlet allocation (LDA) topic model, in which each document is viewed as a mixture of topics, with each topic characterized by a distribution over words. Thus, the words of a document are generated from the hidden topic structure of the document, and the topic structure is inferred from the observable word sequences of document collections. In this study, we allow two distinct modes of word generation in order to incorporate semantic dependencies between words, particularly between expansions (long forms) of abbreviations and the words that co-occur with them in a sentence. Besides topic information, the semantic dependency between words is defined as a link, and a new random parameter for the presence of a link is assigned to each word. As a result, the most probable expansions for the abbreviations in a given abstract are decided by the word-topic distribution, document-topic distribution, and word-link distribution estimated from the document collection through the semantic dependency link topic model. The abstracts retrieved from the MEDLINE Entrez interface by queries on 22 abbreviations and their 186 expansions were used as a data set. The link topic model correctly predicted expansions of abbreviations with an accuracy of 98.30%.

Implementation and Performance Analysis of the Group Communication Using CORBA-ORB, JAVA-RMI and Socket (CORBA-ORB, JAVA-RMI, 소켓을 이용한 그룹 통신의 구현 및 성능 분석)

  • 한윤기;구용완
    • Journal of Internet Computing and Services
    • /
    • v.3 no.1
    • /
    • pp.81-90
    • /
    • 2002
  • Large-scale distributed applications based on the Internet and client/server applications have to deal with a series of problems, such as load balancing, unpredictable communication delays, and networking failures. Sophisticated applications such as teleconferencing, video-on-demand, and concurrent software engineering therefore require abstracted group communication, but CORBA does not address these paradigms adequately. It mainly deals with point-to-point communication and does not support the development of reliable applications with predictable behavior in distributed systems. In this paper, we present the design, implementation, and performance analysis of group communication using CORBA-ORB, JAVA-RMI, and sockets in a distributed computing environment. The performance analysis measures latency time as the number of objects increases: the average is 14.5172 msec for group communication using the ORB of CORBA, 21.4085 msec for group communication using the RMI of Java, and 18.0714 msec for group communication using sockets. Group communication using multicast and UDP is measured at 0.2735 msec and 0.2157 msec, respectively. The results indicate that the performance of CORBA-ORB group communication improves as the number of objects increases. This study can be applied to fault-tolerant client/server systems, groupware, text retrieval systems, and financial information systems.

Reinforcement Method for Automated Text Classification using Post-processing and Training with Definition Criteria (학습방법개선과 후처리 분석을 이용한 자동문서분류의 성능향상 방법)

  • Choi, Yun-Jeong;Park, Seung-Soo
    • The KIPS Transactions:PartB
    • /
    • v.12B no.7 s.103
    • /
    • pp.811-822
    • /
    • 2005
  • Automated text categorization classifies free-text documents into predefined categories automatically, and its main goal is to reduce the considerable manual effort the task requires. Recent research on improving text categorization performance (efficiency) has focused on enhancing existing classification models and algorithms themselves, but its scope has been limited by feature-based statistical methodology. In this paper, we propose the RTPost system, which differs in style from any traditional method and takes a fault-tolerant system approach and a data mining strategy. The two important parts of the RTPost system are reinforcement training and post-processing. First, the training method deals with the problem of defining the categories to be classified before selecting training sample documents. Second, the post-processing method deals with the problem of assigning categories, not with the performance of classification algorithms. In experiments, we applied our system to documents with low classification accuracy that lie near a decision boundary. Through the experiments, we show that our system has high accuracy and stability under actual conditions. It did not depend at all on variables that strongly influence classification power, such as the number of training documents, the selection problem, and the performance of classification algorithms. In addition, we can expect a self-learning effect that decreases the training cost and increases the training power by exploiting the advantages of active learning.

The Method for Real-time Complex Event Detection of Unstructured Big data (비정형 빅데이터의 실시간 복합 이벤트 탐지를 위한 기법)

  • Lee, Jun Heui;Baek, Sung Ha;Lee, Soon Jo;Bae, Hae Young
    • Spatial Information Research
    • /
    • v.20 no.5
    • /
    • pp.99-109
    • /
    • 2012
  • Recently, due to the growth of social media and the spread of smartphones, the amount of data has increased considerably through heavy use of SNS (Social Network Service). As a result, the Big Data concept has emerged, and many researchers are seeking solutions to make the best use of big data. To maximize the creative value of the big data held by many companies, it must be combined with existing data. The physical and theoretical storage structures of data sources are so different that a system which can integrate and manage them is needed. To process big data, MapReduce was developed as a system that processes data quickly through distributed processing. However, it is difficult to construct and store a system for all keywords, and the storage and search steps make real-time processing somewhat difficult. Processing complex events without a structure for handling heterogeneous data also incurs extra cost. To solve this problem, an existing Complex Event Processing System can be used. A complex event processing system takes data from different sources and combines them, which makes complex event processing possible and is especially useful for real-time processing of stream data. Nevertheless, unstructured text-based data from SNS and internet articles is managed as a text type, so strings must be compared every time a query is processed, which results in poor performance. We therefore aim to manage unstructured data and process queries quickly in a complex event processing system. We extend the data complex function to give strings a theoretical schema. This is done by converting string keywords into an integer type through filtering with a keyword set. In addition, by using the Complex Event Processing System and processing stream data in memory in real time, we try to reduce the time spent reading data back from disk for query processing after it has been stored.
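
The keyword-to-integer conversion described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the keyword set, the whitespace tokenization, and the names `KEYWORD_IDS`, `encode_event`, and `matches` are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical keyword set; the abstract does not list the actual keywords.
KEYWORDS = ["earthquake", "flood", "fire", "traffic"]
KEYWORD_IDS: Dict[str, int] = {word: idx for idx, word in enumerate(KEYWORDS)}


def encode_event(text: str) -> List[int]:
    """Filter a text event by the keyword set and map each hit to its integer ID."""
    tokens = text.lower().split()
    return [KEYWORD_IDS[t] for t in tokens if t in KEYWORD_IDS]


def matches(encoded: List[int], query_ids: Set[int]) -> bool:
    """Check a query with integer membership tests instead of string comparison."""
    return query_ids.issubset(encoded)


event = encode_event("Fire and flood near the station")
print(event)                                                   # [2, 1]
print(matches(event, {KEYWORD_IDS["fire"], KEYWORD_IDS["flood"]}))  # True
```

Strings are compared only once, when the event enters the stream; every later query works on small integers, which is the performance argument made in the abstract.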

Analysis of Research Trends of 'Word of Mouth (WoM)' through Main Path and Word Co-occurrence Network (주경로 분석과 연관어 네트워크 분석을 통한 '구전(WoM)' 관련 연구동향 분석)

  • Shin, Hyunbo;Kim, Hea-Jin
    • Journal of Intelligence and Information Systems
    • /
    • v.25 no.3
    • /
    • pp.179-200
    • /
    • 2019
  • Word-of-mouth (WoM) is defined as consumer activities that share information concerning consumption. WoM activities have long been recognized as important in corporate marketing processes and have received much attention, especially in the marketing field. Recently, with the development of the Internet, the ways in which people exchange information through online news and online communities have expanded, and WoM has diversified in terms of word of mouth, score, rating, and liking. Social media gives online users easy access to information, and online WoM is considered a key source of information. Although this phenomenon has prompted various studies on WoM, there is no meta-analysis that comprehensively analyzes them. This study proposes a method that applies text mining techniques to scholarly big data to extract major studies and grasp their main issues, in order to find the trend of WoM research. To this end, a total of 4389 documents published from 1941 to 2018 were collected with the keyword 'Word-of-mouth' from Scopus (www.scopus.com), a citation database, and the data were refined through preprocessing such as English morphological analysis, stopword removal, and noun extraction. We adopted main path analysis (MPA) and word co-occurrence network analysis. MPA detects key studies, is used to track the development trajectory of an academic field, and presents the research trend from a macro perspective. For this, we constructed a citation network based on the collected data, in which a node represents a document and a link represents a citation relation. We then detected the key-route main path by applying SPC (Search Path Count) weights. As a result, a main path composed of 30 documents was extracted from the citation network. The main path confirmed how the academic area developed along with the changes of the times, reflecting industrial changes such as the emergence of various industrial groups. The results of MPA revealed that WoM research can be divided into five periods: (1) establishment of aspects and critical elements of WoM, (2) relationship analysis between WoM variables, (3) beginning of research on online WoM, (4) relationship analysis between WoM and purchase, and (5) broadening of topics. Changes within the industry, such as online development and social media, were reflected in the results. Very recent studies showed that the topics and approaches related to WoM were diversifying in response to circumstantial changes. However, the results showed that even though WoM was used in diverse fields, the mainstream of WoM research from start to end was related to marketing and to identifying the influential factors that proliferate WoM. Word co-occurrence network analysis presents the research trend from a microscopic point of view. A word co-occurrence network was constructed to analyze the relationships between keywords, and social network analysis (SNA) was applied. We divided the data into three periods to investigate the periodic changes and trends in the discussion of WoM. SNA showed that Period 1 (1941~2008) consisted of clusters regarding relationship, source, and consumers. Period 2 (2009~2013) contained clusters of satisfaction, community, social networks, review, and internet. The clusters of Period 3 (2014~2018) involved satisfaction, medium, review, and interview. The periodic changes of the clusters showed a transition from offline to online WoM, and the media of WoM have become an important factor in spreading the word. This study conducted a quantitative meta-analysis based on scholarly big data regarding WoM. The main contribution of this study is that it provides a micro perspective on the research trend of WoM as well as a macro perspective. The limitation of this study is that the citation network constructed for MPA is based only on the direct citation relations among the collected documents.
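
The SPC (Search Path Count) weights mentioned in the abstract can be sketched as follows. The SPC of an edge is the number of source-to-sink paths that pass through it. This sketch assumes an acyclic citation network with edges pointing from the cited document to the citing one and no duplicate edges; the function name and the toy network are hypothetical, and the key-route search that follows the weighting step is not shown.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def search_path_count(edges: List[Edge]) -> Dict[Edge, int]:
    """Return, for each edge, the number of source-to-sink paths through it."""
    succ, pred, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))

    # Topological order via Kahn's algorithm.
    indegree = {n: len(pred[n]) for n in nodes}
    queue = [n for n in nodes if indegree[n] == 0]
    order = []
    while queue:
        n = queue.pop()
        order.append(n)
        for m in succ[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    if len(order) != len(nodes):
        raise ValueError("citation network must be acyclic")

    # Paths reaching each node from any source, and leaving it to any sink.
    from_sources, to_sinks = {}, {}
    for n in order:
        from_sources[n] = sum(from_sources[p] for p in pred[n]) if pred[n] else 1
    for n in reversed(order):
        to_sinks[n] = sum(to_sinks[s] for s in succ[n]) if succ[n] else 1

    return {(u, v): from_sources[u] * to_sinks[v] for u, v in edges}


# Toy network: two paths A-B-D-E and A-C-D-E share the edge D-E.
toy = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
weights = search_path_count(toy)
print(weights[("D", "E")])  # 2
print(weights[("A", "B")])  # 1
```

Edges with high SPC lie on many citation paths, so following them traces the documents through which most of the field's knowledge flow passes.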