• Title/Summary/Keyword: 데이터 선별 (data selection)

One-Class Classification Model Based on Lexical Information and Syntactic Patterns (어휘 정보와 구문 패턴에 기반한 단일 클래스 분류 모델)

  • Lee, Hyeon-gu; Choi, Maengsik; Kim, Harksoo
    • Journal of KIISE / v.42 no.6 / pp.817-822 / 2015
  • Relation extraction is an important information extraction technique that can be widely used in areas such as question answering and knowledge base population. Previous studies on relation extraction have been based on supervised machine learning models that need a large amount of training data manually annotated with relation categories. Recently, distant supervision methods have been proposed to reduce the manual annotation effort of constructing training data. However, these methods suffer from a drawback: they are difficult to use for collecting the negative training data needed to resolve classification problems. To overcome this drawback, we propose a one-class classification model that can be trained without negative data. The proposed model determines whether an input data item belongs to the inner category by using a similarity measure based on lexical information and syntactic patterns in a vector space. In our experiments, the proposed model showed higher performance (an F1-score of 0.6509 and an accuracy of 0.6833) than a representative one-class classification model, the one-class SVM (Support Vector Machine).
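
As a rough illustration of the inner-category decision described above, the sketch below scores a new sentence against the centroid of positive-only training vectors using cosine similarity. The TF-IDF features, the toy sentences, and the 0.3 threshold are illustrative assumptions, not details from the paper.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Positive-only training sentences (hypothetical examples).
positive_sentences = [
    "Barack Obama was born in Honolulu .",
    "Steve Jobs was born in San Francisco .",
]

vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(positive_sentences)
# Centroid of the inner (positive) category in the vector space.
centroid = np.asarray(train_vectors.mean(axis=0))

def in_category(sentence: str, threshold: float = 0.3) -> bool:
    """Accept the sentence if it is similar enough to the inner category."""
    vec = vectorizer.transform([sentence])
    sim = cosine_similarity(vec, centroid)[0, 0]
    return sim >= threshold

print(in_category("Albert Einstein was born in Ulm ."))
```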

Statistical Approach to Sentiment Classification using MapReduce (맵리듀스를 이용한 통계적 접근의 감성 분류)

  • Kang, Mun-Su; Baek, Seung-Hee; Choi, Young-Sik
    • Science of Emotion and Sensibility / v.15 no.4 / pp.425-440 / 2012
  • As the scale of the internet grows, the amount of subjective data increases, and a need arises to classify such data automatically. Sentiment classification categorizes subjective data by sentiment type. Previous sentiment classification research has focused on NLP (Natural Language Processing) and sentiment word dictionaries. This prior work has two critical problems. First, the performance of morpheme analysis in NLP has fallen short of expectations. Second, it is not easy to choose sentiment words or to determine how much sentiment a word carries. To solve these problems, this paper proposes combining web-scale data with a statistical approach to sentiment classification. The proposed method uses word statistics drawn from web-scale data rather than trying to determine the meaning of each word. Unlike previous research that depended on NLP algorithms, this approach focuses on the data itself. Hadoop and MapReduce are used to handle the web-scale data.
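
The map/reduce flow described above can be sketched in plain Python. The toy reviews, the (word, label) key scheme, and the in-process shuffle are assumptions for illustration and stand in for an actual Hadoop job.

```python
from itertools import groupby
from operator import itemgetter

# Labeled reviews: a toy stand-in for the web-scale input corpus.
reviews = [("positive", "great movie great acting"),
           ("negative", "boring plot terrible acting")]

def mapper(label, text):
    # Emit ((word, label), 1) for every word occurrence, as a Hadoop mapper would.
    for word in text.split():
        yield (word, label), 1

def reducer(key, counts):
    # Sum the occurrence counts for each (word, label) key.
    return key, sum(counts)

# Simulate Hadoop's shuffle/sort phase between map and reduce.
mapped = sorted((kv for label, text in reviews for kv in mapper(label, text)),
                key=itemgetter(0))
word_stats = dict(reducer(key, (count for _, count in group))
                  for key, group in groupby(mapped, key=itemgetter(0)))

# A word's sentiment is estimated from label statistics, not from its meaning.
print(word_stats)
```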

A method for contents management using extended metadata in CDN (CDN에서 확장된 메타데이터를 이용한 콘텐츠 관리 방법)

  • Lim, Jung-Eun; Choi, O-Hoon; Na, Hong-Seok; Baik, Doo-Kwon
    • Journal of Digital Contents Society / v.9 no.4 / pp.725-733 / 2008
  • A CDN (Content Delivery Network) is a content transmission network for delivering high-capacity content quickly and stably. The main goals of a CDN are the efficient distribution and management of such content. Current CDNs distribute content by managing it based on basic metadata created by the content provider. However, existing CDN management systems provide no way to attach to the content itself the additional metadata needed for efficient content management and distribution. Because the existing system cannot annotate content with such additional metadata, it cannot retrieve the content a user wants. This paper proposes a method for applying extended metadata in an existing CDN and implements it as a contents metadata management system (CMMS). A user can search for needed content effectively via the CMMS, and the search results help in selecting and managing the content to be distributed in the CDN.
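
A minimal sketch of the idea of attaching searchable extended metadata to content records follows. The field names and the search function are hypothetical illustrations, not the paper's CMMS design.

```python
from dataclasses import dataclass, field

@dataclass
class ContentMetadata:
    # Basic metadata supplied by the content provider.
    content_id: str
    title: str
    size_mb: float
    # Extended metadata about the content itself (illustrative fields).
    extended: dict = field(default_factory=dict)

catalog = [
    ContentMetadata("c001", "News Clip", 120.0,
                    {"genre": "news", "region": "Busan"}),
    ContentMetadata("c002", "Drama Ep.1", 700.0,
                    {"genre": "drama", "region": "Seoul"}),
]

def search(catalog, **criteria):
    """Return contents whose extended metadata matches all given criteria."""
    return [c for c in catalog
            if all(c.extended.get(k) == v for k, v in criteria.items())]

print([c.title for c in search(catalog, genre="drama")])
```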

Creation and clustering of proximity data for text data analysis (텍스트 데이터 분석을 위한 근접성 데이터의 생성과 군집화)

  • Jung, Min-Ji; Shin, Sang Min; Choi, Yong-Seok
    • The Korean Journal of Applied Statistics / v.32 no.3 / pp.451-462 / 2019
  • A document-term frequency matrix is a type of data used in text mining, often built from the various documents associated with the objects to be analyzed. When analyzing objects with this matrix, researchers generally select as keywords only the terms common to the documents belonging to one object, and the keywords are then used to analyze the object. However, this method misses the information unique to individual documents and removes potential keywords that occur frequently in only a specific document. In this study, we define data that can overcome this problem as proximity data. We introduce twelve methods for generating proximity data and cluster the objects with two clustering methods, multidimensional scaling and k-means cluster analysis. Finally, we identify the generation method best suited to clustering the objects.
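
A small sketch of the pipeline described above, using one of many possible proximity definitions (cosine distance) together with scikit-learn's MDS and k-means. The toy matrix and parameter choices are illustrative assumptions, not the paper's twelve generation methods.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS
from sklearn.cluster import KMeans

# Toy document-term frequency matrix: rows are objects, columns are terms.
dtm = np.array([[5, 0, 1, 0],
                [4, 1, 0, 0],
                [0, 3, 0, 4],
                [1, 4, 0, 3]])

# One possible way to generate proximity data: pairwise cosine distances.
proximity = squareform(pdist(dtm, metric="cosine"))

# Embed the proximity data in 2-D with multidimensional scaling, then cluster.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(proximity)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coords)
print(labels)
```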

A Study on Duplication Verification of Public Library Catalog Data: Focusing on the Case of G Library in Busan (공공도서관 목록데이터의 중복검증에 관한 연구 - 부산 지역 G도서관 사례를 중심으로 -)

  • Min-geon Song; Soo-Sang Lee
    • Journal of Korean Library and Information Science Society / v.55 no.1 / pp.1-26 / 2024
  • The purpose of this study is to derive an integration plan for bibliographic records by applying a duplicate verification algorithm to the item-based catalog of a public library. To this end, G Library, which recently opened in Busan, was selected. After collecting OPAC data from G Library through web crawling, multipart monographs of Korean literature (KDC 800) were selected and the KERIS duplicate verification algorithm was applied. After two rounds of data correction based on the verification results, the duplicate verification rate increased by a total of 2.74%, from 95.53% to 98.27%. Even after data correction, 24 books judged to be similar or inconsistent were identified as records of other published editions, such as revised editions or hardcover copies, that had received separate ISBNs. This confirmed that the duplicate verification rate can be improved through catalog data correction and that the KERIS duplicate verification algorithm can serve as a tool for converting duplicate item-based records from public libraries into manifestation-based records.
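
The sketch below shows the general shape of field-weighted duplicate scoring on bibliographic records. The fields, weights, and threshold are illustrative assumptions, not the actual KERIS algorithm.

```python
from difflib import SequenceMatcher

# Weighted bibliographic fields (illustrative; not the actual KERIS weights).
FIELD_WEIGHTS = {"title": 0.4, "author": 0.2, "publisher": 0.2, "year": 0.2}

def field_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def is_duplicate(rec1: dict, rec2: dict, threshold: float = 0.9) -> bool:
    """Score two catalog records field by field and compare to a threshold."""
    score = sum(w * field_similarity(rec1.get(f, ""), rec2.get(f, ""))
                for f, w in FIELD_WEIGHTS.items())
    return score >= threshold

r1 = {"title": "한국문학전집 1", "author": "홍길동", "publisher": "부산출판", "year": "2020"}
r2 = {"title": "한국문학전집 1", "author": "홍길동", "publisher": "부산출판", "year": "2021"}
print(is_duplicate(r1, r2))
```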

An Intelligent System of Marker Gene Selection for Classification of Cancers using Microarray Data (마이크로어레이 데이터를 이용한 암 분류 표지 유전자 선별 시스템)

  • Park, Su-Young; Jung, Chai-Yeoung
    • Journal of the Korea Institute of Information and Communication Engineering / v.14 no.10 / pp.2365-2370 / 2010
  • Microarray-based cancer classification can achieve accurate results by statistically finding gene expression patterns that differ according to cancer type. Selecting informative genes closely related to a particular cancer is therefore essential for classifying cancers effectively with current microarray technology. In this paper, we present a system that detects the marker genes likely to be most differentially expressed, using ovarian cancer microarray data, and we compare the classification performance of the proposed system with that of an established microarray system, using a multilayer perceptron neural network. The microarray data set containing marker genes selected by the ANOVA method achieved the highest classification accuracy, 98.61%, showing that the proposed system improves classification performance over the established system.
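
A compact sketch of the described pipeline, selecting genes with an ANOVA F-test and classifying with a multilayer perceptron. The synthetic data and parameter values are stand-ins, since the paper's ovarian cancer data set is not available here.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a microarray data set: 100 samples x 500 genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))
y = rng.integers(0, 2, size=100)
X[y == 1, :10] += 1.5  # make the first 10 genes informative markers

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Select marker genes by ANOVA F-test, then classify with an MLP.
model = make_pipeline(
    SelectKBest(f_classif, k=10),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
)
model.fit(X_tr, y_tr)
print("accuracy:", model.score(X_te, y_te))
```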

Customer's Pattern Analysis System using Intelligent Weblog Server (지능형 웹로그 서버를 이용한 전자상거래 사용자 패턴 수집 시스템)

  • Han, Ji-Seon; Kang, Mi-Jung; Cho, Dong-Sub
    • Proceedings of the KIEE Conference / 2000.11d / pp.836-838 / 2000
  • To provide personalized shopping mall services in e-commerce, consumers' purchase patterns must be analyzed, which requires collecting user behavior pattern information on the website. In this paper, we design and implement a user pattern collection system by adding functionality to the shopping mall server and defining an intelligent weblog server. The shopping mall server is extended to include user behavior information in its logs and transmit them to the intelligent weblog server. The intelligent weblog server analyzes the log data received from the shopping mall server and stores it in a database, using ADO technology running on an OLE DB Provider as the storage technology. The stored database can be controlled remotely at the recordset level, and the necessary data is selected from it and stored as an XML DB. This user pattern collection system has the advantages of fast database access, access to both relational and non-relational databases, and reduced server load for remote database access.
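
The log-selection step can be sketched as follows. Here sqlite3 stands in for the paper's Windows-specific ADO/OLE DB storage, and the pipe-delimited log format is a hypothetical example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical extended log lines: user id, action, item, timestamp.
log_lines = [
    "u01|view|item42|2000-11-01T10:00",
    "u01|buy|item42|2000-11-01T10:05",
]

# Store parsed log records in a database (sqlite3 stands in for ADO/OLE DB).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE weblog (user TEXT, action TEXT, item TEXT, ts TEXT)")
db.executemany("INSERT INTO weblog VALUES (?, ?, ?, ?)",
               (line.split("|") for line in log_lines))

# Select the records of interest and export them as XML.
root = ET.Element("purchases")
for user, action, item, ts in db.execute(
        "SELECT * FROM weblog WHERE action = 'buy'"):
    ET.SubElement(root, "record", user=user, item=item, ts=ts)
print(ET.tostring(root, encoding="unicode"))
```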

Application for Personalized Advertisement (Personalized Advertisement 어플리케이션 개발)

  • Park Sung-Soo; Jung Moon-Ryul
    • Proceedings of the Korean Society of Broadcast Engineers Conference / 2004.11a / pp.137-141 / 2004
  • This paper describes an application that turns PPL (Product Placement) indirect advertising in digital broadcast content (dramas, movies, talk shows) into personalized, targeted advertising. The application provides advertising optimized for individual tastes and, through interaction between the broadcaster and the viewer, lets the viewer move to a channel where e-commerce is possible. Before the content starts, the viewer selects preferred products; among the PPL advertisements in the content, only the selected products appear during the broadcast, and the viewer can move to a DAL (Dedicated Advertisers Location) channel that offers detailed information on, and purchase of, the selected products. From the viewer's perspective, this enables efficient, active viewing in which a personalized broadcasting service shows only the selected advertisements the viewer wants; from the broadcaster's perspective, the customized service enables effective marketing aimed at well-defined target consumers. Advertised products can also be placed in a shopping cart, a kind of bookmark, and the system is designed and implemented so that the viewer can move to an advertised product's T-Commerce channel at any time. This is a new model of interactive advertising that demonstrates the characteristics of data broadcasting, namely personalized, customized broadcasting with two-way interaction. The application (Xlet) runs on MHP middleware, the Korean satellite data broadcasting standard, and is implemented with the JavaTV API and the HAVi & DAVIC APIs for data broadcasting.
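
The actual application is an MHP Xlet written in Java; for consistency with the other examples, here is a language-neutral Python sketch of the selection logic only, with illustrative scene and product names that are not from the paper.

```python
# PPL products placed in the content, keyed by scene (illustrative data).
ppl_by_scene = {
    "scene1": ["watch", "phone"],
    "scene2": ["car", "phone"],
}

# Items the viewer picked before the content started.
viewer_selection = {"phone"}
cart = []  # the bookmark-style shopping cart

def ads_for_scene(scene: str):
    """Show only the PPL products the viewer selected for this scene."""
    shown = [p for p in ppl_by_scene.get(scene, []) if p in viewer_selection]
    cart.extend(p for p in shown if p not in cart)  # remember for T-Commerce
    return shown

for scene in ppl_by_scene:
    print(scene, "->", ads_for_scene(scene))
print("cart:", cart)
```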

Control Technique of Modem Output Level to improve Frequency Response Equalization of Satellite TX Terminals (위성 단말 송신부의 주파수 응답 평탄도를 향상시키기 위한 모뎀 출력 조절 방법)

  • Cho, Tae-Chong
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.19 no.5 / pp.129-133 / 2019
  • Frequency resource efficiency is important in satellite communication systems, and one cause of wasted frequency resources is poor flatness. The flatness of satellite Tx terminals is further degraded by adjacent-channel interference (ACI) and guard bands. To overcome this problem, this paper proposes a technique for frequency response equalization in satellite Tx terminals. First, a linear polynomial that fits representative measurement data in the least-squares sense is calculated to interpolate the unmeasured data; the flatness is then adjusted using this polynomial. Simulation results show that the adjusted data have a lower peak-to-peak variation and standard deviation than the original data, demonstrating that the flatness is improved.
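
The least-squares step described above can be sketched with NumPy's polyfit. The measurement values are illustrative, and subtracting the fitted tilt stands in for the paper's modem output-level adjustment.

```python
import numpy as np

# Representative frequency-response measurements (illustrative values):
# frequency offsets in MHz and measured gain in dB.
freq = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])
gain = np.array([-1.2, -0.5, 0.1, 0.6, 1.1])

# Fit a first-order polynomial to the measurements in the least-squares sense.
slope, intercept = np.polyfit(freq, gain, deg=1)

# Interpolate unmeasured frequencies with the fitted polynomial.
grid = np.linspace(freq.min(), freq.max(), 9)
print("interpolated:", np.polyval([slope, intercept], grid))

# Flatten the response by subtracting the fitted tilt at the measured points.
adjusted = gain - np.polyval([slope, intercept], freq)
print("peak-to-peak before/after:", np.ptp(gain), np.ptp(adjusted))
print("std before/after:", gain.std(), adjusted.std())
```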

Acquiring Credential and Analyzing Artifacts of Wire Messenger on Windows (Windows에서의 Wire 크리덴셜 획득 및 아티팩트 분석)

  • Shin, Sumin; Kim, Soram; Youn, Byungchul; Kim, Jongsung
    • Journal of the Korea Institute of Information Security & Cryptology / v.31 no.1 / pp.61-71 / 2021
  • Instant messengers are a means of communication for modern people and can be used on smartphones and PCs independently or linked to each other. Messengers, which provide functions such as messaging, calls, and file sharing, contain user behavior information regarded as important evidence in forensic investigations. Smartphone data, however, are difficult to acquire and analyze because of the security of smartphones and apps; when a messenger is also used on a PC, its data can instead be extracted from the PC. In this paper, we obtain the credential data of the Wire messenger on Windows 10 and show that it is possible to log in from another PC without authentication. In addition, we identify and classify the major artifacts generated by user behavior.
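
As a rough illustration of artifact collection, the sketch below scans a Windows user profile for messenger data files. The %APPDATA%\Wire location and the file extensions are assumptions for illustration, not findings from the paper.

```python
from pathlib import Path

# Where a desktop messenger's local data typically lives on Windows 10.
# NOTE: this path is an assumption; the paper identifies Wire's actual paths.
appdata = Path.home() / "AppData" / "Roaming"
candidates = [appdata / "Wire"]

# Collect database and log-style files that may hold credentials or artifacts.
for base in candidates:
    if not base.exists():
        continue
    for f in base.rglob("*"):
        if f.suffix in {".db", ".sqlite", ".ldb", ".log", ".json"}:
            print(f, f.stat().st_size, "bytes")
```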