• Title/Summary/Keyword: 대용량 분류

Search Result 243, Processing Time 0.029 seconds

Feature Extraction to Detect Hoax Articles (낚시성 인터넷 신문기사 검출을 위한 특징 추출)

  • Heo, Seong-Wan;Sohn, Kyung-Ah
    • Journal of KIISE
    • /
    • v.43 no.11
    • /
    • pp.1210-1215
    • /
    • 2016
  • Readership of online newspapers has grown with the proliferation of smart devices. However, fierce competition between Internet newspaper companies has resulted in a large increase in the number of hoax articles. Hoax articles are those where the title does not convey the content of the main story, and this gives readers the wrong information about the contents. We note that the hoax articles have certain characteristics, such as unnecessary celebrity quotations, mismatch in the title and content, or incomplete sentences. Based on these, we extract and validate features to identify hoax articles. We build a large-scale training dataset by analyzing text keywords in replies to articles and thus extracted five effective features. We evaluate the performance of the support vector machine classifier on the extracted features, and a 92% accuracy is observed in our validation set. In addition, we also present a selective bigram model to measure the consistency between the title and content, which can be effectively used to analyze short texts in general.

A Robust Pattern-based Feature Extraction Method for Sentiment Categorization of Korean Customer Reviews (강건한 한국어 상품평의 감정 분류를 위한 패턴 기반 자질 추출 방법)

  • Shin, Jun-Soo;Kim, Hark-Soo
    • Journal of KIISE:Software and Applications
    • /
    • v.37 no.12
    • /
    • pp.946-950
    • /
    • 2010
  • Many sentiment categorization systems based on machine learning methods use morphological analyzers in order to extract linguistic features from sentences. However, the morphological analyzers do not generally perform well in a customer review domain because online customer reviews include many spacing errors and spelling errors. These low performances of the underlying systems lead to performance decreases of the sentiment categorization systems. To resolve this problem, we propose a feature extraction method based on simple longest matching of Eojeol (a Korean spacing unit) and phoneme patterns. The two kinds of patterns are automatically constructed from a large amount of POS (part-of-speech) tagged corpus. Eojeol patterns consist of Eojeols including content words such as nouns and verbs. Phoneme patterns consist of leading consonant and vowel pairs of predicate words such as verbs and adjectives because spelling errors seldom occur in leading consonants and vowels. To evaluate the proposed method, we implemented a sentiment categorization system using a SVM (Support Vector Machine) as a machine learner. In the experiment with Korean customer reviews, the sentiment categorization system using the proposed method outperformed that using a morphological analyzer as a feature extractor.

Effective Mood Classification Method based on Music Segments (부분 정보에 기반한 효과적인 음악 무드 분류 방법)

  • Park, Gun-Han;Park, Sang-Yong;Kang, Seok-Joong
    • Journal of Korea Multimedia Society
    • /
    • v.10 no.3
    • /
    • pp.391-400
    • /
    • 2007
  • According to the recent advances in multimedia computing, storage and searching technology have made large volume of music contents become prevalent. Also there has been increasing needs for the study on efficient categorization and searching technique for music contents management. In this paper, a new classifying method using the local information of music content and music tone feature is proposed. While the conventional classifying algorithms are based on entire information of music content, the algorithm proposed in this paper focuses on only the specific local information, which can drastically reduce the computing time without losing classifying accuracy. In order to improve the classifying accuracy, it uses a new classification feature based on music tone. The proposed method has been implemented as a part of MuSE (Music Search/Classification Engine) which was installed on various systems including commercial PDAs and PCs.

  • PDF

One-Class Classification Model Based on Lexical Information and Syntactic Patterns (어휘 정보와 구문 패턴에 기반한 단일 클래스 분류 모델)

  • Lee, Hyeon-gu;Choi, Maengsik;Kim, Harksoo
    • Journal of KIISE
    • /
    • v.42 no.6
    • /
    • pp.817-822
    • /
    • 2015
  • Relation extraction is an important information extraction technique that can be widely used in areas such as question-answering and knowledge population. Previous studies on relation extraction have been based on supervised machine learning models that need a large amount of training data manually annotated with relation categories. Recently, to reduce the manual annotation efforts for constructing training data, distant supervision methods have been proposed. However, these methods suffer from a drawback: it is difficult to use these methods for collecting negative training data that are necessary for resolving classification problems. To overcome this drawback, we propose a one-class classification model that can be trained without using negative data. The proposed model determines whether an input data item is included in an inner category by using a similarity measure based on lexical information and syntactic patterns in a vector space. In the experiments conducted in this study, the proposed model showed higher performance (an F1-score of 0.6509 and an accuracy of 0.6833) than a representative one-class classification model, one-class SVM(Support Vector Machine).

TCAM Partitioning for High-Performance Packet Classification (고성능 패킷 분류를 위한 TCAM 분할)

  • Kim Kyu-Ho;Kang Seok-Min;Song Il-Seop;Kwon Teack-Geun
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.31 no.2B
    • /
    • pp.91-97
    • /
    • 2006
  • As increasing the network bandwidth, the threat of a network also increases with emerging various new services. For a high-performance network security, It is generally used that high-speed packet classification methods which employ hardware like TCAM. There needs an method using these devices efficiently because they are expensive and their capacity is not sufficient. In this paper, we propose an efficient packet classification using a Ternary-CAM(TCAM) which is widely used device for high-speed packet classification in which we have applied Snort rule set for the well-known intrusion detection system. In order to save the size of an expensive TCAM, we have eliminated duplicated IP addresses and port numbers in the rule according to the partitioning of a table in the TCAM, and we have represented negation and range rules with reduced TCAM size. We also keep advantages of low TCAM capacity consumption and reduce the number of TCAM lookups by decreasing the TCAM partitioning using combining port numbers. According to simulation results on our TCAM partitioning, the size of a TCAM can be reduced by upto 98$\%$ and the performance does not degrade significantly for high-speed packet classification with a large amount of rules.

Parallelization of Raster GIS Operations Using PC Clusters (PC 클러스터를 이용한 래스터 GIS 연산의 병렬화)

  • 신윤호;박수홍
    • Spatial Information Research
    • /
    • v.11 no.3
    • /
    • pp.213-226
    • /
    • 2003
  • With the increasing demand of processing massive geographic data, conventional GISs based on the single processor architecture appear to be problematic. Especially, performing complex GIS operations on the massive geographic data is very time consuming and even impossible. This is due to the processor speed development does not keep up with the data volume to be processed. In the field of GIS, this PC clustering is one of the emerging technology for handling massive geographic data effectively. In this study, a MPI(Message Passing Interface)-based parallel processing approach was conducted to implement the existing raster GIS operations that typically requires massive geographic data sets in order to improve the processing capabilities and performance. Specially for this research, four types of raster CIS operations that Tomlin(1990) has introduced for systematic analysis of raster GIS operation. A data decomposition method was designed and implemented for selected raster GIS operations.

  • PDF

An Effective Increment리 Content Clustering Method for the Large Documents in U-learning Environment (U-learning 환경의 대용량 학습문서 판리를 위한 효율적인 점진적 문서)

  • Joo, Kil-Hong;Choi, Jin-Tak
    • Journal of the Korea Computer Industry Society
    • /
    • v.5 no.9
    • /
    • pp.859-872
    • /
    • 2004
  • With the rapid advance of computer and communication techonology, the recent trend of education environment is edveloping in the ubiquitous learning (u-learning) direction that learners select and organize the contents, time and order of learning by themselves. Since the amount of education information through the internet is increasing rapidly and it is managed in document in an effective way is necessary. The document clustering is integrated documents to subject by classifying a set of documents through their similarity among them. Accordingly, the document clustering can be used in exploring and searching a document and it can increased accuracy of search. This paper proposes an efficient incremental clustering method for a set of documents increase gradually. The incremental document clustering algorithm assigns a set of new documents to the legacy clusters which have been identified in advance. In addition, to improve the correctness of the clustering, removing the stop words can be proposed.

  • PDF

A detailed information browsing as a standard of the hierarchical structure on 3D national treasure building (3D 건조물 문화재의 계층적 구조를 기반으로 한 상세정보브라우징)

  • Jung, jung-il;Cho, Jin-so
    • Proceedings of the Korea Contents Association Conference
    • /
    • 2009.05a
    • /
    • pp.816-821
    • /
    • 2009
  • In this paper, I would like to talk about a step by step detailed information browsing which is founded on hierarchical structure for offering suitable information about the mass 3D data of a national treasure building to user as a standard of the visual distance. A step by step detailed information of the national treasure building of gigantic proportions offers a process of detailed information browsing which decided suitable hierarchical structure as considering of the preprocessing procedure which produces hierarchical structure and a visual distance of user. In the preprocessing procedure, 3D data is divided and controlled by optimized spacial structures. The relevance connection between the inner spacial surface is then examined and reconfigured in order to prevent holes or distortions. Finally, relative information data is created. In detailed information browsing, by examining the visual distance between model and user, then by browsing proper step of data, suitable level model data can be provided to the users in accordance with the position of observation.

  • PDF

A Case-Based Reasoning Method Improving Real-Time Computational Performances: Application to Diagnose for Heart Disease (대용량 데이터를 위한 사례기반 추론기법의 실시간 처리속도 개선방안에 대한 연구: 심장병 예측을 중심으로)

  • Park, Yoon-Joo
    • Information Systems Review
    • /
    • v.16 no.1
    • /
    • pp.37-50
    • /
    • 2014
  • Conventional case-based reasoning (CBR) does not perform efficiently for high volume dataset because of case-retrieval time. In order to overcome this problem, some previous researches suggest clustering a case-base into several small groups, and retrieve neighbors within a corresponding group to a target case. However, this approach generally produces less accurate predictive performances than the conventional CBR. This paper suggests a new hybrid case-based reasoning method which dynamically composing a searching pool for each target case. This method is applied to diagnose for the heart disease dataset. The results show that the suggested hybrid method produces statistically the same level of predictive performances with using significantly less computational cost than the CBR method and also outperforms the basic clustering-CBR (C-CBR) method.

Analysis of Network Log based on Hadoop (하둡 기반 네트워크 로그 시스템)

  • Kim, Jeong-Joon;Park, Jeong-Min;Chung, Sung-Taek
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.17 no.5
    • /
    • pp.125-130
    • /
    • 2017
  • Since field control equipment such as PLC has no function to log key event information in the log, it is difficult to analyze the accident. Therefore, it is necessary to secure information that can analyze when a cyber accident occurs by logging the main event information of the field control equipment such as PLC and IED. The protocol analyzer is required to analyze the field control device (the embedded device) communication protocol for event logging. However, the conventional analyzer, such as Wireshark is difficult to process the data identification and extraction of the large variety of protocols for event logging is difficult analysis of the payload data based and classification. In this paper, we developed a system for Big Data based on field control device communication protocol payload data extraction for event logging of large studies.