• Title/Summary/Keyword: 대용량 데이터셋 (large-scale datasets)

Search results: 55

Mining the Up-to-Moment Preference Model based on Partitioned Datasets for Real Time Recommendation (실시간 추천을 위한 분할셋 기반 Up-to-Moment 선호모델 탐색)

  • Han, Jeong-Hye;Byon, Lu-Na
    • Journal of Internet Computing and Services / v.8 no.2 / pp.105-115 / 2007
  • The up-to-moment dataset is built by combining the past dataset and the recent dataset so that association rules can be computed in real time. This study proposed a time-sensitive model, $EM_{past}$, and an algorithm that can be used in real time by applying a partitioned combination law after dividing the past dataset into $(k-1)$ partitions. We also suggested $EM^{ES}_{past}$, which applies the exponential smoothing method to $EM^{p}_{past}$. When the association rules of $EM_{past}$, $EM^{w}_{past}$, and $EM^{ES}_{past}$ were compared, the simulation results showed that $EM^{ES}_{past}$ is more accurate on the test dataset than $EM_{past}$ and $EM^{w}_{past}$ for huge datasets.
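
A minimal sketch of the weighting idea behind $EM^{ES}_{past}$: per-partition supports from the past dataset are blended with the recent partition by exponential smoothing, so older partitions contribute less to the up-to-moment estimate. The function name, the support values, and the smoothing factor `alpha` below are illustrative assumptions, not the paper's actual formulation.

```python
def smoothed_support(partition_supports, alpha=0.3):
    """Combine per-partition itemset supports with exponential smoothing.

    partition_supports: floats ordered from oldest to newest partition.
    alpha: smoothing factor; larger values weight recent partitions more.
    """
    estimate = partition_supports[0]
    for support in partition_supports[1:]:
        estimate = alpha * support + (1 - alpha) * estimate
    return estimate

# Example: support of an itemset in (k-1) past partitions plus the recent one.
past = [0.12, 0.15, 0.11, 0.18]   # oldest -> newest
recent = 0.25
print(smoothed_support(past + [recent], alpha=0.4))
```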


A Design on Informal Big Data Topic Extraction System Based on Spark Framework (Spark 프레임워크 기반 비정형 빅데이터 토픽 추출 시스템 설계)

  • Park, Kiejin
    • KIPS Transactions on Software and Data Engineering / v.5 no.11 / pp.521-526 / 2016
  • As on-line informal text data are massive in volume and unstructured in nature, there are limitations in applying traditional relational data model technologies to their storage and analysis. Moreover, with massive social data being generated dynamically, real-time analysis of social users' reactions is hard to accomplish. In this paper, to easily capture the semantics of massive, informal on-line documents with an unsupervised learning mechanism, we design and implement an automatic topic extraction system based on the words that constitute each document. The input data set for the proposed system is first generated using an N-gram algorithm, which builds multi-word terms so that the meaning of sentences is captured precisely, and Hadoop and Spark (an in-memory distributed computing framework) are adopted to run the topic model. In the experiments, TB-level input data are processed through the data preprocessing and topic extraction steps. We conclude that the proposed system shows good performance in extracting meaningful topics in a timely manner, since intermediate results are served directly from main memory instead of being read from an HDD.
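
The described flow (N-gram construction, vectorization, and topic modeling on Spark) can be sketched roughly as follows with PySpark MLlib; the toy corpus, column names, and parameters are illustrative assumptions rather than the authors' implementation.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import NGram, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("topic-extraction-sketch").getOrCreate()

# Toy corpus of pre-tokenized documents (real input would be TB-scale text).
docs = spark.createDataFrame(
    [(0, ["big", "data", "topic", "model"]),
     (1, ["spark", "in", "memory", "topic", "extraction"])],
    ["id", "tokens"])

# Build bigrams so multi-word expressions are kept together.
bigrams = NGram(n=2, inputCol="tokens", outputCol="ngrams").transform(docs)

# Vectorize the n-grams and fit an LDA topic model.
cv = CountVectorizer(inputCol="ngrams", outputCol="features").fit(bigrams)
vectors = cv.transform(bigrams)
lda = LDA(k=2, maxIter=10).fit(vectors)
lda.describeTopics(3).show(truncate=False)
```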

Distributed In-Memory based Large Scale RDFS Reasoning and Query Processing Engine for the Population of Temporal/Spatial Information of Media Ontology (미디어 온톨로지의 시공간 정보 확장을 위한 분산 인메모리 기반의 대용량 RDFS 추론 및 질의 처리 엔진)

  • Lee, Wan-Gon;Lee, Nam-Gee;Jeon, MyungJoong;Park, Young-Tack
    • Journal of KIISE / v.43 no.9 / pp.963-973 / 2016
  • Providing a semantic knowledge system using media ontologies requires not only conventional axiom reasoning but also knowledge extension based on various types of reasoning. In particular, spatio-temporal information can be used in a variety of artificial intelligence applications, and the importance of spatio-temporal reasoning and expression is continuously increasing. In this paper, we append LOD data related to the public address system to large-scale media ontologies in order to utilize spatial inference in reasoning. We propose an RDFS/Spatial inference system that uses a distributed in-memory framework for reasoning over large-scale ontologies annotated with spatial information. In addition, we describe a distributed spatio-temporal SPARQL parallel query processing method designed for large-scale ontology data annotated with spatio-temporal information. To evaluate the performance of our system, we conducted experiments using the LUBM and BSBM data sets as benchmarks for ontology reasoning and query processing.
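
As a rough illustration of distributed in-memory RDFS reasoning, the sketch below applies one RDFS entailment rule (type propagation through rdfs:subClassOf) as a join over triple RDDs in PySpark; the toy triples are assumptions, and a full reasoner would iterate such joins to a fixpoint and add the spatial rules described above.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdfs-reasoning-sketch")

# Triples as (subject, predicate, object) tuples.
triples = sc.parallelize([
    ("ex:Seoul", "rdf:type", "ex:City"),
    ("ex:City", "rdfs:subClassOf", "ex:Place"),
    ("ex:Place", "rdfs:subClassOf", "ex:SpatialThing"),
])

# rdfs9: if (x rdf:type C) and (C rdfs:subClassOf D) then (x rdf:type D).
sub_class = triples.filter(lambda t: t[1] == "rdfs:subClassOf") \
                   .map(lambda t: (t[0], t[2]))          # (C, D)
types = triples.filter(lambda t: t[1] == "rdf:type") \
               .map(lambda t: (t[2], t[0]))              # (C, x)

inferred = types.join(sub_class) \
                .map(lambda kv: (kv[1][0], "rdf:type", kv[1][1]))

# A full reasoner would repeat the join until no new triples are produced.
all_triples = triples.union(inferred).distinct()
print(all_triples.collect())
```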

High volumes of data conversion based on Hadoop (Hadoop을 이용한 대용량 데이터 변환)

  • Lee, Kang Eun;Jeong, Min Jin;Jeong, Dabin;Kim, Sungsuk;Yang, Sun-Ok
    • Annual Conference of KIPS / 2019.05a / pp.72-74 / 2019
  • Hadoop is a framework that supports distributed processing applications for large volumes of data. It carries out distributed processing through a Map-Reduce process between the master node and the data nodes. In this study, the task of converting a 3D model created for 3D printing into G-code that a printer can recognize was performed on Hadoop. A 3D model usually represents its surface with two-dimensional objects (facets); these objects must be sliced along the height (Z axis), and G-code must then be generated for each layer. We first installed a Hadoop cluster on five computers and then ran the conversion while varying several properties of the target 3D model, which confirmed the advantages of Hadoop programming.
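
A minimal sketch of how the slicing step could be expressed as a Hadoop Streaming mapper: each facet is assigned to the Z layers it spans, so a reducer can later emit G-code per layer. The line-oriented input format, layer height, and key layout are illustrative assumptions, not the authors' code.

```python
#!/usr/bin/env python3
"""Hadoop Streaming mapper sketch: assign each facet to Z layers.

Assumes each input line holds the three vertices of one facet as nine
space-separated floats (x1 y1 z1 x2 y2 z2 x3 y3 z3) -- an illustrative
format, not the actual input layout used in the paper.
"""
import sys

LAYER_HEIGHT = 0.2  # mm, assumed slicing resolution

for line in sys.stdin:
    fields = line.split()
    if len(fields) != 9:
        continue
    coords = list(map(float, fields))
    z_values = coords[2::3]
    # Emit (layer index, facet) for every layer the facet spans, so the
    # reducer can generate G-code per layer.
    lo = int(min(z_values) // LAYER_HEIGHT)
    hi = int(max(z_values) // LAYER_HEIGHT)
    for layer in range(lo, hi + 1):
        print(f"{layer}\t{line.strip()}")
```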

Integrating Query Column-Sets and Horizontal Partitions on Very Large Data (대용량 데이터 처리를 위한 질의 컬럼셋과 수평 파티션의 통합 방법)

  • Chung, Moonyoung;Lee, Taewhi;Kim, Sung-Soo;Song, Hyewon;Won, Jongho
    • Annual Conference of KIPS / 2016.10a / pp.521-522 / 2016
  • Query processing over distributed data can cause heavy disk I/O and network traffic at the stage where intermediate data are transferred. Filtering out data that a query does not need, before this stage, reduces unnecessary I/O and network transfer and thus improves query processing performance. This paper proposes a method that integrates query column-sets with horizontal partitioning so that data unnecessary for query processing are filtered out at an early stage, improving query processing performance.
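
The effect of combining column-set pruning with horizontal partition pruning can be illustrated with a partitioned Parquet dataset read through PyArrow; the dataset path, column names, and partition predicate below are assumptions for illustration only, not the paper's system.

```python
import pyarrow.parquet as pq

# Read only the columns the query needs (column-set pruning) and only the
# horizontal partitions that satisfy the predicate (partition pruning), so
# irrelevant data never leaves the storage layer.
table = pq.read_table(
    "sales/",                       # hypothetical dataset partitioned by 'year'
    columns=["item_id", "amount"],  # query column-set
    filters=[("year", "=", 2016)],  # horizontal partition predicate
)
print(table.num_rows)
```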

Syllable-based Korean Named Entity Recognition and Slot Filling with ELECTRA (ELECTRA 모델을 이용한 음절 기반 한국어 개체명 인식과 슬롯 필링)

  • Do, Soojong;Park, Cheoneum;Lee, Cheongjae;Han, Kyuyeol;Lee, Mirye
    • Annual Conference on Human and Language Technology / 2020.10a / pp.337-342 / 2020
  • In a syllable-based model, each syllable is an input to the model, which avoids the error propagation and out-of-vocabulary problems that arise in models based on morphological analysis. Named entity recognition is a natural language processing task that finds words with distinct meanings in a given sentence and classifies them into entity categories, and slot filling is a natural language understanding task that extracts semantic information from a sentence. In this paper, we build a slot filling dataset for the automotive domain, perform Korean named entity recognition and slot filling at the syllable level, and, to improve performance, propose a training method based on an ELECTRA model pre-trained at the syllable level on a large-scale Korean corpus. In experiments, the proposed method achieved an F1 of 88.93% on the National Institute of Korean Language written-text named entity dataset, 94.85% on the ETRI dataset, and 94.74% on the automotive-domain slot filling dataset, showing that the proposed method is effective.
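
A minimal sketch of syllable-level token classification with an ELECTRA encoder from the transformers library follows; the checkpoint name, the number of labels, and the example sentence are placeholders, since the paper's syllable-level pre-trained model is not specified here.

```python
from transformers import ElectraTokenizerFast, ElectraForTokenClassification
import torch

MODEL = "some-org/korean-syllable-electra"  # hypothetical checkpoint name

tokenizer = ElectraTokenizerFast.from_pretrained(MODEL)
model = ElectraForTokenClassification.from_pretrained(MODEL, num_labels=9)

sentence = "서울에서 회의가 열렸다"
syllables = [ch for ch in sentence if not ch.isspace()]  # one unit per syllable

# Feed the syllables as pre-split tokens and take the BIO tag per position.
inputs = tokenizer(syllables, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # (1, seq_len, num_labels)
predictions = logits.argmax(dim=-1)          # tag index per syllable piece
print(predictions)
```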


Distributed Assumption-Based Truth Maintenance System for Scalable Reasoning (대용량 추론을 위한 분산환경에서의 가정기반진리관리시스템)

  • Jagvaral, Batselem;Park, Young-Tack
    • Journal of KIISE / v.43 no.10 / pp.1115-1123 / 2016
  • An assumption-based truth maintenance system (ATMS) is a tool that maintains the reasoning process of an inference engine. It also supports non-monotonic reasoning based on dependency-directed backtracking. Bookkeeping all reasoning processes allows it to quickly check and retract beliefs and to efficiently provide solutions for problems with a large search space. However, the amount of data has grown exponentially in recent years, making it impossible to solve large-scale problems on a single machine, and the maintenance process for such problems can incur high computational cost due to large memory overhead. To overcome this drawback, this paper presents an approach to incrementally maintaining the reasoning process of an inference engine on a cluster using Spark. It maintains data dependencies such as assumptions, labels, environments, and justifications in parallel on a cluster of machines and efficiently updates changes in large amounts of inferred data. We deployed the proposed ATMS on a cluster of 5 machines, conducted OWL/RDFS reasoning over the Lehigh University Benchmark data (LUBM), and evaluated our system in terms of performance and functionalities such as assertion, explanation, and retraction. In our experiments, the proposed system performed these operations in a reasonably short period of time over an 80+ GB inferred LUBM2000 dataset.
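
A single-machine sketch of the ATMS-style bookkeeping idea: each believed fact keeps the environments (sets of assumptions) that support it, so retracting an assumption invalidates exactly the facts whose support disappears. The facts and assumptions are toy examples, and this deliberately ignores the paper's Spark-based distribution.

```python
# Each derived fact maps to the environments (assumption sets) supporting it.
# Retracting an assumption removes every environment containing it; a fact
# with no remaining environment is no longer believed.
beliefs = {
    "flight_delayed": [{"storm"}, {"strike"}],
    "meeting_missed": [{"storm", "no_backup"}],
}

def retract(assumption):
    for fact in list(beliefs):
        beliefs[fact] = [env for env in beliefs[fact] if assumption not in env]
        if not beliefs[fact]:
            del beliefs[fact]

retract("storm")
print(beliefs)  # 'flight_delayed' survives via {'strike'}; 'meeting_missed' is retracted
```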

A Large-scale Test Set for Author Disambiguation (저자 식별을 위한 대용량 평가셋 구축)

  • Kang, In-Su;Kim, Pyung;Lee, Seung-Woo;Jung, Han-Min;You, Beom-Jong
    • The Journal of the Korea Contents Association / v.9 no.11 / pp.455-464 / 2009
  • To go beyond article-oriented search functions and provide author-oriented ones, the namesake problem for author names must be solved. Author disambiguation, proposed as its solution, assigns identifiers of real individuals to author name entities. Although recent state-of-the-art approaches to author disambiguation have reported above 90% performance, few academic information services adopt author-resolving functions. This paper describes a large-scale test set for author disambiguation created by KISTI to foster author resolution research, whose results can be applied to academic information systems to provide better services. The test set was constructed from DBLP data through web searches and manual inspection. It currently consists of 881 author names, 41,673 author name entities, and 6,921 person identifiers.
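
A test set of this kind is typically used by comparing predicted author clusters against the gold person identifiers; the sketch below computes a common pairwise precision/recall/F1 under that assumption. The metric choice and the toy labels are illustrative, not necessarily what KISTI uses.

```python
from itertools import combinations

def pairs(cluster_of):
    """Set of unordered entity pairs that share a cluster label."""
    by_cluster = {}
    for entity, label in cluster_of.items():
        by_cluster.setdefault(label, []).append(entity)
    return {frozenset(p) for members in by_cluster.values()
            for p in combinations(sorted(members), 2)}

# Toy example: gold person ids vs. predicted clusters for 5 name entities.
gold = {"e1": "p1", "e2": "p1", "e3": "p2", "e4": "p2", "e5": "p3"}
pred = {"e1": "c1", "e2": "c1", "e3": "c1", "e4": "c2", "e5": "c3"}

gold_pairs, pred_pairs = pairs(gold), pairs(pred)
precision = len(gold_pairs & pred_pairs) / len(pred_pairs)
recall = len(gold_pairs & pred_pairs) / len(gold_pairs)
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```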

2D Artificial Data Set Construction System for Object Detection and Detection Rate Analysis According to Data Characteristics and Arrangement Structure: Focusing on vehicle License Plate Detection (객체 검출을 위한 2차원 인조데이터 셋 구축 시스템과 데이터 특징 및 배치 구조에 따른 검출률 분석 : 자동차 번호판 검출을 중점으로)

  • Kim, Sang Joon;Choi, Jin Won;Kim, Do Young;Park, Gooman
    • Journal of Broadcast Engineering / v.27 no.2 / pp.185-197 / 2022
  • Recently, deep learning networks with high performance for object recognition have been emerging. For object recognition using deep learning, building a training data set is important for improving performance. Building a data set requires collecting and labeling images, a process that takes a great deal of time and manpower. For this reason, open data sets are used; however, some objects have no large open data set, and one of them is the data required for license plate detection and recognition. Therefore, in this paper, we propose an artificial license plate generator system that can create large data sets while minimizing the number of source images. In addition, the detection rate according to the arrangement structure of the artificial license plates was analyzed. The analysis showed that the best layout structures were FVC_III and B, and the most suitable network was D2Det. Although performance with the artificial data set was 2-3% lower than with the actual data set, building the artificial data was about 11 times faster than building the actual data set, proving that the system builds data sets in a time-efficient way.
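
A minimal sketch of the generator idea: a rendered plate is composited onto a background photo and its bounding box is kept as the detection label. The file names, plate text format, and placement logic are illustrative assumptions, not the proposed system.

```python
# Generate one synthetic training image with a pasted license plate and label.
from PIL import Image, ImageDraw, ImageFont
import random

background = Image.open("street_scene.jpg").convert("RGB")

# Render a plate with random characters on a plain canvas.
plate = Image.new("RGB", (260, 60), "white")
draw = ImageDraw.Draw(plate)
text = f"{random.randint(10, 99)}-{random.randint(1000, 9999)}"
draw.text((10, 10), text, fill="black", font=ImageFont.load_default())

# Paste at a random location and keep the box as the detection label.
x = random.randint(0, background.width - plate.width)
y = random.randint(0, background.height - plate.height)
background.paste(plate, (x, y))
label = (x, y, x + plate.width, y + plate.height)  # xmin, ymin, xmax, ymax

background.save("synthetic_000001.jpg")
print("bbox:", label)
```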

Development of Automatic Rule Extraction Method in Data Mining : An Approach based on Hierarchical Clustering Algorithm and Rough Set Theory (데이터마이닝의 자동 데이터 규칙 추출 방법론 개발 : 계층적 클러스터링 알고리듬과 러프 셋 이론을 중심으로)

  • Oh, Seung-Joon;Park, Chan-Woong
    • Journal of the Korea Society of Computer and Information / v.14 no.6 / pp.135-142 / 2009
  • Data mining is an emerging area of computational intelligence that offers new theories, techniques, and tools for the analysis of large data sets. The major techniques used in data mining are association rule mining, classification, and clustering. Since these techniques are usually applied individually, a methodology for rule extraction that integrates them is needed. Rule extraction techniques assist humans in analyzing large data sets and in turning the meaningful information they contain into successful decision making. This paper proposes an autonomous method of rule extraction using clustering and rough set theory. Experiments are carried out on data sets from the UCI KDD archive, and the decision rules produced by the proposed method are presented; these rules can be successfully used for decision making.
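
The clustering half of such a pipeline can be sketched with SciPy's hierarchical clustering; the rule induction step below simply reads off attribute ranges per cluster, which is a deliberate simplification of the rough-set-based extraction described above, and the toy data are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data set: rows are objects, columns are condition attributes.
X = np.array([[1.0, 2.1], [1.2, 1.9], [5.0, 4.8], [5.2, 5.1], [5.1, 4.9]])

# Agglomerative (hierarchical) clustering, cut into 2 clusters.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")

# Derive a crude decision rule per cluster from attribute value ranges.
for cluster in np.unique(labels):
    members = X[labels == cluster]
    conditions = [f"{lo:.1f} <= a{i} <= {hi:.1f}"
                  for i, (lo, hi) in enumerate(zip(members.min(0), members.max(0)))]
    print(f"IF {' AND '.join(conditions)} THEN cluster = {cluster}")
```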