• Title/Summary/Keyword: data partition

Feature Selection for Mixed Type of Data (다종 형태 데이터를 위한 요소선택 방법)

  • Yang, Jae-Kyung;Lee, Tae-Han
    • Journal of Korean Society of Industrial and Systems Engineering / v.33 no.1 / pp.114-120 / 2010
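  • Many feature selection methods have been developed as a preprocessing step in data mining to reduce the dimensionality of data. Feature selection involves determining which features are relevant when predicting an outcome or explaining the data. It also improves scalability with respect to data size and can make the learned model easier to understand. This paper proposes a new optimization-based feature selection method using the NP (Nested Partition) method, together with the basic theoretical foundation of the NP framework. In addition, many feature selection methods are limited in handling mixed-type data; we aim to overcome this by introducing feature evaluators into the NP-based method that enable it to handle mixed-type data. We also report, with experimental results, which evaluators perform better on particular data types.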

A Filter Lining Scheme for Efficient Skyline Computation

  • Kim, Ji-Hyun;Kim, Myung
    • Journal of Korea Multimedia Society / v.14 no.12 / pp.1591-1600 / 2011
  • The skyline of a multidimensional data set is the maximal subset whose elements are not dominated by any other element of the set. Skyline computation is considered very useful for decision-making systems that perform multidimensional data analysis. Recently, a great deal of interest has been shown in improving the performance of skyline computation algorithms. To speed up the computation, the number of comparisons between data elements should be reduced. In this paper, we propose a filter lining scheme to accomplish this objective. The scheme divides the multidimensional data space into angle-based partitions, places a filter in each partition, and then connects the filters to establish the final filter line. The filter line can be used in the preprocessing stage to eliminate from the original data set those data that are not part of the skyline. The filter line is adaptively improved during the data scanning stage. In addition, skylines are computed for each remaining data partition and then merged to form the final skyline. Our scheme improves on the previously reported preprocessing scheme that uses simple filters. The performance of the scheme is demonstrated by experiments.
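
The dominance test and the partition-then-merge step described in this abstract can be illustrated with a minimal sketch. It is not the authors' filter lining scheme: the filter line itself is omitted. It assumes 2-D points with non-negative coordinates where smaller values are better. The names `dominates`, `skyline`, `angle_partition`, and `partitioned_skyline` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every dimension and strictly
    better in at least one (assumption: smaller is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points: List[Point]) -> List[Point]:
    """Naive O(n^2) skyline: keep points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def angle_partition(points: List[Point], k: int) -> List[List[Point]]:
    """Assign 2-D points to k equal angular sectors of the first quadrant."""
    parts: List[List[Point]] = [[] for _ in range(k)]
    for p in points:
        theta = math.atan2(p[1], p[0])  # in [0, pi/2] for non-negative coordinates
        idx = min(int(theta / (math.pi / 2) * k), k - 1)
        parts[idx].append(p)
    return parts


def partitioned_skyline(points: List[Point], k: int = 3) -> List[Point]:
    """Compute a local skyline per angular partition, then merge the local
    results by taking the skyline of their union."""
    local = [s for part in angle_partition(points, k) for s in skyline(part)]
    return skyline(local)


points = [(1, 5), (2, 2), (3, 3), (5, 1), (4, 4)]
result = partitioned_skyline(points)
```

Merging by taking the skyline of the union of local skylines gives the same set as the naive global computation, because every global skyline point survives in its own partition and every other point is dominated by some global skyline point.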

An Incremental Multi Partition Averaging Algorithm Based on Memory Based Reasoning (메모리 기반 추론 기법에 기반한 점진적 다분할평균 알고리즘)

  • Yih, Hyeong-Il
    • Journal of IKEEE / v.12 no.1 / pp.65-74 / 2008
  • One of the popular methods used for pattern classification is the MBR (Memory-Based Reasoning) algorithm. Because it simply computes distances between a test pattern and the training patterns or hyperplanes stored in memory, and then assigns the class of the nearest training pattern, it is notorious for its memory usage and cannot learn additional information from new data. To overcome these problems, we propose an incremental learning algorithm (iMPA). iMPA divides the entire pattern space into a fixed number of partitions and generates representatives from each partition. It can also learn additional information from new data without requiring access to the original training data. The proposed method has been shown to exhibit performance comparable to k-NN with far fewer patterns, and better results than the EACH system, which implements the NGE theory, on benchmark data sets from the UCI Machine Learning Repository.

Soil-Water Partition Coefficients for Cadmium in Some Korean Soils (우리나라 일부 토양에 대한 카드뮴의 토양-물 분배계수)

  • Ok, Yong-Sik;Lee, Ok-Min;Jung, Jin-ho;Lim, Soo-kil;Kim, Jeong-Gyu
    • Korean Journal of Soil Science and Fertilizer / v.36 no.4 / pp.200-209 / 2003
  • The distribution coefficient ($K_d$) is a universal parameter for estimating cadmium partitioning in a soil-water-crop system in agricultural lands. This study was performed to identify factors affecting soil-water partition coefficients for cadmium in some Korean soils. The distribution coefficients ($K_d$) of cadmium for 15 series of agricultural soils were measured at quasi-steady state over the pH range from 2 to 11. The adsorption data of the selected soils showed a linear relationship between $\log K_d$ and pH, which agreed well with theoretically expected results: $\log K_d = 0.6339\,\mathrm{pH} + 0.5532$ ($r^2 = 0.70^{**}$). Normalization of the partition coefficients was performed in the range of pH 3.5 to 8.5 to minimize the adverse effects of Al dissolution, cationic competition, and organic matter dissolution. $K_d$-om, the partition coefficient normalized for organic matter, improved this linearity with soil pH. The values of $K_d$-om measured from the field samples were significantly correlated with those of $K_d$ predicted from the sorption-edge experimental data ($r^2 = 0.68^{**}$).
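
As a worked illustration of the reported regression, the sketch below evaluates $\log K_d$ and $K_d$ for a given pH. It assumes the logarithm is base 10, which the abstract does not state, and the function names `log10_kd` and `kd` are hypothetical. The abstract does not give units for $K_d$, so none are implied here.

```python
def log10_kd(ph: float, slope: float = 0.6339, intercept: float = 0.5532) -> float:
    """log K_d from the linear fit reported in the abstract
    (assumption: the logarithm is base 10)."""
    return slope * ph + intercept


def kd(ph: float) -> float:
    """K_d implied by the fit; units are not stated in the abstract."""
    return 10 ** log10_kd(ph)


# Each one-unit rise in pH multiplies the predicted K_d by 10**0.6339.
ratio_per_ph_unit = kd(8.0) / kd(7.0)
```

At pH 7 the fit gives $\log K_d = 0.6339 \times 7 + 0.5532 = 4.9905$. With a reported $r^2$ of 0.70, such predictions are rough estimates rather than precise values.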

A New Memory-based Learning using Dynamic Partition Averaging (동적 분할 평균을 이용한 새로운 메모리 기반 학습기법)

  • Yih, Hyeong-Il
    • Journal of the Korean Institute of Intelligent Systems / v.18 no.4 / pp.456-462 / 2008
  • Classification, the assignment of new data to one of a set of given classes, is one of the most widely used data mining techniques. Memory-Based Reasoning (MBR) is a reasoning method for classification problems. MBR simply keeps many patterns in memory, represented in their original feature-vector form, without rules for reasoning, and uses a distance function to classify a test pattern. As the number of training patterns grows, both the memory size and the amount of computation required for reasoning increase. NGE, FPA, and RPA are well-known MBR algorithms that have been shown to perform satisfactorily, but they suffer from serious problems of memory usage and lengthy computation. In this paper, we propose the DPA (Dynamic Partition Averaging) algorithm. It chooses partition points by calculating the Gini index over the entire pattern space, and partitions the pattern space dynamically. If all patterns in a partition belong to a single class, it generates a representative pattern from the partition; otherwise, it repeatedly partitions that partition by the same method. The proposed method has been shown to exhibit performance comparable to k-NN with far fewer patterns, and better results than the EACH system, which implements the NGE theory, as well as FPA and RPA.
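
The abstract does not say exactly how DPA applies the Gini index, so the following is only a sketch of the standard approach: for a single feature, try each midpoint between consecutive distinct values and keep the threshold with the lowest weighted Gini impurity. The names `gini` and `best_split` are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def gini(labels: Sequence[str]) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values: Sequence[float],
               labels: Sequence[str]) -> Tuple[Optional[float], float]:
    """Return (threshold, weighted_gini) for the best split of one feature.
    The threshold is None when all values are identical."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best: Tuple[Optional[float], float] = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left: List[str] = [label for _, label in pairs[:i]]
        right: List[str] = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


threshold, score = best_split([1, 2, 3, 10, 11, 12],
                              ["a", "a", "a", "b", "b", "b"])
```

A weighted impurity of zero means both sides of the split contain a single class, which corresponds to the stopping condition in the abstract under which a representative pattern is generated.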

Nonlinear Process Modeling Using Hard Partition-based Inference System (Hard 분산 분할 기반 추론 시스템을 이용한 비선형 공정 모델링)

  • Park, Keon-Jun;Kim, Yong-Kab
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology / v.7 no.4 / pp.151-158 / 2014
  • In this paper, we introduce an inference system based on the hard scatter partition method and use it to model a nonlinear process. The hard scatter partition method partitions the input space in scatter form, with membership degrees of 0 or 1. The proposed method is implemented with the C-Means clustering algorithm; to compensate for its sensitivity to the initial center values, the LBG algorithm is applied to obtain the initial centers by binary splitting. The hard-scatter-partitioned input space forms the rules in rule-based system modeling. The premise parameters of the rules are determined by the membership matrix obtained from the C-Means clustering algorithm. The consequent part of each rule is expressed as a polynomial function, and the coefficient parameters of each rule are determined by the standard least-squares method. Data widely used for nonlinear processes are used to model the nonlinear process and to evaluate its characteristics.
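
A hard partition with membership degrees of 0 or 1 can be sketched as one assignment step and one center-update step of hard C-Means. This is an illustrative sketch only: LBG initialization and the least-squares consequent estimation from the abstract are not shown, and the names `hard_membership` and `update_centers` are hypothetical.

```python
from typing import List, Tuple

Vector = Tuple[float, ...]


def hard_membership(points: List[Vector], centers: List[Vector]) -> List[List[int]]:
    """Membership matrix U with U[i][j] = 1 if center j is nearest to
    point i (squared Euclidean distance), else 0."""
    matrix = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        nearest = dists.index(min(dists))  # ties go to the lowest index
        matrix.append([1 if j == nearest else 0 for j in range(len(centers))])
    return matrix


def update_centers(points: List[Vector], matrix: List[List[int]],
                   centers: List[Vector]) -> List[Vector]:
    """Move each center to the mean of its members; a center with no
    members is kept unchanged."""
    new_centers = []
    for j, center in enumerate(centers):
        members = [p for p, row in zip(points, matrix) if row[j] == 1]
        if not members:
            new_centers.append(center)
        else:
            new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new_centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
U = hard_membership(points, centers)
centers = update_centers(points, U, centers)
```

Each row of `U` contains exactly one 1, so every input point fires exactly one rule. That is what distinguishes a hard partition from a fuzzy one.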

Fuzzy Nonlinear Regression Model (퍼지비선형회귀모형)

  • Hwang, Seung-Gook;Park, Young-Man;Seo, Yoo-Jin;Park, Kwang-Pak
    • Journal of the Korean Institute of Intelligent Systems / v.8 no.6 / pp.99-105 / 1998
  • This paper proposes a fuzzy nonlinear regression model that uses a genetic algorithm. The genetic algorithm is used to classify the input data for better fuzzy regression analysis. From this partition, each datum has a grade of membership in the data group to which it belongs. The data groups, obtained from the optimal partition of the region of each variable, have fuzzy linear regression models with different fuzzy parameters. We combine the fuzzy outputs of the data groups to obtain the final fuzzy number for each datum. We demonstrate the efficiency of the method through a case study.

An Attribute Replicating Vertical File Partition Method by Genetic Algorithm (유전알고리듬을 이용한 속성의 중복 허용 파일 수직분할 방법)

  • 김재련;유종찬
    • The Journal of Information Technology and Database / v.6 no.2 / pp.71-86 / 1999
  • The performance of a relational database is measured by the number of disk accesses needed to transfer data from disk to main memory. This paper proposes vertically partitioning relations into fragments and allowing attribute replication to reduce the number of disk accesses. To reduce computation time, a heuristic search method based on a genetic algorithm is used. The genetic algorithm employs a rank-based sharing fitness function and elitism. Desirable parameters of the genetic algorithm are obtained through experiments and used to find solutions. Solutions to the attribute-replication and non-replication problems are compared. Optimal solutions obtained by the branch-and-bound method and heuristic solutions obtained by the genetic algorithm are also discussed. The proposed method can solve large problems within an acceptable time limit and yields near-optimal solutions.

Meta Analysis of Usability Experimental Research Using New Bi-Clustering Algorithm

  • Kim, Kyung-A;Hwang, Won-Il
    • The Korean Journal of Applied Statistics / v.21 no.6 / pp.1007-1014 / 2008
  • Usability evaluation (UE) experiments are conducted to provide UE practitioners with guidelines for better outcomes. In UE research, a significant quantity of empirical results has accumulated over the past decades. Although these results were expected to be integrated into generalized guidelines, traditional meta-analysis is limited in its ability to combine UE empirical results, which often show considerable heterogeneity. In this study, a new data mining method called weighted bi-clustering (WBC) is proposed to partition heterogeneous studies into homogeneous subsets. We applied WBC to UE empirical results and identified two homogeneous subsets, each of which can be meta-analyzed. In addition, interactions between experimental conditions and UE methods were hypothesized based on the resulting partition, and some interactions were confirmed by statistical tests.

Selection Method of Fuzzy Partitions in Fuzzy Rule-Based Classification Systems (퍼지 규칙기반 분류시스템에서 퍼지 분할의 선택방법)

  • Son, Chang-S.;Chung, Hwan-M.;Kwon, Soon-H.
    • Journal of the Korean Institute of Intelligent Systems / v.18 no.3 / pp.360-366 / 2008
  • The initial fuzzy partitions in fuzzy rule-based classification systems are determined by considering the domain of each attribute in the given data, and the optimal classification boundaries within the fuzzy partitions can be discovered by tuning their parameters with learning processes such as neural networks and genetic algorithms. In this paper, we propose a method for selecting fuzzy partitions based on statistical information, to maximize pattern classification performance without a learning process. The statistical information is used to extract, from the numerical data, the uncertainty regions (i.e., the regions in which the classification boundaries are determined) in each input attribute. We also discuss methods for extracting the candidate rules associated with the partition intervals generated from the statistical information, and for minimizing the coupling problem between candidate rules. To show the effectiveness of the proposed method, we compared its classification accuracy with that of conventional methods on the IRIS and New Thyroid Cancer data. The experimental results confirm that the proposed method, which considers only the statistical information of the numerical patterns, provides classification accuracy equal to or better than that of the conventional methods.