• Title/Summary/Keyword: neighbor selection

130 search results (processing time: 0.028 seconds)

Genetic diversity and divergence among Korean cattle breeds assessed using a BovineHD single-nucleotide polymorphism chip

  • Kim, Seungchang;Cheong, Hyun Sub;Shin, Hyoung Doo;Lee, Sung-Soo;Roh, Hee-Jong;Jeon, Da-Yeon;Cho, Chang-Yeon
    • Asian-Australasian Journal of Animal Sciences
    • /
    • v.31 no.11
    • /
    • pp.1691-1699
    • /
    • 2018
  • Objective: In Korea, there are three main cattle breeds, which are distinguished by coat color: Brown Hanwoo (BH), Brindle Hanwoo (BRH), and Jeju Black (JB). In this study, we sought to compare the genetic diversity and divergence among these Korean cattle breeds using a BovineHD chip genotyping array. Methods: Sample data were collected from 168 cattle in three populations of BH (48 cattle), BRH (96 cattle), and JB (24 cattle). Single-nucleotide polymorphism (SNP) genotyping was performed using the Illumina BovineHD 777K SNP BeadChip. Results: Heterozygosity, used as a measure of within-breed genetic diversity, was higher in BH (0.293) and BRH (0.296) than in JB (0.266). Linkage disequilibrium decay was more rapid in BH and BRH than in JB, reaching an average $r^2$ value of 0.2 before 26 kb in BH and BRH, whereas the corresponding value was reached before 32 kb in JB. Intra-population, inter-population, and Fst analyses were used to identify candidate signatures of positive selection in the genome of a domestic Korean cattle population, and 48, 11, and 11 loci, respectively, were detected in the genomic region of the BRH breed. A neighbor-joining phylogenetic tree showed two main groups: one comprising BH and BRH, and the other containing JB. The runs-of-homozygosity analysis between Korean breeds indicated that the BRH and JB breeds have higher inbreeding within breeds than BH. An analysis of differentiation based on a high-density SNP chip showed differences between Korean cattle breeds, and the closeness of breeds corresponded to the geographic regions where they evolved. Conclusion: Our results indicate that although the Korean cattle breeds have common features, they also show clear breed diversity.
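The heterozygosity and LD ($r^2$) statistics above can be illustrated with a small sketch; the 0/1/2 minor-allele genotype coding, the toy data, and the function names below are invented for illustration and are not the authors' pipeline:

```python
# Illustrative sketch (toy data, not the paper's pipeline): observed
# heterozygosity and pairwise LD (r^2) from biallelic SNP genotypes
# coded as minor-allele counts 0/1/2.

def observed_heterozygosity(genotypes):
    """Fraction of heterozygous (code 1) calls across all samples and loci."""
    calls = [g for sample in genotypes for g in sample]
    return sum(1 for g in calls if g == 1) / len(calls)

def ld_r2(locus_a, locus_b):
    """Squared genotypic correlation between two loci (an r^2 estimate)."""
    n = len(locus_a)
    mean_a = sum(locus_a) / n
    mean_b = sum(locus_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(locus_a, locus_b)) / n
    var_a = sum((a - mean_a) ** 2 for a in locus_a) / n
    var_b = sum((b - mean_b) ** 2 for b in locus_b) / n
    return (cov * cov) / (var_a * var_b)

# toy genotype matrix: 4 samples x 3 loci (made-up values)
geno = [[0, 1, 2], [1, 1, 0], [2, 1, 1], [1, 0, 1]]
het = observed_heterozygosity(geno)
```

In a real analysis these statistics would be computed per breed across hundreds of thousands of loci, and $r^2$ would be tracked as a function of inter-SNP distance to estimate LD decay.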

Construction of an artificial levee line in river zones using LiDAR data (라이다 자료를 이용한 하천지역 인공 제방선 추출)

  • Choung, Yun-Jae;Park, Hyeon-Cheol;Jo, Myung-Hee
    • Proceedings of the Korea Water Resources Association Conference
    • /
    • 2011.05a
    • /
    • pp.185-185
    • /
    • 2011
  • Mapping of artificial levee lines, one of the major tasks in river zone mapping, is critical to the prevention of river floods and the protection of environments and ecosystems in river zones. Thus, mapping of artificial levee lines is essential for the management and development of river zones. Coastal mapping, including river zone mapping, has historically been carried out using surveying technologies. Photogrammetry, one of these surveying technologies, has recently been used for national river zone mapping in Korea. Airborne laser scanning has been used in most advanced countries for coastal mapping due to its ability to penetrate shallow water and its high vertical accuracy. Because of these advantages, the use of LiDAR data in coastal mapping is efficient for monitoring and predicting significant topographic change in river zones. This paper introduces a method for constructing a 3D artificial levee line from a set of LiDAR points using normal vectors. Multiple steps are involved in this method. First, a 2.5-dimensional Delaunay triangle mesh is generated based on the three nearest-neighbor points in the LiDAR data. Second, median filtering is applied to minimize noise. Third, edge selection algorithms are applied to extract break edges from the Delaunay triangle mesh using two normal vectors; in this research, two edge selection algorithms based on hypothesis testing are used. Fourth, intersection edges, which are extracted by both methods in the same range, are selected as the intersection edge group. Fifth, from the intersection edge group, linear feature edges that are not suitable for composing a levee line are removed as far as possible, considering the vertical distance, slope, and connectivity of each edge.
Sixth, using all line segments suitable for constituting a levee line, one river levee line segment is connected to another whose end point lies nearest, horizontally and vertically, to its own. After all the river levee line segments are linked, the initial river levee line is generated. Since the initial river levee line consists of LiDAR points, it zigzags along the river levee. Thus, as the last step, a smoothing algorithm is applied to fit the initial river levee line to the reference line, and the final 3D river levee line is constructed. The proposed method was applied to construct the 3D river levee line at the Zng-San levee near Ham-Ahn Bo on the Nak-Dong River. Statistical results show that the river levee line generated using the proposed method has high accuracy in comparison with the ground truth. This paper shows that the use of LiDAR data for constructing a 3D river levee line for river zone mapping is useful and efficient; as a result, it can replace the ground surveying method for construction of the 3D river levee line.
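The segment-linking step (connecting each levee segment to the one whose end point lies nearest) can be sketched roughly as follows; the greedy chaining strategy, the toy 3D data, and all names below are illustrative assumptions, not the paper's implementation:

```python
# Hedged sketch of nearest-endpoint segment linking: repeatedly append the
# segment whose nearer endpoint is closest to the current end of the
# growing levee line, reversing the segment's orientation if needed.

import math

def link_segments(segments):
    """Greedily chain 3D polyline segments end-to-end by nearest endpoints."""
    remaining = [list(s) for s in segments]
    line = remaining.pop(0)
    while remaining:
        end = line[-1]
        # pick the segment (and orientation) whose near endpoint is closest
        i, seg = min(
            ((j, s if math.dist(end, s[0]) <= math.dist(end, s[-1]) else s[::-1])
             for j, s in enumerate(remaining)),
            key=lambda t: math.dist(end, t[1][0]),
        )
        remaining.pop(i)
        line.extend(seg)
    return line

# toy segments along the x-axis; the middle one is stored reversed on purpose
segments = [[(0, 0, 0), (1, 0, 0)],
            [(3, 0, 0), (2, 0, 0)],
            [(4, 0, 0), (5, 0, 0)]]
levee = link_segments(segments)
```

A production version would also enforce the vertical-distance, slope, and connectivity filters described above before linking, and smooth the resulting zigzag line as a final step.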


Feature Selection to Predict Very Short-term Heavy Rainfall Based on Differential Evolution (미분진화 기반의 초단기 호우예측을 위한 특징 선택)

  • Seo, Jae-Hyun;Lee, Yong Hee;Kim, Yong-Hyuk
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.22 no.6
    • /
    • pp.706-714
    • /
    • 2012
  • The Korea Meteorological Administration provided four years of recent weather records as the dataset for our very short-term heavy rainfall prediction. We divided the dataset into three parts: training, validation, and test sets. Through feature selection, we selected only the important features among the 72 available, to avoid the significant growth of the solution space, which expands exponentially with the dimensionality. We used a differential evolution algorithm with two classifiers as the fitness function of the evolutionary computation to select a more accurate feature subset. One of the classifiers is the Support Vector Machine (SVM), which shows high performance, and the other is the k-Nearest Neighbor (k-NN), which is fast in general. The test results of SVM were more prominent than those of k-NN in our experiments. We also preprocessed the weather data using undersampling and normalization techniques. The test results of our differential evolution algorithm were about five times better than those using all features, and about 1.36 times better than those using a genetic algorithm, the best previously known approach. Running times when using a genetic algorithm were about twenty times longer than those when using a differential evolution algorithm.
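A minimal sketch of differential-evolution feature selection in the spirit described above: real-valued vectors are thresholded at 0.5 to form a feature subset, and a toy surrogate fitness stands in for SVM/k-NN validation accuracy. All names, constants, and the "useful feature" set below are invented:

```python
# Sketch (not the paper's implementation) of DE/rand/1/bin feature
# selection. fitness() is a made-up surrogate for classifier accuracy:
# it rewards covering "informative" features and penalizes subset size.

import random

random.seed(0)

N_FEATURES = 10
USEFUL = {1, 4, 7}  # hypothetical informative features

def fitness(vec):
    subset = {i for i, v in enumerate(vec) if v > 0.5}
    return len(subset & USEFUL) - 0.05 * len(subset)

def differential_evolution(pop_size=20, gens=60, F=0.8, CR=0.9):
    pop = [[random.random() for _ in range(N_FEATURES)] for _ in range(pop_size)]
    for _ in range(gens):
        for i in range(pop_size):
            # mutation: a + F * (b - c) from three other random individuals
            a, b, c = random.sample([p for j, p in enumerate(pop) if j != i], 3)
            trial = [
                min(1.0, max(0.0, a[k] + F * (b[k] - c[k])))
                if random.random() < CR else pop[i][k]
                for k in range(N_FEATURES)
            ]
            # greedy selection: keep the trial if it is at least as fit
            if fitness(trial) >= fitness(pop[i]):
                pop[i] = trial
    best = max(pop, key=fitness)
    return {i for i, v in enumerate(best) if v > 0.5}

selected = differential_evolution()
```

In the paper's setting, `fitness` would instead train and evaluate an SVM or k-NN on the validation set for the candidate feature subset, which is where most of the running time goes.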

Improving the Accuracy of Document Classification by Learning Heterogeneity (이질성 학습을 통한 문서 분류의 정확성 향상 기법)

  • Wong, William Xiu Shun;Hyun, Yoonjin;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.3
    • /
    • pp.21-44
    • /
    • 2018
  • In recent years, the rapid development of internet technology and the popularization of smart devices have resulted in massive amounts of text data, produced and distributed through various media platforms such as the World Wide Web, Internet news feeds, microblogs, and social media. However, this enormous amount of easily obtained information lacks organization, a problem that has drawn the interest of many researchers and created demand for professionals capable of classifying relevant information; hence, text classification was introduced. Text classification is a challenging task in modern data analysis, in which a text document must be assigned to one or more predefined categories or classes. Various techniques are available in this field, such as K-Nearest Neighbor, the Naïve Bayes algorithm, Support Vector Machine, Decision Tree, and Artificial Neural Network. However, when dealing with huge amounts of text data, model performance and accuracy become a challenge. Depending on the type of words used in the corpus and the type of features created for classification, the performance of a text classification model can vary. Most previous attempts have been based on proposing a new algorithm or modifying an existing one, and this line of research can be said to have reached its limits for further improvement. In this study, rather than proposing a new algorithm or modifying an existing one, we focus on modifying the use of the data. It is widely known that classifier performance is influenced by the quality of the training data upon which the classifier is built. Real-world datasets most of the time contain noise, and this noisy data can affect the decisions made by classifiers built from them.
In this study, we consider that data from different domains, i.e., heterogeneous data, may have noise characteristics that can be utilized in the classification process. A machine learning classifier is normally built under the assumption that the characteristics of the training data and the target data are the same or very similar. However, for unstructured data such as text, the features are determined by the vocabulary of the documents; if the viewpoints of the training data and target data differ, the features may also appear different between the two. We attempt to improve classification accuracy by strengthening the robustness of the document classifier through artificially injecting noise into the process of constructing it. Data coming from various sources are likely to be formatted differently, which causes difficulties for traditional machine learning algorithms, since they were not developed to recognize different types of data representation at the same time and combine them into one generalization. Therefore, in order to utilize heterogeneous data in the learning process of the document classifier, we apply semi-supervised learning. However, unlabeled data may degrade the performance of the document classifier, so we further propose a method called the Rule Selection-Based Ensemble Semi-Supervised Learning Algorithm (RSESLA) to select only the documents that contribute to the accuracy improvement of the classifier. RSESLA creates multiple views by manipulating the features using different types of classification models and different types of heterogeneous data; the most confident classification rules are selected and applied for the final decision making.
In this paper, three different types of real-world data sources were used: news, Twitter, and blogs.
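The core idea of admitting only confidently classified unlabeled documents, which RSESLA refines with multiple views and rule selection, might be sketched generically as follows; the classifier, the confidence rule, and the data are toys, not the paper's method:

```python
# Generic sketch (hypothetical, not RSESLA itself) of confidence-based
# selection in semi-supervised learning: only unlabeled documents whose
# predicted label is sufficiently confident join the training set.

def select_confident(classify, unlabeled, threshold=0.9):
    """classify(doc) -> (label, confidence); keep only confident docs."""
    return [(doc, label)
            for doc in unlabeled
            for label, conf in [classify(doc)]
            if conf >= threshold]

# toy stand-in classifier: confidence grows with a keyword count (made up)
def toy_classify(doc):
    hits = doc.lower().count("news")
    return ("news", min(1.0, 0.5 + 0.25 * hits))

docs = ["news news today", "random text", "breaking news news news"]
kept = select_confident(toy_classify, docs, threshold=0.9)
```

RSESLA additionally compares such confidence judgments across multiple views (different models and data sources) before a document's rule is accepted for the final decision.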

Determination of Genetic Diversity among Korean Hanwoo Cattle Based on Physical Characteristics

  • Choi, T.J.;Lee, S.S.;Yoon, D.H.;Kang, H.S.;Kim, C.D.;Hwang, I.H.;Kim, C.Y.;Jin, X.;Yang, C.G.;Seo, K.S.
    • Asian-Australasian Journal of Animal Sciences
    • /
    • v.25 no.9
    • /
    • pp.1205-1215
    • /
    • 2012
  • This study was conducted to establish genetic criteria for the phenotypic characteristics of Hanwoo cattle, based on allele frequencies and genetic variance analysis using microsatellite markers. Analysis of the genetic diversity among 399 Hanwoo cattle, classified according to nose pigmentation and coat color, was carried out using 22 microsatellite markers. The results revealed that the INRA035 locus was associated with the highest $F_{is}$ (0.536). Given that the $F_{is}$ value for the Hanwoo INRA035 population ranged from 0.533 (white) to 1.000 (white spotted), this finding is consistent with this locus being fixed in Hanwoo cattle. Expected heterozygosities of the Hanwoo groups classified by coat color and degree of nose pigmentation ranged from $0.689{\pm}0.023$ (Holstein) to $0.743{\pm}0.021$ (nose pigmentation level d). Normal Hanwoo and animals with a mixed white coat showed the closest relationship, as the lowest $D_A$ value was observed between these groups. However, a pair-wise differentiation test of $F_{st}$ showed no significant difference among the Hanwoo groups classified by coat color and degree of nose pigmentation (p<0.01). Moreover, the neighbor-joining tree based on a $D_A$ genetic distance matrix within the 399 Hanwoo individuals, together with principal component analyses, confirmed that groups of cattle with mixed coat color and nose pigmentation formed distinct groups representing Hanwoo genetic and phenotypic characteristics. The results of this study support a relaxation of policies regulating bull selection or animal registration in an effort to minimize financial loss, and could provide basic information for establishing criteria to classify Hanwoo phenotypes.
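The expected-heterozygosity and $F_{is}$ statistics used above follow standard population-genetics formulas, $H_e = 1 - \sum_i p_i^2$ and $F_{is} = 1 - H_o/H_e$; a toy sketch with made-up allele counts for a single microsatellite locus:

```python
# Sketch (invented numbers) of expected heterozygosity and the inbreeding
# coefficient F_is at one locus, from allele counts.

def expected_heterozygosity(allele_counts):
    """He = 1 - sum(p_i^2) over allele frequencies p_i."""
    total = sum(allele_counts)
    return 1.0 - sum((c / total) ** 2 for c in allele_counts)

def f_is(obs_het, exp_het):
    """Inbreeding coefficient: F_is = 1 - Ho / He."""
    return 1.0 - obs_het / exp_het

# three alleles with made-up counts 50, 30, 20 (frequencies 0.5, 0.3, 0.2)
he = expected_heterozygosity([50, 30, 20])
```

An $F_{is}$ near 1, as reported for INRA035 in some groups, means observed heterozygosity is far below expectation, which is what fixation of an allele looks like in this statistic.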

An Implementation of Automatic Genre Classification System for Korean Traditional Music (한국 전통음악 (국악)에 대한 자동 장르 분류 시스템 구현)

  • Lee Kang-Kyu;Yoon Won-Jung;Park Kyu-Sik
    • The Journal of the Acoustical Society of Korea
    • /
    • v.24 no.1
    • /
    • pp.29-37
    • /
    • 2005
  • This paper proposes an automatic genre classification system for Korean traditional music. The proposed system accepts a queried input piece of music and classifies it into one of six musical genres, namely Royal Shrine Music, Classical Chamber Music, Folk Song, Folk Music, Buddhist Music, and Shamanist Music, based on the music contents. In general, content-based music genre classification consists of two stages: music feature vector extraction and pattern classification. For feature extraction, the system extracts 58-dimensional feature vectors, including the spectral centroid, spectral rolloff, and spectral flux based on the STFT, as well as coefficient-domain features such as LPC and MFCC; these features are then further optimized using the SFS method. For pattern (genre) classification, the k-NN, Gaussian, GMM, and SVM algorithms are considered. In addition, the proposed system adopts the MFC method to settle the uncertainty in system performance caused by different query patterns (or portions). From the experimental results, we verify a successful genre classification performance of over 97% for both the k-NN and SVM classifiers; however, the SVM classifier provides almost three times faster classification than the k-NN.
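One of the named features, the spectral centroid, can be sketched for a single STFT frame; the toy magnitude frame, FFT size, and sampling rate below are invented, and this is not the paper's 58-dimensional extractor:

```python
# Sketch of the spectral centroid of one STFT frame: the magnitude-weighted
# mean frequency, a standard "brightness" feature in audio classification.

def spectral_centroid(magnitudes, sample_rate, n_fft):
    """Centroid = sum(f_k * |X_k|) / sum(|X_k|) over FFT bins k."""
    freqs = [k * sample_rate / n_fft for k in range(len(magnitudes))]
    total = sum(magnitudes)
    return sum(f * m for f, m in zip(freqs, magnitudes)) / total

# toy frame: all energy in bin 2 of an 8-point FFT at an 8 kHz sample rate,
# so the centroid should land on that bin's frequency (2 kHz)
centroid = spectral_centroid([0.0, 0.0, 1.0, 0.0], 8000, 8)
```

Spectral rolloff and flux are computed from the same magnitude frames, and a full pipeline would aggregate such per-frame values into the fixed-length vectors the classifiers consume.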

Adaptive Block Recovery Based on Subband Energy and DC Value in Wavelet Domain (웨이블릿 부대역의 에너지와 DC 값에 근거한 적응적 블록 복구)

  • Hyun, Seung-Hwa;Eom, Il-Kyu;Kim, Yoo-Shin
    • Journal of the Institute of Electronics Engineers of Korea SP
    • /
    • v.42 no.5 s.305
    • /
    • pp.95-102
    • /
    • 2005
  • When images compressed with block-based compression techniques are transmitted over a noisy channel, unexpected block losses occur. In this paper, we present a post-processing-based block recovery scheme using Haar wavelet features. Recovering lost blocks without considering edge direction can cause block-blurring effects. The directional recovery method proposed in this paper is effective for strong edges because it exploits the neighboring blocks adaptively, according to the edges and directional information in the image. First, the adaptive selection of neighbor blocks is performed based on the energy of wavelet subbands (EWS) and the difference of DC values (DDC). The lost blocks are then recovered by linear interpolation in the spatial domain using the selected blocks. The method using only EWS performs well for horizontal and vertical edges, but not as well for diagonal edges. Conversely, using only DDC performs well for diagonal edges, with the exception of line- or roof-type edge profiles. Therefore, we combined EWS and DDC for better results. The proposed methods outperformed previous methods that use fixed blocks.
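The neighbor-selection idea could be caricatured as follows; the subband naming, toy energies, and the simple averaging rule below are illustrative assumptions, not the paper's exact EWS/DDC procedure:

```python
# Hedged sketch of directional block recovery: pick the dominant edge
# direction from Haar subband energies of the neighborhood, then recover
# the lost block by averaging the pair of neighbors along that direction.

def dominant_direction(subband_energy):
    """subband_energy: dict with 'LH' (horizontal edges), 'HL' (vertical
    edges), 'HH' (diagonal) energies; return the strongest direction."""
    return max(subband_energy, key=subband_energy.get)

def recover_block(neighbors, direction):
    """Average the pair of neighbor blocks lying along the edge direction."""
    pairs = {"LH": ("left", "right"),
             "HL": ("top", "bottom"),
             "HH": ("top", "bottom")}  # toy fallback for diagonal edges
    a, b = pairs[direction]
    return [(x + y) / 2 for x, y in zip(neighbors[a], neighbors[b])]

# toy data: a strong horizontal edge (LH energy dominates), tiny 2-pixel
# "blocks" so the interpolation is easy to follow
energy = {"LH": 9.0, "HL": 2.0, "HH": 1.0}
nbrs = {"left": [10, 10], "right": [20, 20], "top": [0, 0], "bottom": [4, 4]}
block = recover_block(nbrs, dominant_direction(energy))
```

The paper's method additionally consults the DC-value differences (DDC) before committing to a direction, which is what rescues the diagonal cases where subband energy alone misleads.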

Simultaneous Optimization of a KNN Ensemble Model for Bankruptcy Prediction (부도예측을 위한 KNN 앙상블 모형의 동시 최적화)

  • Min, Sung-Hwan
    • Journal of Intelligence and Information Systems
    • /
    • v.22 no.1
    • /
    • pp.139-157
    • /
    • 2016
  • Bankruptcy involves considerable costs, so it can have significant effects on a country's economy. Thus, bankruptcy prediction is an important issue. Over the past several decades, many researchers have addressed topics associated with bankruptcy prediction. Early research on bankruptcy prediction employed conventional statistical methods such as univariate analysis, discriminant analysis, multiple regression, and logistic regression. Later on, many studies began utilizing artificial intelligence techniques such as inductive learning, neural networks, and case-based reasoning. Currently, ensemble models are being utilized to enhance the accuracy of bankruptcy prediction. Ensemble classification involves combining multiple classifiers to obtain more accurate predictions than those obtained using individual models. Ensemble learning techniques are known to be very useful for improving the generalization ability of the classifier. Base classifiers in the ensemble must be as accurate and diverse as possible in order to enhance the generalization ability of an ensemble model. Commonly used methods for constructing ensemble classifiers include bagging, boosting, and random subspace. The random subspace method selects a random feature subset for each classifier from the original feature space to diversify the base classifiers of an ensemble. Each ensemble member is trained by a randomly chosen feature subspace from the original feature set, and predictions from each ensemble member are combined by an aggregation method. The k-nearest neighbors (KNN) classifier is robust with respect to variations in the dataset but is very sensitive to changes in the feature space. For this reason, KNN is a good classifier for the random subspace method. The KNN random subspace ensemble model has been shown to be very effective for improving an individual KNN model. 
The k parameter of the KNN base classifiers and the feature subsets selected for them play an important role in determining the performance of the KNN ensemble model. However, few studies have focused on optimizing the k parameter and feature subsets of base classifiers in the ensemble. This study proposed a new ensemble method that improves upon the performance of the KNN ensemble model by optimizing both the k parameters and the feature subsets of the base classifiers. A genetic algorithm was used to optimize the KNN ensemble model and improve its prediction accuracy. The proposed model was applied to a bankruptcy prediction problem using a real dataset from Korean companies. The research data included 1,800 externally non-audited firms that filed for bankruptcy (900 cases) or non-bankruptcy (900 cases). Initially, the dataset consisted of 134 financial ratios. Prior to the experiments, 75 financial ratios were selected based on an independent-sample t-test of each financial ratio as an input variable and bankruptcy or non-bankruptcy as the output variable. Of these, 24 financial ratios were selected using a logistic regression backward feature selection method. The complete dataset was separated into two parts: training and validation. The training dataset was further divided into two portions: one for training the model and the other to avoid overfitting. The prediction accuracy on this held-out portion was used to determine the fitness value, in order to avoid overfitting. The validation dataset was used to evaluate the effectiveness of the final model. A 10-fold cross-validation was implemented to compare the performances of the proposed model and other models. To evaluate the effectiveness of the proposed model, its classification accuracy was compared with that of other models, and the Q-statistic values and average classification accuracies of the base classifiers were investigated.
The experimental results showed that the proposed model outperformed other models, such as the single model and random subspace ensemble model.
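A compact sketch of the random-subspace KNN ensemble described above, with toy data, a plain-Python KNN, and majority voting as the aggregation method; the GA optimization of k and the feature subsets is omitted, and all names and data are invented:

```python
# Sketch of a random-subspace KNN ensemble: each base classifier sees a
# random feature subset, and member predictions are combined by vote.

import random
from collections import Counter

random.seed(1)

def knn_predict(train, query, feats, k=3):
    """k-NN vote using only the feature indices in `feats`."""
    def sq_dist(x, y):
        return sum((x[i] - y[i]) ** 2 for i in feats)
    nearest = sorted(train, key=lambda t: sq_dist(t[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def subspace_ensemble_predict(train, query, n_members=9, subset_size=2, k=3):
    n_feats = len(train[0][0])
    votes = [
        knn_predict(train, query, random.sample(range(n_feats), subset_size), k)
        for _ in range(n_members)
    ]
    return Counter(votes).most_common(1)[0][0]

# toy 3-feature data: class is determined by the first two features,
# the third is noise (made-up values)
train = [([0, 0, 5], "A"), ([0, 1, 9], "A"), ([1, 0, 1], "A"),
         ([5, 5, 2], "B"), ([5, 6, 8], "B"), ([6, 5, 0], "B")]
pred = subspace_ensemble_predict(train, [0.5, 0.5, 4])
```

In the paper, a genetic algorithm additionally searches over each member's k and feature subset, using held-out accuracy as fitness, rather than drawing the subsets uniformly at random.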

Genetic Variation and Polymorphism in Rainbow Trout, Oncorhynchus mykiss Analysed by Amplified Fragment Length Polymorphism

  • Yoon, Jong-Man;Yoo, Jae-Young;Park, Jae-Il
    • Journal of Aquaculture
    • /
    • v.17 no.1
    • /
    • pp.69-80
    • /
    • 2004
  • The objective of the present study was to analyze the genetic distances, variation, and characteristics of individual rainbow trout, Oncorhynchus mykiss, using the amplified fragment length polymorphism (AFLP) method as a molecular genetic technique, to detect AFLP band patterns as genetic markers, and to compare the efficiency of agarose gel electrophoresis (AGE) and polyacrylamide gel electrophoresis (PAGE). Using 9 primer combinations, a total of 141 AFLP bands were produced in AGE, 108 (82.4%) of which were polymorphic. In PAGE, a total of 288 bands were detected, and 220 (76.4%) were polymorphic. The AFLP fingerprints of AGE were different from those of PAGE: separation of the fragments with low molecular weight and of the genetic polymorphisms revealed distinct patterns in the two gel systems. In the present study, the average band-sharing values of individuals between the two populations from geographically separate sites in Kangwon-do ranged from 0.084 to 0.738 in AGE and PAGE. The band-sharing value between individuals No. 9 and No. 10 was the highest within a population, whereas that between individuals No. 5 and No. 7 was the lowest. As calculated by band-sharing analysis, the average genetic difference (mean$\pm$SD) among individuals was approximately $0.590{\pm}0.125$ in this population. In AGE, the single-linkage dendrogram resulting from two primers (M11+H11 and M13+H11) indicated six genetic groupings composed of group 1 (No. 9 and 10), group 2 (No. 1, 4, 5, 7, 10, 11, 16, and 17), group 3 (No. 2, 3, 6, 8, 12, 15, and 16), group 4 (No. 9, 14, and 17), group 5 (No. 13, 19, 20, and 21), and group 6 (No. 23). In AGE, the genetic distances among individuals between populations ranged from 0.108 to 0.392. The shortest genetic distance displaying significant molecular differences (0.108) was between individuals No. 9 and No. 10, while the genetic distance between individual No. 23 and the remaining individuals within the population was the highest (0.392).
Additionally, in the cluster analysis using the PAGE data, the single-linkage dendrogram resulting from two primers (M12+H13 and M11+H13) indicated seven genetic groupings composed of group 1 (No. 15), group 2 (No. 14), group 3 (No. 11 and 12), group 4 (No. 5, 6, 7, 8, 10, and 13), group 5 (No. 1, 2, 3, and 4), group 6 (No. 9), and group 7 (No. 16). Comparing individuals in PAGE, the genetic distance between No. 10 and No. 7 was the shortest (0.071), and that between No. 16 and No. 14 the highest (0.242). In the PAGE analysis, genetic differences were clearly apparent, with 13 of 16 individuals showing greater than 80% AFLP-based similarity to their closest neighbor. Three individuals (No. 14, No. 15, and No. 16) of the rainbow trout from the two geographically separate populations in Kangwon-do showed distinct genetic distances compared with the other individuals. These results indicate that AFLP markers in this fish could provide genetic information for species identification, analysis of genetic relationships and genome structure, and selection aids for the genetic improvement of economically important traits in fish species.
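The band-sharing statistic referred to above is commonly computed as $BS = 2n_{ab}/(n_a + n_b)$, where $n_{ab}$ is the number of bands two individuals share; a sketch with invented band sets (the formula is the standard one assumed here, not quoted from the paper):

```python
# Sketch of the AFLP band-sharing similarity between two individuals:
# BS = 2 * shared_bands / (bands_a + bands_b). Band labels are invented.

def band_sharing(bands_a, bands_b):
    """Similarity in [0, 1]: 1 means identical band sets, 0 means disjoint."""
    a, b = set(bands_a), set(bands_b)
    shared = len(a & b)
    return 2 * shared / (len(a) + len(b))

# individual A shows 4 bands, individual B shows 3, with 2 bands in common
bs = band_sharing({"b1", "b2", "b3", "b4"}, {"b2", "b3", "b5"})
```

Genetic distance is then often taken as $1 - BS$, which is how similarity values like 0.738 and distances like 0.108 sit on the same scale.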

Ensemble of Nested Dichotomies for Activity Recognition Using Accelerometer Data on Smartphone (Ensemble of Nested Dichotomies 기법을 이용한 스마트폰 가속도 센서 데이터 기반의 동작 인지)

  • Ha, Eu Tteum;Kim, Jeongmin;Ryu, Kwang Ryel
    • Journal of Intelligence and Information Systems
    • /
    • v.19 no.4
    • /
    • pp.123-132
    • /
    • 2013
  • As smartphones are equipped with various sensors such as the accelerometer, GPS, gravity sensor, gyros, ambient light sensor, proximity sensor, and so on, there have been many research works on making use of these sensors to create valuable applications. Human activity recognition is one such application, motivated by various welfare applications such as support for the elderly, measurement of calorie consumption, analysis of lifestyles, analysis of exercise patterns, and so on. One of the challenges faced when using smartphone sensors for activity recognition is that the number of sensors used should be minimized to save battery power. When the number of sensors used is restricted, it is difficult to realize a highly accurate activity recognizer, or classifier, because it is hard to distinguish between subtly different activities relying on only limited information. The difficulty becomes especially severe when the number of activity classes to be distinguished is very large. In this paper, we show that a fairly accurate classifier can be built that distinguishes ten different activities using data from only a single sensor, i.e., the smartphone accelerometer. The approach we take to this ten-class problem is the ensemble of nested dichotomies (END) method, which transforms a multi-class problem into multiple two-class problems. END builds a committee of binary classifiers in a nested fashion using a binary tree. At the root of the binary tree, the set of all classes is split into two subsets of classes by a binary classifier. At a child node of the tree, a subset of classes is again split into two smaller subsets by another binary classifier. Continuing in this way, we obtain a binary tree in which each leaf node contains a single class. This binary tree can be viewed as a nested dichotomy that can make multi-class predictions.
Depending on how a set of classes is split into two subsets at each node, the final tree can differ. Since some classes may be correlated, a particular tree may perform better than the others; however, we can hardly identify the best tree without deep domain knowledge. The END method copes with this problem by building multiple dichotomy trees randomly during learning and then combining the predictions made by each tree during classification. The END method is generally known to perform well even when the base learner is unable to model complex decision boundaries. As the base classifier at each node of the dichotomy, we have used another ensemble classifier, the random forest. A random forest is built by repeatedly generating a decision tree, each time with a different random subset of features, using a bootstrap sample. By combining bagging with random feature-subset selection, a random forest enjoys the advantage of having more diverse ensemble members than simple bagging. As an overall result, our ensemble of nested dichotomies can be seen as a committee of committees of decision trees that can deal with a multi-class problem with high accuracy. The ten classes of activities that we distinguish in this paper are 'Sitting', 'Standing', 'Walking', 'Running', 'Walking Uphill', 'Walking Downhill', 'Running Uphill', 'Running Downhill', 'Falling', and 'Hobbling'. The features used for classifying these activities include not only the magnitude of the acceleration vector at each time point but also the maximum, minimum, and standard deviation of the vector magnitude within a time window of the last 2 seconds, among others. For the experiments comparing the performance of END with that of other methods, accelerometer data were collected every 0.1 second for 2 minutes per activity from 5 volunteers.
Among the 5,900 ($=5{\times}(60{\times}2-2)/0.1$) data points collected for each activity (the data for the first 2 seconds are discarded because they lack time-window data), 4,700 were used for training and the rest for testing. Although 'Walking Uphill' is often confused with other similar activities, END was found to classify all ten activities with a fairly high accuracy of 98.4%. In comparison, the accuracies achieved by a decision tree, a k-nearest neighbor, and a one-versus-rest support vector machine were 97.6%, 96.5%, and 97.6%, respectively.
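The nested-dichotomy construction described above can be sketched with random splits and an oracle standing in for the trained binary (random-forest) classifiers; all names and the reduced activity list below are invented for illustration:

```python
# Sketch of a nested dichotomy: the class set is recursively split into a
# random binary tree; a multi-class prediction is a walk from the root to
# a leaf, taking one binary decision per node.

import random

random.seed(2)

def build_dichotomy(classes):
    """Recursively split the class set into a random binary tree."""
    classes = list(classes)
    if len(classes) == 1:
        return classes[0]  # leaf: a single class
    random.shuffle(classes)
    mid = len(classes) // 2
    return (build_dichotomy(classes[:mid]), build_dichotomy(classes[mid:]))

def leaves(node):
    if not isinstance(node, tuple):
        return {node}
    return leaves(node[0]) | leaves(node[1])

def predict(tree, true_class):
    """Walk the tree using an oracle binary decision at each node (a
    stand-in for the trained random-forest base classifiers)."""
    while isinstance(tree, tuple):
        tree = tree[0] if true_class in leaves(tree[0]) else tree[1]
    return tree

activities = ["Sitting", "Standing", "Walking", "Running", "Falling"]
tree = build_dichotomy(activities)
```

An END classifier builds many such random trees, replaces the oracle with real binary classifiers trained on the corresponding class subsets, and averages the trees' class-probability estimates at prediction time.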