• Title/Summary/Keyword: support vector cluster

Search Results: 34

Emerging Machine Learning in Wearable Healthcare Sensors

  • Gandha Satria Adi;Inkyu Park
    • Journal of Sensor Science and Technology
    • /
    • v.32 no.6
    • /
    • pp.378-385
    • /
    • 2023
  • Human biosignals provide essential information for diagnosing diseases such as dementia and Parkinson's disease. Owing to the shortcomings of current clinical assessments, noninvasive solutions are required. Machine learning (ML) on wearable sensor data is a promising method for the real-time monitoring and early detection of abnormalities. ML facilitates disease identification, severity measurement, and remote rehabilitation by providing continuous feedback. In the context of wearable sensor technology, ML involves training on observed data for tasks such as classification and regression with applications in clinical metrics. Although supervised ML presents challenges in clinical settings, unsupervised learning, which focuses on tasks such as cluster identification and anomaly detection, has emerged as a useful alternative. This review examines and discusses a variety of ML algorithms such as Support Vector Machines (SVM), Random Forests (RF), Decision Trees (DT), Neural Networks (NN), and Deep Learning for the analysis of complex clinical data.
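
A minimal scikit-learn sketch of the two routes the review contrasts: a supervised SVM trained on labeled recordings, and unsupervised alternatives (cluster identification, anomaly detection) that need no clinical labels. The synthetic gait-like features and all model settings are illustrative assumptions, not data or parameters from the paper.

```python
# Supervised (SVM) vs. unsupervised (clustering / anomaly detection) analysis
# of wearable-sensor features; the synthetic data here is illustrative only.
import numpy as np
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Pretend features: e.g., stride time and tremor amplitude per recording window
healthy = rng.normal([1.0, 0.1], 0.05, size=(100, 2))
impaired = rng.normal([1.3, 0.4], 0.08, size=(100, 2))
X = np.vstack([healthy, impaired])
y = np.array([0] * 100 + [1] * 100)

# Supervised route: needs labeled clinical data, often scarce in practice
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
svm = SVC(kernel="rbf").fit(X_tr, y_tr)
print("SVM accuracy:", svm.score(X_te, y_te))

# Unsupervised alternatives: cluster identification and anomaly detection
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
detector = IsolationForest(random_state=0).fit(healthy)  # fit on "normal" only
print("flagged as abnormal:", (detector.predict(impaired) == -1).mean())
```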

A Distributed High Dimensional Indexing Structure for Content-based Retrieval of Large Scale Data (대용량 데이터의 내용 기반 검색을 위한 분산 고차원 색인 구조)

  • Cho, Hyun-Hwa;Lee, Mi-Young;Kim, Young-Chang;Chang, Jae-Woo;Lee, Kyu-Chul
    • Journal of KIISE:Databases
    • /
    • v.37 no.5
    • /
    • pp.228-237
    • /
    • 2010
  • Although conventional index structures provide various nearest-neighbor search algorithms for high-dimensional data, there are additional requirements: search performance must be improved, and the index must scale to large data. To meet these requirements, we propose a distributed high-dimensional indexing structure based on cluster systems, called the Distributed Vector Approximation-tree (DVA-tree), a two-level structure consisting of a hybrid spill-tree and VA-files. We also describe the algorithms used for constructing the DVA-tree over multiple machines and for performing distributed k-nearest neighbor (k-NN) searches. To evaluate the performance of the DVA-tree, we conduct an experimental study using both real and synthetic datasets. The results show that our proposed method provides significant performance advantages over existing index structures on different kinds of datasets.
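
Below is a simplified, single-machine sketch of the VA-file filter-and-refine idea that underlies the DVA-tree's leaf level: vectors are quantized into coarse grid cells, a cheap lower bound computed from the cells alone filters most of the data, and exact distances are computed only for the surviving candidates. The grid resolution, candidate-set size, and data are illustrative assumptions; the hybrid spill-tree partitioning across machines is omitted.

```python
# VA-file "filter and refine" sketch: quantize, bound, then refine exactly.
import numpy as np

rng = np.random.default_rng(1)
data = rng.random((10_000, 16))           # 16-dimensional vectors in [0, 1)
bits = 4                                   # 2**4 grid cells per dimension
edges = np.linspace(0.0, 1.0, 2**bits + 1)
width = edges[1] - edges[0]
cells = np.clip(np.digitize(data, edges) - 1, 0, 2**bits - 1)  # approximation

def knn(query, k=10):
    qcell = np.clip(np.digitize(query, edges) - 1, 0, 2**bits - 1)
    # Filter: lower-bound the distance using only the compact cell coordinates
    # (two points d cells apart in a dimension are at least (d-1)*width apart)
    lower = np.linalg.norm((np.abs(cells - qcell) - 1).clip(min=0) * width,
                           axis=1)
    # Keep a candidate set generously larger than k, then refine exactly
    cand = np.argsort(lower)[: max(50, 5 * k)]
    exact = np.linalg.norm(data[cand] - query, axis=1)
    return cand[np.argsort(exact)[:k]]

print(knn(rng.random(16)))
```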

Multi-Vector Document Embedding Using Semantic Decomposition of Complex Documents (복합 문서의 의미적 분해를 통한 다중 벡터 문서 임베딩 방법론)

  • Park, Jongin;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.25 no.3
    • /
    • pp.19-41
    • /
    • 2019
  • According to the rapidly increasing demand for text data analysis, research and investment in text mining are being actively conducted not only in academia but also in various industries. Text mining is generally conducted in two steps. In the first step, the text of the collected documents is tokenized and structured to convert the original documents into a computer-readable form. In the second step, tasks such as document classification, clustering, and topic modeling are conducted according to the purpose of the analysis. Until recently, text-mining studies have focused on applications of the second step, such as document classification, clustering, and topic modeling. However, with the discovery that the text-structuring process substantially influences the quality of the analysis results, various embedding methods have been actively studied to improve that quality by preserving the meaning of words and documents when representing text data as vectors. Unlike structured data, which can be fed directly into a variety of operations and traditional analysis techniques, unstructured text must first be structured into a form that a computer can understand. "Embedding" refers to mapping arbitrary objects to a space of a specific dimension while maintaining their algebraic properties, and it is used to structure text data. Recently, attempts have been made to embed not only words but also sentences, paragraphs, and entire documents in various respects. In particular, as the demand for document embedding grows rapidly, many algorithms have been developed to support it. Among them, doc2Vec, which extends word2Vec and embeds each document into a single vector, is the most widely used. However, traditional document embedding methods, represented by doc2Vec, generate a vector for each document using all of the words it contains, so the document vector is affected not only by core words but also by miscellaneous words. Additionally, traditional document embedding schemes usually map each document to a single vector, which makes it difficult to represent a complex document with multiple subjects accurately. In this paper, we propose a new multi-vector document embedding method to overcome these limitations. This study targets documents that explicitly separate body content and keywords. For a document without keywords, the method can be applied after extracting keywords through various analysis methods; since keyword extraction is not the core subject of the proposed method, however, we describe the process of applying it to documents whose keywords are predefined in the text. The proposed method consists of (1) parsing, (2) word embedding, (3) keyword vector extraction, (4) keyword clustering, and (5) multiple-vector generation. The specific process is as follows. All text in a document is tokenized, and each token is represented as an N-dimensional real-valued vector through word embedding. Then, to overcome the limitation that traditional document embedding is affected by miscellaneous as well as core words, the vectors corresponding to each document's keywords are extracted to form a set of keyword vectors per document. Next, clustering is conducted on each document's keyword set to identify the multiple subjects included in the document. Finally, a multi-vector is generated from the keyword vectors constituting each cluster. Experiments on 3,147 academic papers revealed that the traditional single-vector approach cannot map complex documents properly because of interference among the subjects within each vector; with the proposed multi-vector method, complex documents can be vectorized more accurately by eliminating that interference.
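
A minimal sketch of the five-step pipeline, using gensim word2vec and k-means, might look as follows; the toy corpus, the predefined keyword list, and all parameter choices are illustrative assumptions rather than the paper's setup.

```python
# (1) parsing, (2) word embedding, (3) keyword vector extraction,
# (4) keyword clustering, (5) multiple-vector generation -- toy example.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

corpus = [
    "deep learning improves image classification".split(),
    "stock market prediction with neural networks".split(),
    "convolutional networks for image segmentation".split(),
    "volatility forecasting in financial markets".split(),
]
doc_keywords = ["image", "classification", "stock", "prediction"]  # predefined

# (2) word embedding over the whole corpus
w2v = Word2Vec(corpus, vector_size=50, min_count=1, seed=0)

# (3) extract vectors for this document's keywords only, not all its words
kvecs = np.array([w2v.wv[k] for k in doc_keywords])

# (4) cluster keyword vectors to expose the document's multiple subjects
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(kvecs)

# (5) one vector per subject: the mean of each cluster's keyword vectors
doc_vectors = [kvecs[labels == c].mean(axis=0) for c in range(2)]
print(len(doc_vectors), doc_vectors[0].shape)   # two vectors instead of one
```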

Study on the K-scale reflecting the confidence of survey responses (설문 응답에 대한 신뢰도를 반영한 K-척도에 관한 연구)

  • Park, Hye Jung;Pi, Su Young
    • Journal of the Korean Data and Information Science Society
    • /
    • v.24 no.1
    • /
    • pp.41-51
    • /
    • 2013
  • In the information age, internet addiction has become a big issue in modern society, and its adverse effects have been increasing rapidly. Along with the wide supply of internet-connected devices such as high-speed wireless internet, netbooks, and smartphones, the K-scale diagnostic criteria have been used for internet-addiction self-diagnosis tests. The criteria needed to change with the times, and they were revised in March 2012. In this paper, we analyze internet addiction and K-scale features in the actual conditions of college students in the Gyeongbuk area, using the K-scale diagnostic criteria revised in 2012. Internet addiction is diagnosed from the respondents' subjective self-assessment, so respondents may deliberately answer falsely to hide the truth. To reduce such response errors, we incorporate reliability values for the survey responses into the K-scale, enhancing the reliability of the analysis.
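
The abstract does not give the exact weighting scheme, so the sketch below is only a plausible illustration of the idea: each respondent's raw K-scale score is adjusted by a consistency-based reliability weight computed from hypothetical pairs of similar items.

```python
# Illustrative assumption only: down-weight scores from erratic respondents
# using disagreement between hypothetical pairs of similar Likert items.
import numpy as np

responses = np.array([            # rows: respondents, columns: items (1-4)
    [3, 3, 4, 3, 2, 3],
    [1, 4, 1, 4, 1, 4],           # inconsistent answers -> low reliability
])
pairs = [(0, 2), (1, 3), (4, 5)]  # hypothetical pairs of similar items

raw = responses.sum(axis=1)
# Reliability: 1 minus the mean normalized disagreement on paired items
disagree = np.mean([np.abs(responses[:, i] - responses[:, j]) / 3.0
                    for i, j in pairs], axis=0)
reliability = 1.0 - disagree
adjusted = raw * reliability      # reliability-weighted K-scale scores
print(raw, np.round(adjusted, 1))
```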

Web access prediction based on parallel deep learning

  • Togtokh, Gantur;Kim, Kyung-Chang
    • Journal of the Korea Society of Computer and Information
    • /
    • v.24 no.11
    • /
    • pp.51-59
    • /
    • 2019
  • Due to the exponential growth of access information on the web, the need for predicting web users' next access has increased. Various models, such as Markov models, deep neural networks, support vector machines, and fuzzy inference models, have been proposed for web access prediction. For deep learning based on neural network models, training time on large-scale web usage data is very long. To address this problem, deep neural network models are trained in parallel on a cluster of computers. In this paper, we investigated the impact of several important Spark parameters related to data partitions, shuffling, compression, and locality (basic Spark parameters) on training a Multi-Layer Perceptron model on a Spark standalone cluster. Based on this investigation, we tuned the basic Spark parameters and used the tuned configuration to train the Multi-Layer Perceptron model for web access prediction. Through experiments, we show the accuracy of the proposed web access prediction model, as well as the improvement in training time that our tuning of the basic Spark parameters achieves over the default configuration.
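
A minimal PySpark sketch of the setup described, with the four groups of basic Spark parameters set explicitly before training a Multi-Layer Perceptron, might look as follows; the configuration values, the input file, and the layer sizes are illustrative assumptions, not the tuned values from the paper.

```python
# Set partition/shuffle/compression/locality parameters, then train an MLP.
from pyspark.sql import SparkSession
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler

spark = (SparkSession.builder
         .appName("web-access-mlp")
         .config("spark.default.parallelism", "16")     # data partitions
         .config("spark.sql.shuffle.partitions", "16")  # shuffle partitions
         .config("spark.shuffle.compress", "true")      # shuffle compression
         .config("spark.locality.wait", "1s")           # task locality
         .getOrCreate())

# Hypothetical web-usage table with numeric features and a "label" column
df = spark.read.csv("web_usage.csv", header=True, inferSchema=True)
features = [c for c in df.columns if c != "label"]
data = VectorAssembler(inputCols=features, outputCol="features").transform(df)

# layers = [input size, hidden layer, number of next-access classes] (assumed)
mlp = MultilayerPerceptronClassifier(layers=[len(features), 32, 10],
                                     maxIter=100, seed=0)
pred = mlp.fit(data).transform(data)
acc = MulticlassClassificationEvaluator(metricName="accuracy").evaluate(pred)
print("training accuracy:", acc)
```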

Implementation of a Raspberry-Pi-Sensor Network (라즈베리파이 센서 네트워크 구현)

  • Moon, Sangook
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • 2014.10a
    • /
    • pp.915-916
    • /
    • 2014
  • With the upcoming era of the Internet of Things, the study of sensor networks has attracted attention. The Raspberry Pi is a tiny, versatile computer system that can act as a sensor node in a Hadoop cluster network. In this paper, we deployed five Raspberry Pis to construct an experimental testbed of a Hadoop sensor network running the 5-node MapReduce Hadoop software framework. We compared and analyzed the network architecture in terms of efficiency, resource management, and throughput using various parameters, with a support vector machine learner as the test workload. In our experiments, the Raspberry Pi fulfilled the role of a distributed-computing sensor node in the sensor network.
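
A Hadoop Streaming job of the kind such a testbed could run might look like the sketch below, which averages readings per sensor node; the tab-separated "node_id, reading" log format and the two-phase invocation are hypothetical assumptions.

```python
# sensor_mr.py -- Hadoop Streaming sketch: mean sensor reading per node.
# Local test: cat logs.tsv | python sensor_mr.py map | sort \
#             | python sensor_mr.py reduce
import sys

def mapper():
    for line in sys.stdin:                  # emit (node, reading) pairs
        node, value = line.strip().split("\t")
        print(f"{node}\t{value}")

def reducer():
    node, total, count = None, 0.0, 0
    for line in sys.stdin:                  # keys arrive sorted by the shuffle
        key, value = line.strip().split("\t")
        if node is not None and key != node:
            print(f"{node}\t{total / count:.3f}")   # mean reading per node
            total, count = 0.0, 0
        node, total, count = key, total + float(value), count + 1
    if node is not None:
        print(f"{node}\t{total / count:.3f}")

if __name__ == "__main__":
    (mapper if sys.argv[1] == "map" else reducer)()
```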


Wellness Prediction in Diabetes Mellitus Risks Via Machine Learning Classifiers

  • Saravanakumar M, Venkatesh;Sabibullah, M.
    • International Journal of Computer Science & Network Security
    • /
    • v.22 no.4
    • /
    • pp.203-208
    • /
    • 2022
  • The occurrence of Type 2 Diabetes Mellitus (T2DM) is rising globally. Diabetes mellitus of all kinds is estimated to affect over 415 million adults worldwide, and it was the seventh leading cause of death worldwide, with an estimated 1.6 million deaths directly caused by diabetes in 2016. Over 90% of diabetes cases are T2DM, and in the UK most patients have at least one other chronic condition. To evaluate contemporary applications of Big Data (BD) to diabetes care and their future capabilities, a deep review of the major theoretical literature is necessary. Long-term progress in medicine, and in particular in the field of diabetology, is strongly driven by a sequence of changes and innovations. Medical and healthcare data from varied sources, such as diagnoses and treatment plans, help healthcare workers obtain real insights into the development of the diabetes care measures available to them. Apache Spark provides the Resilient Distributed Dataset (RDD), a vital data structure distributed across a cluster of machines. Machine Learning (ML) offers a noteworthy approach to building elegant and automatic algorithms. In this work, common ML algorithms from an ML library, namely Support Vector Classification and Random Forest, are investigated using Jupyter Notebook Python code, where a significant result measure (accuracy) is computed for the models.
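
A minimal sketch of the comparison described, Support Vector Classification versus Random Forest in a Jupyter/Python setting, might look as follows; a synthetic dataset stands in for the clinical T2DM records, which is an illustrative assumption.

```python
# Compare SVC and Random Forest accuracy on a stand-in classification dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=800, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (SVC(kernel="rbf"), RandomForestClassifier(random_state=0)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__, "accuracy:", round(model.score(X_te, y_te), 3))
```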

Impurity profiling and chemometric analysis of methamphetamine seizures in Korea

  • Shin, Dong Won;Ko, Beom Jun;Cheong, Jae Chul;Lee, Wonho;Kim, Suhkmann;Kim, Jin Young
    • Analytical Science and Technology
    • /
    • v.33 no.2
    • /
    • pp.98-107
    • /
    • 2020
  • Methamphetamine (MA) is currently the most abused illicit drug in Korea. MA is produced by chemical synthesis, and the final target drug that is produced contains small amounts of the precursor chemicals, intermediates, and by-products. To identify and quantify these trace compounds in MA seizures, a practical and feasible approach for conducting chromatographic fingerprinting with a suite of traditional chemometric methods and recently introduced machine learning approaches was examined. This was achieved using gas chromatography (GC) coupled with a flame ionization detector (FID) and mass spectrometry (MS). Following appropriate examination of all the peaks in 71 samples, 166 impurities were selected as the characteristic components. Unsupervised (principal component analysis (PCA), hierarchical cluster analysis (HCA), and K-means clustering) and supervised (partial least squares-discriminant analysis (PLS-DA), orthogonal partial least squares-discriminant analysis (OPLS-DA), support vector machines (SVM), and deep neural network (DNN) with Keras) chemometric techniques were employed for classifying the 71 MA seizures. The results of the PCA, HCA, K-means clustering, PLS-DA, OPLS-DA, SVM, and DNN methods for quality evaluation were in good agreement. However, the tested MA seizures possessed distinct features, such as chirality, cutting agents, and boiling points. The study indicated that the established qualitative and semi-quantitative methods will be practical and useful analytical tools for characterizing trace compounds in illicit MA seizures. Moreover, they will provide a statistical basis for identifying the synthesis route, sources of supply, trafficking routes, and connections between seizures, which will support drug law enforcement agencies in their effort to eliminate organized MA crime.
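
The combined unsupervised and supervised chemometric workflow can be sketched as below; random data stands in for the 71 x 166 impurity-peak table, and all model settings are illustrative assumptions (PLS-DA, OPLS-DA, and the Keras DNN are omitted for brevity).

```python
# PCA / HCA / K-means overview plus cross-validated SVM classification of
# impurity profiles (samples x impurity peak areas); data is a stand-in.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((71, 166))                 # stand-in impurity profiles
y = rng.integers(0, 3, 71)                # stand-in seizure-group labels

Xs = StandardScaler().fit_transform(X)
scores = PCA(n_components=2).fit_transform(Xs)               # PCA overview
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xs)
hca = AgglomerativeClustering(n_clusters=3).fit_predict(Xs)  # HCA analogue

# Supervised classification of seizures, cross-validated
svm_acc = cross_val_score(SVC(kernel="rbf"), Xs, y, cv=5).mean()
print("PCA scores:", scores.shape, "SVM CV accuracy:", round(svm_acc, 3))
```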

Bankruptcy Type Prediction Using A Hybrid Artificial Neural Networks Model (하이브리드 인공신경망 모형을 이용한 부도 유형 예측)

  • Jo, Nam-ok;Kim, Hyun-jung;Shin, Kyung-shik
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.3
    • /
    • pp.79-99
    • /
    • 2015
  • The prediction of bankruptcy has been extensively studied in the accounting and finance fields. It can have an important impact on lending decisions and on the profitability of financial institutions in terms of risk management, and many researchers have therefore focused on constructing more robust bankruptcy prediction models. Early studies primarily used statistical techniques such as multiple discriminant analysis (MDA) and logit analysis. However, many studies have demonstrated that artificial intelligence (AI) approaches, such as artificial neural networks (ANN), decision trees, case-based reasoning (CBR), and support vector machines (SVM), have outperformed statistical techniques on business classification problems since the 1990s, because statistical methods impose rigid assumptions. Previous studies on corporate bankruptcy have largely developed prediction models based on financial ratios, but few studies address the specific types of bankruptcy: previous models have generally asked only whether or not a firm will go bankrupt, and most work on bankruptcy types is limited to literature reviews or case studies. This study therefore develops a data mining model that predicts the specific type of bankruptcy, as well as its occurrence, for Korean small- and medium-sized construction firms in terms of profitability, stability, and activity indices, so that firms can take preventive action in advance. We propose a hybrid approach using two artificial neural networks (ANNs): a back-propagation neural network (BPN) using supervised learning for bankruptcy prediction, and a self-organizing map (SOM) using unsupervised learning to classify the bankruptcy data into several types. Based on the constructed model, we predict the bankruptcy of companies by applying the BPN model to a validation set that was not used in model development, and then identify the specific types of bankruptcy from the data the BPN model predicts as bankrupt. To interpret the clusters derived by the SOM model, we calculated the average of selected input variables for each cluster through statistical tests; each cluster represents a bankruptcy type, and the input variables are the financial ratios used to interpret its meaning. The experimental results show that each of the five bankruptcy types has distinct characteristics in terms of financial ratios. Type 1 (severe bankruptcy) has inferior financial statements across the board except for EBITDA (earnings before interest, taxes, depreciation, and amortization) to sales. Type 2 (lack of stability) has a low quick ratio, low stockholders' equity to total assets, and high total borrowings to total assets. Type 3 (lack of activity) has slightly low total asset turnover and fixed asset turnover. Type 4 (lack of profitability) has low retained earnings to total assets and low EBITDA to sales, the profitability indices. Type 5 (recoverable bankruptcy) includes firms that are bankrupt but in relatively good financial condition compared with the other types. Based on these findings, researchers and practitioners in the credit evaluation field can obtain more useful information about the types of corporate bankruptcy. In this paper, we used firms' financial ratios to classify bankruptcy types; it is important to select input variables that both predict bankruptcy correctly and classify its type meaningfully. In a further study, we will include non-financial factors such as the size, industry, and age of the firms, so as to obtain more realistic clusterings of bankruptcy types by combining qualitative factors and reflecting the domain knowledge of experts.
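
The hybrid two-stage idea can be sketched as follows: a back-propagation network flags bankrupt firms, and a self-organizing map then groups the flagged firms into types whose mean financial ratios characterize them. The synthetic ratios, the MiniSom library, and all sizes are illustrative choices, not the paper's.

```python
# Stage 1: supervised BPN flags bankruptcies; stage 2: SOM clusters them.
import numpy as np
from sklearn.neural_network import MLPClassifier
from minisom import MiniSom          # pip install minisom

rng = np.random.default_rng(0)
X = rng.random((300, 6))             # stand-in financial ratios
y = (X[:, 0] + X[:, 1] < 0.8).astype(int)   # stand-in bankruptcy label

# Stage 1: back-propagation network for bankrupt / non-bankrupt
bpn = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                    random_state=0).fit(X, y)
flagged = X[bpn.predict(X) == 1]     # firms predicted bankrupt

# Stage 2: unsupervised SOM groups the flagged firms into candidate types
som = MiniSom(1, 5, X.shape[1], sigma=0.8, learning_rate=0.5, random_seed=0)
som.train_random(flagged, 1000)
types = np.array([som.winner(f)[1] for f in flagged])  # 5 candidate types
for t in range(5):                   # mean ratios characterize each type
    if (types == t).any():
        print("type", t, np.round(flagged[types == t].mean(axis=0), 2))
```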

A Study of Post-processing Methods of Clustering Algorithm and Classification of the Segmented Regions (클러스터링 알고리즘의 후처리 방안과 분할된 영역들의 분류에 대한 연구)

  • Oh, Jun-Taek;Kim, Bo-Ram;Kim, Wook-Hyun
    • The KIPS Transactions:PartB
    • /
    • v.16B no.1
    • /
    • pp.7-16
    • /
    • 2009
  • Some clustering algorithms over-segment an image because they neither consider the spatial information between segmented regions nor adapt the number of clusters, which is fixed in advance; this makes them difficult to apply in practical fields. This paper proposes new post-processing methods that improve the segmentation results of clustering algorithms: a reclassification of inhomogeneous clusters and a region merging based on a Bayesian algorithm. In the reclassification step, an inhomogeneous cluster is first selected based on variance and between-class distance, and its elements are then reclassified into the other clusters; this is repeated until the optimal number of clusters, determined by the minimum average within-class distance, is reached. Similar regions are then merged using a Bayesian algorithm based on the Kullback-Leibler distance between adjacent regions. In this way the over-segmentation problem is solved effectively, and the result can be applied in practical fields. Finally, we design a classification system for the segmented regions to validate the proposed method: the segmented regions are classified by an SVM (Support Vector Machine) using their principal colors and texture information. In experiments, the proposed method proved valid for various real images and was effectively applied to the designed classification system.
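
The merging criterion can be sketched as below: adjacent regions are compared via a symmetrized Kullback-Leibler distance between their feature histograms and merged when the distance falls under a threshold. The stand-in histograms, adjacency, and threshold are illustrative assumptions.

```python
# Symmetrized KL distance between region feature histograms as a merge test.
import numpy as np

def kl(p, q, eps=1e-10):
    p, q = p + eps, q + eps              # avoid log(0) / division by zero
    p, q = p / p.sum(), q / q.sum()      # normalize to distributions
    return float(np.sum(p * np.log(p / q)))

def symmetric_kl(p, q):
    return 0.5 * (kl(p, q) + kl(q, p))

# Stand-in color histograms for three adjacent segmented regions
regions = {
    "A": np.array([10, 40, 30, 20], float),
    "B": np.array([12, 38, 32, 18], float),   # similar to A -> merge
    "C": np.array([50, 5, 5, 40], float),     # dissimilar -> keep separate
}
THRESHOLD = 0.05
for a, b in [("A", "B"), ("B", "C")]:
    d = symmetric_kl(regions[a], regions[b])
    print(a, b, round(d, 4), "merge" if d < THRESHOLD else "keep")
```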