• Title/Abstract/Keyword: methods of data analysis

Search results: 19,201 items (processing time 0.047 s)

Towards a Deep Analysis of High School Students' Outcomes

  • Barila, Adina;Danubianu, Mirela;Paraschiv, Andrei Marcel
    • International Journal of Computer Science & Network Security / Vol. 21, No. 6 / pp.71-76 / 2021
  • Education is one of the pillars of sustainable development, so the discovery of useful information in its adaptation to new challenges deserves careful treatment. This paper presents the initiation of a process of exploring data on the results obtained by Romanian students in the Baccalaureate (the Romanian high school graduation exam) using data mining methods, aiming at an in-depth analysis that can find and remedy some of the causes of unsatisfactory results. Specifically, a set of public data was collected from the website of the Ministry of Education, and several classification methods were tested on it to find the most efficient modeling algorithm. To our knowledge, this is the first time this type of data has been analyzed in this way.
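The classifier comparison the abstract describes can be sketched with scikit-learn. This is an illustrative sketch only: the features and data below are synthetic stand-ins (the paper's real exam features are not given), and the three algorithms are common choices, not necessarily the ones the authors tested.

```python
# Hedged sketch: compare several classification methods by cross-validated
# accuracy, as in the abstract. All data here is synthetic and hypothetical.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 3))  # stand-ins for e.g. grades, attendance, school index
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)  # pass/fail

models = {
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
}
# 5-fold cross-validated mean accuracy per model; the best one is kept
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
best = max(scores, key=scores.get)
```

The same loop extends to any estimator with a scikit-learn interface, which is how a "most efficient modeling algorithm" search is usually automated.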

Comparison and Analysis of P2P Botnet Detection Schemes

  • Cho, Kyungsan;Ye, Wujian
    • Journal of the Korea Society of Computer and Information / Vol. 22, No. 3 / pp.69-79 / 2017
  • In this paper, we propose a four-phase life cycle of P2P botnets with corresponding detection methods, and suggest future directions for more effective P2P botnet detection. Our proposals are based on an intensive analysis comparing existing P2P botnet detection schemes from different points of view: the life cycle of a P2P botnet, machine learning methods for data mining based detection, the composition of data sets, and performance metrics. Our life cycle model, composed of linear sequential stages, suggests utilizing features from the vulnerable phase rather than the entire life cycle. In addition, we suggest a hybrid detection scheme combining a data mining based method with our proposed life cycle, and present an improved composition of experimental data sets derived from analyzing the limitations of previous works.
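The performance metrics along which such schemes are compared reduce to counts from a confusion matrix. A minimal sketch (the labels below are invented, not taken from any of the surveyed data sets):

```python
# Hedged sketch of the standard detection metrics (precision, recall/TPR, FPR)
# used to compare botnet detection schemes. 1 = botnet flow, 0 = benign flow.
def detection_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0   # true positive rate
    fpr = fp / (fp + tn) if fp + tn else 0.0      # false positive rate
    return {"precision": precision, "recall": recall, "fpr": fpr}

# hypothetical ground truth vs. a detector's predictions
m = detection_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```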

Design and Implementation of Incremental Learning Technology for Big Data Mining

  • Min, Byung-Won;Oh, Yong-Sun
    • International Journal of Contents / Vol. 15, No. 3 / pp.32-38 / 2019
  • Traditional mining techniques struggle with Big Data generated from various digital media and sensors. In addition, when new data are continuously accumulated in a growing volume of text, problems such as memory shortage and the burden of re-learning arise, because the entire data set, including data already analyzed and collected, is analyzed inefficiently. In this paper, we propose a general-purpose classifier and its structure to solve these problems. We depart from current feature-reduction methods and introduce a new scheme that adopts only the changed elements when new features are partially accumulated in this free-style learning environment. The incremental learning module, built by gradually progressive formation, learns only the changed parts of the data without re-processing the current accumulation, whereas traditional methods re-learn the total data whenever data are added or changed. Users can also freely merge new data with previous data through the resource management procedure whenever re-learning is needed. Finally, our analysis confirms the good performance of this method in a Big Data environment owing to its learning efficiency; compared with NB and SVM, all three models achieve an accuracy of approximately 95%. We expect our method to be a viable substitute for large computing systems in high-performance, high-accuracy Big Data analysis using a PC cluster environment.
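The retrain-vs-incremental contrast the abstract draws can be sketched with scikit-learn's `partial_fit` interface, a generic stand-in for the paper's own incremental module (which is not public); the streaming batches below are synthetic:

```python
# Hedged sketch of incremental learning: each arriving batch updates the model
# via partial_fit, with no re-processing of previously seen data.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # must be declared up front for online learning

for batch in range(5):  # batches arriving over time
    Xb = rng.normal(size=(200, 4))
    yb = (Xb[:, 0] - Xb[:, 1] > 0).astype(int)
    clf.partial_fit(Xb, yb, classes=classes)  # learn only from the new batch

Xtest = rng.normal(size=(500, 4))
ytest = (Xtest[:, 0] - Xtest[:, 1] > 0).astype(int)
acc = clf.score(Xtest, ytest)
```

A full retrain would instead call `fit` on the concatenation of all five batches every time, which is exactly the memory and compute cost the paper's design avoids.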

Regression analysis of interval censored competing risk data using a pseudo-value approach

  • Kim, Sooyeon;Kim, Yang-Jin
    • Communications for Statistical Applications and Methods / Vol. 23, No. 6 / pp.555-562 / 2016
  • Interval censored data often occur in observational studies where subjects are followed periodically: instead of an exact failure time, two inspection times that bracket it are observed. Several methods exist for analyzing interval censored failure time data (Sun, 2006); however, in the presence of competing risks, few methods have been suggested for estimating covariate effects from interval censored competing risk data. The sub-distribution hazard model is a commonly used regression model because it has a one-to-one correspondence with the cumulative incidence function. Alternatively, Klein and Andersen (2005) proposed a pseudo-value approach that uses the cumulative incidence function directly. In this paper, we extend the pseudo-value approach to interval censored data to estimate regression coefficients. The pseudo-values generated from the estimated cumulative incidence function then become the response variables in a generalized estimating equation. Simulation studies show that the suggested method performs well in several situations, and an HIV-AIDS cohort study is analyzed as a real data example.
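The jackknife pseudo-values at the core of this approach are easy to sketch. For an estimator θ̂ of the cumulative incidence F₁(t), the i-th pseudo-value is n·θ̂ − (n−1)·θ̂₋ᵢ. For clarity this toy version uses uncensored data and a simple proportion estimator of F₁(t); the paper's actual contribution is estimating F₁(t) from interval censored competing risk data, which is omitted here.

```python
# Illustrative sketch of jackknife pseudo-values for the cumulative incidence
# of cause 1 at time t, using a plain proportion estimator on uncensored data.
import numpy as np

def pseudo_values(times, causes, t, cause=1):
    n = len(times)
    full = np.mean((times <= t) & (causes == cause))  # F1_hat(t) from all n subjects
    pv = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i  # leave subject i out
        loo = np.mean((times[mask] <= t) & (causes[mask] == cause))
        pv[i] = n * full - (n - 1) * loo
    return pv

times = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
causes = np.array([1, 2, 1, 1, 2])  # competing event causes
pv = pseudo_values(times, causes, t=3.5)
```

These pseudo-values then serve as responses in a generalized estimating equation against the covariates, as the abstract states.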

A Study on Gamification Consumer Perception Analysis Using Big Data

  • Se-won Jeon;Youn Ju Ahn;Gi-Hwan Ryu
    • International Journal of Advanced Culture Technology / Vol. 11, No. 3 / pp.332-337 / 2023
  • The purpose of this study was to analyze consumers' perceptions of gamification. Based on the analyzed data, we systematically organize the concept, game elements, and mechanisms of gamification, which is now easily found in healthcare, corporate marketing, and education. This study collected keywords from the portal sites Naver, Daum, and Google from 2018 to 2023 using TEXTOM, a social media analysis tool. The data were analyzed using text mining, semantic network analysis, and CONCOR analysis. Based on the collected data, we examined the relevance of, and clusters related to, gamification. Four clusters emerged: 'Awareness of Gamification', 'Gamification Program', 'Future Technology of Gamification', and 'Use of Gamification'. Through social media analysis we identify consumers' perceptions of the use of gamification, and we check market and consumer perceptions in order to address shortcomings and develop a plan for utilizing gamification.
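The text-mining step feeding semantic network and CONCOR analysis starts from a keyword co-occurrence matrix. A minimal sketch (the documents and keywords below are invented, not TEXTOM output):

```python
# Hedged sketch: count keyword co-occurrences within documents. The resulting
# weighted edge list is the input to semantic network and CONCOR clustering.
from collections import Counter
from itertools import combinations

docs = [  # hypothetical keyword lists extracted from collected posts
    ["gamification", "education", "motivation"],
    ["gamification", "marketing", "consumer"],
    ["education", "gamification", "consumer"],
]

cooccur = Counter()
for keywords in docs:
    # each unordered keyword pair in a document is one co-occurrence
    for a, b in combinations(sorted(set(keywords)), 2):
        cooccur[(a, b)] += 1

strongest = cooccur.most_common(1)[0]  # heaviest edge in the network
```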

Statistical analysis of metagenomics data

  • Calle, M. Luz
    • Genomics & Informatics / Vol. 17, No. 1 / pp.6.1-6.9 / 2019
  • Understanding the role of the microbiome in human health, and how it can be modulated, is becoming increasingly relevant for preventive medicine and for the medical management of chronic diseases. The development of high-throughput sequencing technologies has boosted microbiome research by enabling the study of microbial genomes and a more precise quantification of microbiome abundances and functions. Microbiome data analysis is challenging because it involves high-dimensional, structured, multivariate, sparse data, and because of the data's compositional nature. In this review we outline some of the procedures most commonly used for microbiome analysis that are implemented in R packages, with particular emphasis on the compositional structure of microbiome data. We describe the principles of compositional data analysis and distinguish between standard methods and those that fit into compositional data analysis.
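A core tool of compositional data analysis is the centered log-ratio (clr) transform, which maps relative abundances to log-ratios so that standard multivariate methods apply. A sketch (the pseudocount for zeros is one common choice among several the literature discusses; the counts are invented):

```python
# Hedged sketch of the clr transform: z_ij = log(x_ij) - mean_j(log(x_ij)).
# Rows are samples, columns are taxa; a pseudocount handles zero counts.
import numpy as np

def clr(counts, pseudocount=0.5):
    x = np.asarray(counts, dtype=float) + pseudocount
    logx = np.log(x)
    return logx - logx.mean(axis=-1, keepdims=True)

abundances = np.array([[10, 30, 0, 60],   # hypothetical taxa counts, sample 1
                       [5, 5, 45, 45]])   # sample 2
z = clr(abundances)
```

Each clr-transformed row sums to zero by construction, which is what removes the unit-sum constraint that makes raw compositions unsuitable for standard methods.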

A Study on the Analysis of Accidents for the Reinforced Concrete Method and the Pre-cast Concrete Method

  • 박종근
    • Journal of the Korean Society of Safety / Vol. 10, No. 4 / pp.81-86 / 1995
  • To apply mechanism analysis and cross-tabulation analysis of the factors influencing each accident type to accidents occurring under the reinforced concrete (R.C.) and pre-cast concrete (P.C.) methods on construction sites, the latent hazards of the P.C. method were evaluated from the accident data of H Company covering Jan. 1, 1993 to Dec. 31, 1993. The relationships between accident types and unsafe acts and unsafe conditions were identified, and the hazards of the R.C. and P.C. methods were compared using the data obtained from the causal analysis of each occurrence mechanism. In conclusion, items such as accident causes, accident types, occurrence time, and their characteristics are concentrated on one side in the P.C. method, quite unlike the R.C. method; control measures for accident causes are therefore easily established, with many practical advantages. The frequency and severity of accidents are also much lower in the P.C. method than in the R.C. method.
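Cross-tabulation of accident type against unsafe act/condition amounts to a contingency table and a chi-square test of association. A sketch with wholly hypothetical counts (the paper's accident data is not reproduced here):

```python
# Hedged sketch of cross-tabulation analysis: observed counts, expected counts
# under independence, and the chi-square statistic. Counts are invented.
import numpy as np

observed = np.array([
    [12, 5],   # accident type A: unsafe act, unsafe condition
    [4, 9],    # accident type B
])
row = observed.sum(axis=1, keepdims=True)
col = observed.sum(axis=0, keepdims=True)
expected = row * col / observed.sum()          # independence model
chi2 = ((observed - expected) ** 2 / expected).sum()
```

Comparing `chi2` against the chi-square critical value for the table's degrees of freedom (3.84 at the 0.05 level for a 2x2 table) decides whether accident type and cause category are associated.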


Multi-block Analysis of Genomic Data Using Generalized Canonical Correlation Analysis

  • Jun, Inyoung;Choi, Wooree;Park, Mira
    • Genomics & Informatics / Vol. 16, No. 4 / pp.33.1-33.9 / 2018
  • Recently, there have been many medical studies involving genetic analysis, and many genetic studies have been performed to find genes associated with complex diseases. To find out how genes are related to disease, we need to understand not only the simple relationships among genotypes but also how they are related to phenotypes. Multi-block data, a collection of related variable sets, is used to enhance the analysis of the relationships between different blocks: by identifying relationships in multi-block form, we can understand the associations between blocks and the correlations among them. Several statistical methods have been developed for analyzing multi-block data. In this paper, we use generalized canonical correlation analysis on multi-block data from the Korean Association Resource project, which combines single nucleotide polymorphism blocks, phenotype blocks, and disease blocks.
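The two-block special case conveys the idea behind generalized CCA: find directions in each block whose projections are maximally correlated. Canonical correlations are the singular values of the whitened cross-covariance matrix. The sketch below uses random data sharing one latent signal, not KARE data, and the plain two-block formulation rather than the paper's multi-block generalization:

```python
# Hedged sketch of (two-block) canonical correlation analysis via the SVD of
# Sxx^{-1/2} Sxy Syy^{-1/2}. Blocks X and Y share one latent variable.
import numpy as np

rng = np.random.default_rng(2)
n = 300
latent = rng.normal(size=n)  # shared signal linking the two blocks
X = np.column_stack([latent + rng.normal(size=n) for _ in range(4)])
Y = np.column_stack([latent + rng.normal(size=n) for _ in range(3)])

Xc, Yc = X - X.mean(0), Y - Y.mean(0)
Sxx, Syy, Sxy = Xc.T @ Xc / n, Yc.T @ Yc / n, Xc.T @ Yc / n

def inv_sqrt(S):
    # symmetric inverse square root via eigendecomposition
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

K = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
corrs = np.linalg.svd(K, compute_uv=False)  # canonical correlations, descending
```

Generalized CCA extends this objective to three or more blocks (e.g. SNP, phenotype, and disease blocks) by maximizing a sum of pairwise block correlations.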

A Comparative Study of Predictive Factors for Hypertension using Logistic Regression Analysis and Decision Tree Analysis

  • SoHyun Kim;SungHyoun Cho
    • Physical Therapy Rehabilitation Science / Vol. 12, No. 2 / pp.80-91 / 2023
  • Objective: The purpose of this study is to identify factors that affect the incidence of hypertension using logistic regression and decision tree analysis, and to build and compare predictive models. Design: Secondary data analysis study. Methods: We analyzed 9,859 subjects from the 2019 Korea Health Panel annual data provided by the Korea Institute for Health and Social Affairs and the National Health Insurance Service. Frequency analysis, the chi-square test, binary logistic regression, and decision tree analysis were performed on the data. Results: In the logistic regression analysis, those who were 60 years of age or older (odds ratio, OR=68.801, p<0.001), those who were divorced/widowed/separated (OR=1.377, p<0.001), those who graduated from middle school or below (OR=1, reference), those who did not walk at all (OR=1, reference), those who were obese (OR=5.109, p<0.001), and those who had poor subjective health status (OR=2.163, p<0.001) were more likely to develop hypertension. In the decision tree, those over 60 years of age who were overweight or obese and had graduated from middle school or below had the highest probability of developing hypertension, at 83.3%. Logistic regression analysis showed a specificity of 85.3% and a sensitivity of 47.9%, while decision tree analysis showed a specificity of 81.9% and a sensitivity of 52.9%. Logistic regression and decision tree analysis achieved classification accuracies of 73.6% and 72.6%, respectively. Conclusions: Both logistic regression and decision tree analysis were adequate to explain the predictive model, and both methods can provide useful material for constructing a predictive model for hypertension.
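The study's head-to-head comparison can be sketched as fitting both models on the same risk factors and reporting sensitivity, specificity, and accuracy for each. Everything below is synthetic: two invented binary risk factors stand in for the panel variables, and the fitted numbers will not match the paper's.

```python
# Hedged sketch: logistic regression vs. decision tree on the same synthetic
# binary risk factors, compared by sensitivity, specificity, and accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
n = 1000
age60 = rng.integers(0, 2, n)   # hypothetical: 1 if aged 60+
obese = rng.integers(0, 2, n)   # hypothetical: 1 if obese
logit = -2 + 2.5 * age60 + 1.2 * obese
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)  # hypertension
X = np.column_stack([age60, obese])

results = {}
for name, model in [("logistic", LogisticRegression()),
                    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0))]:
    pred = model.fit(X, y).predict(X)
    tp = ((pred == 1) & (y == 1)).sum(); tn = ((pred == 0) & (y == 0)).sum()
    fp = ((pred == 1) & (y == 0)).sum(); fn = ((pred == 0) & (y == 1)).sum()
    results[name] = {"sensitivity": tp / (tp + fn),
                     "specificity": tn / (tn + fp),
                     "accuracy": (tp + tn) / n}
```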

Technical Trends of Time-Series Data Imputation

  • 김에덴;고석갑;손승철;이병탁
    • Electronics and Telecommunications Trends / Vol. 36, No. 4 / pp.145-153 / 2021
  • Data imputation is a crucial issue in data analysis because data quality is highly correlated with the performance of AI models. In particular, it is difficult to collect quality time-series data under uncertain conditions (for example, electricity blackouts or network delays), so effective methods of time-series data imputation need to be studied. Studies on time-series data imputation can be divided into five categories: statistics-based, matrix-based, regression-based, and deep learning based (RNN-based and GAN-based) methodologies. This study reviews and organizes these methodologies. Recently developed deep learning based imputation methods show excellent performance, but they entail computational costs that make them difficult to use in real-time systems. A direction for future work is therefore to develop imputation methods with low computational cost but high performance for application in the field.
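Two of the simplest statistics-based baselines the survey's taxonomy covers are last-observation-carried-forward and linear interpolation. A sketch over a toy series with NaN gaps:

```python
# Hedged sketch of two statistics-based imputation baselines for time series:
# last-observation-carried-forward (LOCF) and linear interpolation.
import numpy as np

def locf(series):
    out = series.copy()
    for i in range(1, len(out)):
        if np.isnan(out[i]):
            out[i] = out[i - 1]  # carry the last observed value forward
    return out

def linear_interp(series):
    out = series.copy()
    idx = np.arange(len(out))
    known = ~np.isnan(out)
    # fill each gap on the straight line between its surrounding observations
    out[~known] = np.interp(idx[~known], idx[known], out[known])
    return out

x = np.array([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])
```

These baselines are cheap enough for real-time systems, which is exactly the trade-off against the heavier deep learning based methods that the survey highlights.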