• Title/Summary/Keyword: methods:data analysis

TRAPR: R Package for Statistical Analysis and Visualization of RNA-Seq Data

  • Lim, Jae Hyun;Lee, Soo Youn;Kim, Ju Han
    • Genomics & Informatics / v.15 no.1 / pp.51-53 / 2017
  • High-throughput transcriptome sequencing, also known as RNA sequencing (RNA-Seq), is a standard technology for measuring gene expression with unprecedented accuracy. Numerous Bioconductor packages have been developed for the statistical analysis of RNA-Seq data. However, these tools focus on specific aspects of the data analysis pipeline and are difficult to integrate with one another because of their disparate data structures and processing methods. They also lack visualization methods to confirm the integrity of the data and the process. In this paper, we propose an R-based RNA-Seq analysis pipeline called TRAPR, an integrated tool that facilitates the statistical analysis and visualization of RNA-Seq expression data. TRAPR provides various functions for data management, the filtering of low-quality data, normalization, transformation, statistical analysis, data visualization, and result visualization that allow researchers to build customized analysis pipelines.

Survival Analysis of Gastric Cancer Patients with Incomplete Data

  • Moghimbeigi, Abbas;Tapak, Lily;Roshanaei, Ghodaratolla;Mahjub, Hossein
    • Journal of Gastric Cancer / v.14 no.4 / pp.259-265 / 2014
  • Purpose: Survival analysis of gastric cancer patients requires knowledge of the factors that affect survival time. This paper analyzes the survival of patients with incompletely registered data by using imputation methods. Materials and Methods: Three missing data imputation methods, namely regression, the expectation maximization algorithm, and multiple imputation (MI) using Markov chain Monte Carlo methods, were applied to the data of cancer patients referred to the cancer institute at Imam Khomeini Hospital in Tehran from 2003 to 2008. The data included demographic variables, survival times, and the censoring variable for 471 patients with gastric cancer. After imputation methods were used to account for missing covariate data, the data were analyzed using a Cox regression model and the results were compared. Results: The mean patient survival time after diagnosis was $49.1 \pm 4.4$ months. In the complete case analysis, which used information from 100 of the 471 patients, very wide and uninformative confidence intervals were obtained for the chemotherapy and surgery hazard ratios (HRs). After imputation, however, the maximum confidence interval widths for the chemotherapy and surgery HRs were 8.470 and 0.806, respectively. The minimum width corresponded to MI. Furthermore, the minimum Bayesian and Akaike information criterion values were obtained with MI (-821.236 and -827.866, respectively). Conclusions: Missing value imputation increased the precision and accuracy of the estimates. In addition, MI yielded better results than the expectation maximization algorithm and simple regression imputation methods.
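
The confidence interval widths reported for the hazard ratios can be illustrated with a minimal sketch. It assumes the usual Wald interval for a Cox coefficient, exp(beta ± z·se); the function name `hazard_ratio_ci` and all numeric inputs are hypothetical and are not taken from the paper.

```python
import math

def hazard_ratio_ci(beta, se, z=1.96):
    """Return (HR, lower, upper, width) for a Cox log-hazard coefficient.

    Assumes a Wald-type interval computed on the log scale and then
    exponentiated; z=1.96 gives an approximate 95% interval.
    """
    hr = math.exp(beta)
    lower = math.exp(beta - z * se)
    upper = math.exp(beta + z * se)
    return hr, lower, upper, upper - lower

# Hypothetical inputs: the same coefficient, but a smaller standard error
# when more patients contribute information (as after imputation).
complete_case = hazard_ratio_ci(beta=-0.5, se=0.9)
imputed = hazard_ratio_ci(beta=-0.5, se=0.3)

print("complete-case CI width:", round(complete_case[3], 3))
print("imputed CI width:      ", round(imputed[3], 3))
```

Under these assumptions a smaller standard error gives a narrower interval for the same point estimate, which is the sense in which a narrower HR interval indicates a more precise estimate.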

Recent deep learning methods for tabular data

  • Yejin Hwang;Jongwoo Song
    • Communications for Statistical Applications and Methods / v.30 no.2 / pp.215-226 / 2023
  • Deep learning has made great strides in the field of unstructured data such as text, images, and audio. However, for tabular data analysis, machine learning algorithms such as ensemble methods still outperform deep learning. To match the performance of machine learning algorithms with good predictive power, several deep learning methods for tabular data have been proposed recently. In this paper, we review the latest deep learning models for tabular data and compare the performance of these models on several datasets. In addition, we compare the latest boosting methods with these deep learning methods and suggest guidelines for users who analyze tabular datasets. In regression, machine learning methods are better than deep learning methods. For classification problems, however, deep learning methods perform better than machine learning methods in some cases.

Informal Quality Data Analysis via Sentimental analysis and Word2vec method (감성분석과 Word2vec을 이용한 비정형 품질 데이터 분석)

  • Lee, Chinuk;Yoo, Kook Hyun;Mun, Byeong Min;Bae, Suk Joo
    • Journal of Korean Society for Quality Management / v.45 no.1 / pp.117-128 / 2017
  • Purpose: This study analyzes automobile quality review data to develop an alternative analytical method for informal data. Existing methods for analyzing informal data are based mainly on frequency, whereas this research uses the correlation information among the informal data. Method: After sentiment analysis was performed to acquire user information for automobile products, three classification methods, namely naïve Bayes, random forest, and support vector machine, were employed to accurately classify the informal user opinions with respect to automobile quality. Additionally, Word2vec was applied to discover correlated information in the informal data. Result: Among the three classification methods, random forest was the most effective. Word2vec was able to discover the data most closely related to automobile components. Conclusion: The proposed method is effective in terms of accuracy and sensitivity in the analysis of informal quality data; however, only two sentiments (positive or negative) can be categorized due to human errors. Further studies are required to derive more sentiments in order to classify informal quality data accurately. Word2vec also shows comparable results in identifying the relevance of components precisely.

Comparing the Results of Big-Data with Questionnaire Survey (빅데이터 분석결과와 실증조사 결과의 비교)

  • Kim, Do-Goan;Shin, Seong-Yoon
    • Journal of the Korea Institute of Information and Communication Engineering / v.20 no.11 / pp.2027-2032 / 2016
  • The rapid diffusion of smartphones and the development of data storage and analysis technology have made big data a promising industry for the future. In the marketing field, big-data analysis of social data can be used as an effective and efficient marketing tool for understanding the needs of consumers. Before the age of big data, companies relied on traditional methods such as questionnaire surveys and marketing tests in which a small number of consumers participated. These traditional methods are still in use. Although both big-data analysis and traditional methods are useful for understanding consumers, it is necessary to check whether the results from both carry similar implications. To this end, this study compares the results of big-data analysis with those of a questionnaire survey for several cosmetics brands. The results of the big-data analysis and the questionnaire survey carry similar implications.

A System for Medical Statistical Analysis Using Guide Maps and Interactive Visualization (가이드 맵과 인터랙티브 시각화를 이용한 의료 통계분석 시스템)

  • Lee Don-Soo;Choi Soo-Mi
    • Journal of Korea Multimedia Society / v.8 no.7 / pp.1000-1011 / 2005
  • This paper presents a system for medical statistical analysis that helps medical professionals analyze clinical data more easily and accurately. It recommends appropriate methods according to the distribution of the sample data and provides guide maps composed of icons that help users understand the analysis process. Besides general statistical analysis, it includes statistical methods commonly used in medical fields, such as survival analysis and methods for repeated measurements. The results of the analysis are displayed interactively by 3D glyph-based visualization with uncertainty.

Results of Discriminant Analysis with Respect to Cluster Analyses Under Dimensional Reduction

  • Chae, Seong-San
    • Communications for Statistical Applications and Methods / v.9 no.2 / pp.543-553 / 2002
  • Principal component analysis is applied to reduce p-dimensions into q-dimensions ($q \leq p$). Any partition of a collection of data points with p and q variables generated by the application of six hierarchical clustering methods is re-classified by discriminant analysis. From the application of discriminant analysis through each hierarchical clustering method, correct classification ratios are obtained. The results illustrate which method is more reasonable in exploratory data analysis.
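
The correct classification ratio can be sketched as follows. This is a simplified illustration, not the paper's procedure: the PCA and hierarchical clustering steps are omitted, the cluster labels are assumed to be given, and a nearest-centroid rule stands in for discriminant analysis (the two coincide only under equal priors and a shared spherical covariance). All function names and data points are hypothetical.

```python
def centroids(points, labels):
    """Mean vector of each cluster, keyed by cluster label."""
    sums, counts = {}, {}
    for p, g in zip(points, labels):
        acc = sums.setdefault(g, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[g] = counts.get(g, 0) + 1
    return {g: [s / counts[g] for s in acc] for g, acc in sums.items()}

def reclassify(points, labels):
    """Assign each point to the cluster with the nearest centroid
    (a stand-in for discriminant analysis, see the assumptions above)."""
    cents = centroids(points, labels)

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [min(cents, key=lambda g: sqdist(p, cents[g])) for p in points]

def correct_classification_ratio(points, labels):
    """Share of points whose re-classified label matches the cluster label."""
    predicted = reclassify(points, labels)
    hits = sum(1 for a, b in zip(labels, predicted) if a == b)
    return hits / len(labels)

# Hypothetical 2-D data; the last point is deliberately placed in the
# cluster it does not resemble, so the ratio falls below 1.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (2.0, 2.1), (2.2, 1.9), (0.4, 0.2)]
labels = [0, 0, 0, 1, 1, 1]
ratio = correct_classification_ratio(points, labels)
print("correct classification ratio:", ratio)
```

A partition that the classifier reproduces almost exactly gives a ratio near 1, so comparing ratios across clustering methods, before and after dimension reduction, indicates which partitions are most internally consistent.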

Comparison of missing data methods in clustered survival data using Bayesian adaptive B-Spline estimation

  • Yoo, Hanna;Lee, Jae Won
    • Communications for Statistical Applications and Methods / v.25 no.2 / pp.159-172 / 2018
  • In many epidemiological studies, missing values in the outcome arise due to censoring. Such censoring is what makes survival analysis special and distinguishes it from other analytical methods. There are many methods that deal with censored data in survival analysis. However, few studies have dealt with missing covariates in survival data, and studies dealing with missing covariates are rare when the data are clustered. In this paper, we conducted a simulation study to compare the results of several missing data methods when the data have a clustered, multi-level structure with missing covariates. We modeled the unknown baseline hazard and frailty with Bayesian B-splines to obtain smoother and more accurate estimates. We also used prior information to achieve more accurate results. We assumed that the missing mechanism was MAR. We compared the performance of five different missing data techniques through simulation studies. We also present results from a multi-center study of Korean IBD patients with Crohn's disease (Lee et al., Journal of the Korean Society of Coloproctology, 28, 188-194, 2012).

A Study on the Construction of Database, Online Management System, and Analysis Instrument for Biological Diversity Data (생물다양성 자료의 데이터베이스화와 온라인 관리시스템 및 분석도구 구축에 관한 연구)

  • Bec Kee-Yul;Jung Jong-Chul;Park Seon-Joo;Lee Jong-Wook
    • Journal of Environmental Science International / v.14 no.12 / pp.1119-1127 / 2005
  • The management of data on biological diversity is presently complex and confusing. This study was initiated to construct a database, together with a data management and analysis instrument, in which such data can be stored, in order to correct the problems inherent in the current incoherent storage methods. MySQL was used as the DBMS (database management system), and the program was written mainly in Java. The program was also developed so that it can be adapted to constantly changing requirements; we hope this was accomplished through programming techniques and patterns that can be modified easily and quickly. To this end, an effective and flexible database schema was devised to store and analyze diversity data. Even users with no knowledge of databases should be able to access this management instrument and easily manage the database through the World Wide Web. With data stored in this manner, the analysis instrument supplied on the World Wide Web could be used routinely for various databases. Because the derived results are supplied in simple tables and visualized with simple charts, researchers could easily adapt these methods to various data analyses. Because the diversity data are stored in a database rather than in general files, they can be stored precisely, without errors, with high quality, and in a consistent manner. The methods proposed here should also minimize the errors that might appear in each data search, data movement, or data conversion by supplying management instrumentation on the Web. This study also aimed to derive various results at the required level and to carry out comparative analysis without the lengthy time otherwise needed to obtain similar results from various other methods of analysis. The results of this research may be summarized as follows: 1) This study suggests methods of storage that give consistency to diversity data. 2) This study prepared a suggested foundation for comparative analysis of various data. 3) It may suggest further research, which could lead to more and better standardization of diversity data and to better methods for predicting changes in species diversity.

Big data platform for health monitoring systems of multiple bridges

  • Wang, Manya;Ding, Youliang;Wan, Chunfeng;Zhao, Hanwei
    • Structural Monitoring and Maintenance / v.7 no.4 / pp.345-365 / 2020
  • At present, many machine learning and data mining methods are used for analyzing and predicting structural response characteristics. However, a platform that combines big data analysis methods with online and offline analysis modules has not yet been used in actual projects. This work is dedicated to developing a multifunctional Hadoop-Spark big data platform for bridges that monitors and evaluates serviceability based on a structural health monitoring system. It provides rapid processing, analysis, and storage of the collected health monitoring data. The platform contains offline computing and online analysis modules in a Hadoop-Spark environment. Hadoop provides the overall framework and storage subsystem for the big data platform, while Spark is used for online computing. Finally, the computational performance of the Hadoop-Spark big data platform is verified through several actual analysis tasks. Experiments show that the platform has good fault tolerance, scalability, and online analysis performance. It can meet the daily analysis requirements of 5 s per run for one bridge and 40 s per run for 100 bridges.