• Title/Summary/Keyword: Lasso regression

Efficient estimation and variable selection for partially linear single-index-coefficient regression models

  • Kim, Young-Ju
    • Communications for Statistical Applications and Methods / v.26 no.1 / pp.69-78 / 2019
  • A structured model with both single-index and varying coefficients is a powerful tool in modeling high dimensional data. It has been widely used because the single-index can overcome the curse of dimensionality and varying coefficients can allow nonlinear interaction effects in the model. For high dimensional index vectors, variable selection becomes an important question in the model building process. In this paper, we propose an efficient estimation and a variable selection method based on a smoothing spline approach in a partially linear single-index-coefficient regression model. We also propose an efficient algorithm for simultaneously estimating the coefficient functions in a data-adaptive lower-dimensional approximation space and selecting significant variables in the index with the adaptive LASSO penalty. The empirical performance of the proposed method is illustrated with simulated and real data examples.
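
The adaptive LASSO penalty mentioned in this abstract can be illustrated on an ordinary linear model. The following is a minimal sketch only: it is not the authors' smoothing-spline algorithm, and the function names (`soft_threshold`, `weighted_lasso`, `adaptive_lasso`) and the pilot-fit weighting `1 / |pilot|^gamma` are assumptions made for the example.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def weighted_lasso(X, y, lam, weights, n_iter=200):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam * sum_j w_j * |b_j|."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta; beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the partial residual (column j excluded).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam * weights[j]) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


def adaptive_lasso(X, y, lam, gamma=1.0, eps=1e-8):
    """Two-stage adaptive LASSO: pilot fit, then penalty weights 1 / |pilot|^gamma."""
    p = len(X[0])
    # Hypothetical choice: the pilot estimate is an unpenalized least-squares fit.
    pilot = weighted_lasso(X, y, lam=0.0, weights=[1.0] * p)
    weights = [1.0 / (abs(b) ** gamma + eps) for b in pilot]
    return weighted_lasso(X, y, lam, weights)
```

Coefficients with a small pilot estimate receive a large penalty weight and tend to be set exactly to zero, while large pilot coefficients are shrunk only slightly. This is the sense in which the penalty selects variables.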

Comparative study of prediction models for corporate bond rating (국내 회사채 신용 등급 예측 모형의 비교 연구)

  • Park, Hyeongkwon;Kang, Junyoung;Heo, Sungwook;Yu, Donghyeon
    • The Korean Journal of Applied Statistics / v.31 no.3 / pp.367-382 / 2018
  • Existing studies have developed prediction models for corporate bond ratings using various methods such as linear regression, ordered logit, and random forest. These models are built on financial characteristics that are expected to enter the rating models of the bond rating agencies. However, the number of rating grades used in existing studies varies from 5 to 20, and the models were developed on samples with different target companies and observation periods. Thus, a simple comparison of the prediction accuracies reported in each study cannot determine the best prediction model. To conduct a fair comparison, this study collected corporate bond ratings and financial characteristics from 2013 to 2017 and applied the prediction models to them. In addition, we applied the elastic-net penalty to the linear regression, ordered logit, and ordered probit models. Our comparison shows that data-driven variable selection using the elastic-net improves prediction accuracy in each corresponding model, and that the random forest is the most appropriate model in terms of prediction accuracy, achieving on average 69.6% accuracy in exact rating prediction under 5-fold cross-validation.
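
The "average exact-rating accuracy from 5-fold cross validation" reported here can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the helper names are hypothetical, the majority-class model is only a placeholder for the regression and random forest models in the abstract, and the data set is assumed to have at least `k` samples.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and deal them into k roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(fit, predict, X, y, k=5, seed=0):
    """Mean over folds of the share of held-out samples predicted exactly right."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / k


# Placeholder model (an assumption for the sketch): always predict the most
# frequent rating seen in the training fold.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def predict_majority(model, x):
    return model
```

Any model with a `fit(X_train, y_train)` and `predict(model, x)` pair can be passed in, so the same folds can be reused to compare several models on equal terms.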

Application of Regularized Linear Regression Models Using Public Domain data for Cycle Life Prediction of Commercial Lithium-Ion Batteries (상업용 리튬 배터리의 수명 예측을 위한 고속대량충방전 데이터 정규화 선형회귀모델의 적용)

  • KIM, JANG-GOON;LEE, JONG-SOOK
    • Journal of Hydrogen and New Energy / v.32 no.6 / pp.592-611 / 2021
  • This study uses a rarely available high-throughput cycling data set of 124 commercial lithium iron phosphate/graphite cells cycled under fast-charging conditions, with widely varying cycle lives ranging from 150 to 2,300 cycles and including in-cycle temperature and per-cycle IR measurements. We wrote our own Python code, which reproduces the various data plots and machine learning approaches for cycle life prediction from early cycles, along with further details not presented in the article and its supplementary information. In particular, we applied regularized ridge, lasso, and elastic net linear regression models using features extracted from capacity fade curves, discharge voltage curves, and other data such as internal resistance and cell can temperature. We found that, because costly and lengthy battery testing limits the quantity and quality of the data, careful hyperparameter tuning may be required and model features need to be extracted based on domain knowledge.
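
How the three regularized models named above differ can be seen in the simplest case of a single coefficient. The sketch below assumes the one-dimensional objective `0.5 * (z - b)**2 + lam * (alpha * abs(b) + 0.5 * (1 - alpha) * b**2)`, where `z` is the unpenalized estimate; the parametrization and function names are choices made for this example, not taken from the paper.

```python
def soft_threshold(z, t):
    """S(z, t) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def ridge_1d(z, lam):
    """alpha = 0: proportional shrinkage, never exactly zero unless z is zero."""
    return z / (1.0 + lam)


def lasso_1d(z, lam):
    """alpha = 1: shift toward zero by lam, exactly zero when |z| <= lam."""
    return soft_threshold(z, lam)


def elastic_net_1d(z, lam, alpha):
    """0 < alpha < 1: soft-threshold by lam * alpha, then shrink proportionally."""
    return soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))
```

The penalty strength `lam` and the mixing weight `alpha` are the hyperparameters that would need tuning. With few cells and many extracted features, small changes in either can switch features in or out of the model.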

A Study on the Insolvency Prediction Model for Korean Shipping Companies

  • Myoung-Hee Kim
    • Journal of Navigation and Port Research / v.48 no.2 / pp.109-115 / 2024
  • To develop an insolvency prediction model for shipping companies, we sampled shipping companies that closed between 2005 and 2023. Each closed company was paired with a normal company of similar asset size. In total, data for 84 companies were obtained: 42 closed companies and 42 normal companies. These data were randomly divided into a training set (2/3 of the data) and a test set (1/3 of the data). The training data were used to develop the model, and the test data were used to measure its accuracy. The prediction model for Korean shipping insolvency was developed using financial ratio variables frequently used in previous studies. First, the LASSO technique reduced the 24 independent variables to 9 main variables. Next, we coded insolvent companies as 1 and normal companies as 0 and fitted logistic regression, LDA, and QDA models. The resulting prediction accuracy was 82.14% for the QDA model, 78.57% for the logistic regression model, and 75.00% for the LDA model. In addition, the variables 'Current ratio', 'Interest expenses to sales', 'Total assets turnover', and 'Operating income to sales' were identified as major variables affecting corporate insolvency.
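
The reported accuracies can be checked against the sample sizes with a few lines of arithmetic. The sketch assumes a test set of 28 companies, that is, one third of 42 closed plus 42 normal companies; the function name `implied_correct` is hypothetical.

```python
n_test = 28  # assumption: one third of 84 companies (42 closed + 42 normal)
reported = {"QDA": 82.14, "logistic regression": 78.57, "LDA": 75.00}


def implied_correct(accuracy_pct, n):
    """Return the number of correct predictions out of n whose rate rounds
    to accuracy_pct (two decimals), or None if no such count exists."""
    for k in range(n + 1):
        if round(100 * k / n, 2) == accuracy_pct:
            return k
    return None


counts = {name: implied_correct(acc, n_test) for name, acc in reported.items()}
for name, k in counts.items():
    print(f"{name}: {k} of {n_test} test companies classified correctly")
```

Under this assumption all three percentages correspond to whole numbers of correctly classified companies, which is what one would expect if the test set indeed holds 28 companies.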

A Study for the Drivers of Movie Box-office Performance (영화흥행 영향요인 선택에 관한 연구)

  • Kim, Yon Hyong;Hong, Jeong Han
    • The Korean Journal of Applied Statistics / v.26 no.3 / pp.441-452 / 2013
  • This study analyzed the relationship between key film success factors and box-office records, based on movies released in Korea in the first quarter of 2013. Over-fitting can occur if too many explanatory variables are included in a regression model; in addition, the estimator may be unstable when the explanatory variables are multicollinear. For this reason, selecting the variables with high explanatory power for box-office performance is important. Among the numerous variable selection methods, LASSO estimation applied to a generalized linear model has the smallest prediction error and can efficiently and quickly find, in order, the variables with the highest explanatory power for box-office performance.

On sampling algorithms for imbalanced binary data: performance comparison and some caveats (불균형적인 이항 자료 분석을 위한 샘플링 알고리즘들: 성능비교 및 주의점)

  • Kim, HanYong;Lee, Woojoo
    • The Korean Journal of Applied Statistics / v.30 no.5 / pp.681-690 / 2017
  • Imbalanced binary classification problems arise in many settings, such as fraud detection in banking operations, spam mail detection, and defective product prediction. Several sampling methods, such as over-sampling, under-sampling, and SMOTE, have been developed to overcome the poor prediction performance of binary classifiers when one group is dominant. In this study, we investigate the prediction performance of logistic regression, Lasso, random forest, boosting, and support vector machine in combination with these sampling methods for imbalanced binary data. Four real data sets are analyzed to see whether there is a substantial improvement in prediction performance. We also emphasize some precautions to take when the sampling methods are implemented.
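
Random over-sampling and under-sampling, the two simplest methods named above, can be sketched in a few lines. The function names are hypothetical, the sketch assumes the minority class really is the smaller one, and SMOTE (which synthesizes new minority points) is not covered.

```python
import random


def _split_by_label(y, minority_label):
    """Return (minority indices, majority indices)."""
    minority = [i for i, lab in enumerate(y) if lab == minority_label]
    majority = [i for i, lab in enumerate(y) if lab != minority_label]
    return minority, majority


def random_oversample(X, y, minority_label, seed=0):
    """Duplicate minority rows at random until both classes have equal size."""
    rng = random.Random(seed)
    minority, majority = _split_by_label(y, minority_label)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = majority + minority + extra
    return [X[i] for i in keep], [y[i] for i in keep]


def random_undersample(X, y, minority_label, seed=0):
    """Drop majority rows at random until both classes have equal size."""
    rng = random.Random(seed)
    minority, majority = _split_by_label(y, minority_label)
    keep = rng.sample(majority, len(minority)) + minority
    return [X[i] for i in keep], [y[i] for i in keep]
```

A commonly cited precaution, which may or may not be among those the authors have in mind, is to resample only the training data, so that the test set keeps the original class proportions.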

Feature selection and prediction modeling of drug responsiveness in Pharmacogenomics (약물유전체학에서 약물반응 예측모형과 변수선택 방법)

  • Kim, Kyuhwan;Kim, Wonkuk
    • The Korean Journal of Applied Statistics / v.34 no.2 / pp.153-166 / 2021
  • A main goal of pharmacogenomics studies is to predict an individual's drug responsiveness based on high-dimensional genetic variables. Because the number of variables is large, feature selection is required to reduce it. The selected features are used to construct a predictive model using machine learning algorithms. In the present study, we applied several hybrid feature selection methods, such as combinations of logistic regression, ReliefF, TurF, random forest, and LASSO, to a next-generation sequencing data set of 400 epilepsy patients. We then applied the selected features to machine learning methods including random forest, gradient boosting, and support vector machine, as well as a stacking ensemble method. Our results showed that the stacking model with a hybrid feature selection of random forest and ReliefF performs better than the other combinations of approaches. Based on a 5-fold cross-validation partition, the mean test accuracy of the best model was 0.727 and its mean test AUC was 0.761. The stacking models also appeared to outperform single machine learning predictive models when using the same selected features.

A Modeling of Realtime Fuel Comsumption Prediction Using OBDII Data (OBDII 데이터 기반의 실시간 연료 소비량 예측 모델 연구)

  • Yang, Hee-Eun;Kim, Do-Hyun;Choe, Hoseop
    • KIPS Transactions on Software and Data Engineering / v.10 no.2 / pp.57-64 / 2021
  • This study presents a method for real-time fuel consumption prediction using real data collected from OBDII. With the advent of the era of self-driving cars, electronic control units (ECUs) are becoming more complex, and various studies are attempting to extract and analyze more accurate data from vehicles. However, as ECUs become more complex, it is becoming harder to obtain data from them. To solve this problem, this study developed firmware for acquiring accurate vehicle data, which extracted 53,580 actual driving data sets from vehicles from January to February 2019. Using these data, the ensemble stacking technique was used to increase the accuracy of the real-time fuel consumption prediction model. Ridge, Lasso, XGBoost, and LightGBM were used as base models and Ridge as the meta model; the prediction performance was MAE 0.011 and RMSE 0.017.
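
The two error measures quoted in this abstract are defined as follows. This is a generic sketch; the function names and the toy numbers in the usage note are illustrative and are not the study's data.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: the average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the average squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```

For example, `mae([1, 2, 3], [1, 2, 5])` averages the absolute errors 0, 0, and 2, while `rmse` on the same inputs weights the single large error more heavily. RMSE is never smaller than MAE on the same predictions, which is consistent with the reported pair of 0.011 and 0.017.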

Cox Model Improvement Using Residual Blocks in Neural Networks: A Study on the Predictive Model of Cervical Cancer Mortality (신경망 내 잔여 블록을 활용한 콕스 모델 개선: 자궁경부암 사망률 예측모형 연구)

  • Nang Kyeong Lee;Joo Young Kim;Ji Soo Tak;Hyeong Rok Lee;Hyun Ji Jeon;Jee Myung Yang;Seung Won Lee
    • The Transactions of the Korea Information Processing Society / v.13 no.6 / pp.260-268 / 2024
  • Cervical cancer is the fourth most common cancer in women worldwide; more than 604,000 new cases were reported in 2020 alone, resulting in approximately 341,831 deaths. The Cox regression model is widely adopted in cancer research, but its linearity assumption limits it when nonlinear associations exist. To address this problem, this paper proposes ResSurvNet, a new model that improves the accuracy of cervical cancer mortality prediction using ResNet's residual learning framework. The model's accuracy outperformed that of the DNN, CPH, CoxLasso, Cox Gradient Boost, and RSF models compared in this study. This predictive performance demonstrates great value for early diagnosis and treatment strategy establishment in the management of cervical cancer patients and represents significant progress in the field of survival analysis.

Performance of Prediction Models for Diagnosing Severe Aortic Stenosis Based on Aortic Valve Calcium on Cardiac Computed Tomography: Incorporation of Radiomics and Machine Learning

  • Nam gyu Kang;Young Joo Suh;Kyunghwa Han;Young Jin Kim;Byoung Wook Choi
    • Korean Journal of Radiology / v.22 no.3 / pp.334-343 / 2021
  • Objective: We aimed to develop a prediction model for diagnosing severe aortic stenosis (AS) using computed tomography (CT) radiomics features of aortic valve calcium (AVC) and machine learning (ML) algorithms. Materials and Methods: We retrospectively enrolled 408 patients who underwent cardiac CT between March 2010 and August 2017 and had echocardiographic examinations (240 patients with severe AS on echocardiography [the severe AS group] and 168 patients without severe AS [the non-severe AS group]). Data were divided into a training set (312 patients) and a validation set (96 patients). Using non-contrast-enhanced cardiac CT scans, AVC was segmented, and 128 radiomics features for AVC were extracted. After feature selection was performed with three ML algorithms (least absolute shrinkage and selection operator [LASSO], random forests [RFs], and eXtreme Gradient Boosting [XGBoost]), model classifiers for diagnosing severe AS on echocardiography were developed in combination with three different model classifier methods (logistic regression, RF, and XGBoost). The performance (c-index) of each radiomics prediction model was compared with predictions based on AVC volume and score. Results: The radiomics scores derived from LASSO were significantly different between the severe AS and non-severe AS groups in the validation set (median, 1.563 vs. 0.197, respectively, p < 0.001). A radiomics prediction model based on feature selection by LASSO + model classifier by XGBoost showed the highest c-index of 0.921 (95% confidence interval [CI], 0.869-0.973) in the validation set. Compared to prediction models based on AVC volume and score (c-indexes of 0.894 [95% CI, 0.815-0.948] and 0.899 [95% CI, 0.820-0.951], respectively), eight and three of the nine radiomics prediction models, respectively, showed higher discrimination abilities for severe AS. However, the differences were not statistically significant (p > 0.05 for all).
Conclusion: Models based on the radiomics features of AVC and ML algorithms may perform well for diagnosing severe AS, but the added value compared to AVC volume and score should be investigated further.
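
For a binary outcome such as severe versus non-severe AS, the c-index used above is the share of (positive, negative) patient pairs in which the positive case receives the higher score. The sketch below is a generic illustration with a hypothetical function name and toy scores; it is not the study's evaluation code.

```python
def c_index(scores, labels):
    """Concordance for a binary outcome: the fraction of (positive, negative)
    pairs where the positive case has the higher score; ties count as 0.5."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    concordant = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                concordant += 1.0
            elif sp == sn:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))
```

A value of 0.5 corresponds to scores that rank pairs no better than chance and 1.0 to perfect separation. For a binary outcome this quantity coincides with the area under the ROC curve.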