• Title/Summary/Keyword: Classification Variables

A Comparative Study of Prediction Models for College Student Dropout Risk Using Machine Learning: Focusing on the case of N university (머신러닝을 활용한 대학생 중도탈락 위험군의 예측모델 비교 연구 : N대학 사례를 중심으로)

  • So-Hyun Kim;Sung-Hyoun Cho
    • Journal of The Korean Society of Integrative Medicine / v.12 no.2 / pp.155-166 / 2024
  • Purpose : This study aims to identify key factors for predicting dropout risk at the university level and to provide a foundation for policy development aimed at dropout prevention. This study explores the optimal machine learning algorithm by comparing the performance of various algorithms using data on college students' dropout risks. Methods : Data on factors influencing dropout risk and propensity were collected from N University. The collected data were applied to several machine learning algorithms, including random forest, decision tree, artificial neural network, logistic regression, support vector machine (SVM), k-nearest neighbor (k-NN) classification, and Naive Bayes. The performance of these models was compared and evaluated, with a focus on predictive validity and the identification of significant dropout factors through the information gain index. Results : The binary logistic regression analysis showed that the year of the program, department, grades, and year of entry had a statistically significant effect on dropout risk. Among the machine learning algorithms, random forest performed best. The relative importance of the predictor variables was highest for department, followed by age, grade, residence, and whether the residence matched the school location. Conclusion : Machine learning-based prediction of dropout risk focuses on the early identification of students at risk. The types and causes of dropout crises vary significantly among students, so it is important to identify them so that appropriate actions and support can be taken to remove risk factors and strengthen protective factors. The relative importance of the factors affecting dropout risk found in this study can guide educational interventions for preventing college student dropout.
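As a minimal illustrative sketch of the kind of algorithm comparison and information-gain ranking this abstract describes (the dataset, column names, and labels below are synthetic stand-ins, not N University's data):

```python
# Sketch: compare the seven classifiers named in the abstract on synthetic,
# numerically encoded dropout data, then rank features by mutual information
# (an information-gain-style criterion). All data here are randomly generated.
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "department": rng.integers(0, 10, n),        # hypothetical encodings
    "age": rng.integers(18, 30, n),
    "grade": rng.normal(3.0, 0.7, n),
    "residence_match": rng.integers(0, 2, n),    # residence matches school location
})
y = (rng.random(n) < 0.2).astype(int)            # 1 = dropout (synthetic)

models = {
    "random forest": RandomForestClassifier(random_state=0),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "neural network": MLPClassifier(max_iter=1000, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "k-NN": KNeighborsClassifier(),
    "naive Bayes": GaussianNB(),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name:20s} AUC = {auc:.3f}")

# Mutual-information feature ranking, highest first
for col, mi in sorted(zip(X.columns, mutual_info_classif(X, y)),
                      key=lambda t: -t[1]):
    print(f"{col:16s} {mi:.3f}")
```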

IPMN-LEARN: A linear support vector machine learning model for predicting low-grade intraductal papillary mucinous neoplasms

  • Yasmin Genevieve Hernandez-Barco;Dania Daye;Carlos F. Fernandez-del Castillo;Regina F. Parker;Brenna W. Casey;Andrew L. Warshaw;Cristina R. Ferrone;Keith D. Lillemoe;Motaz Qadan
    • Annals of Hepato-Biliary-Pancreatic Surgery / v.27 no.2 / pp.195-200 / 2023
  • Backgrounds/Aims: We aimed to build a machine learning tool to help predict low-grade intraductal papillary mucinous neoplasms (IPMNs) in order to avoid unnecessary surgical resection. IPMNs are precursors to pancreatic cancer. Surgical resection remains the only recognized treatment for IPMNs, yet it carries risks of morbidity and potential mortality. Existing clinical guidelines are imperfect in distinguishing low-risk cysts from high-risk cysts that warrant resection. Methods: We built a linear support vector machine (SVM) learning model using a prospectively maintained surgical database of patients with resected IPMNs. Input variables included 18 demographic, clinical, and imaging characteristics. The outcome variable was the presence of low-grade or high-grade IPMN based on post-operative pathology results. Data were divided into a training/validation set and a testing set at a ratio of 4:1. Receiver operating characteristic (ROC) analysis was used to assess classification performance. Results: A total of 575 patients with resected IPMNs were identified. Of them, 53.4% had low-grade disease on final pathology. After classifier training and testing, the linear SVM-based model (IPMN-LEARN) was applied to the validation set. It achieved an accuracy of 77.4%, with a positive predictive value of 83%, a specificity of 72%, and a sensitivity of 83% in predicting low-grade disease in patients with IPMN. The model predicted low-grade lesions with an area under the curve of 0.82. Conclusions: A linear SVM learning model can identify low-grade IPMNs with good sensitivity and specificity. It may be used as a complement to existing guidelines to identify patients who could avoid unnecessary surgical resection.
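A rough sketch of the modeling recipe described above, assuming a standardized linear SVM, a 4:1 train/test split, and ROC analysis (the synthetic matrix stands in for the 18 demographic, clinical, and imaging variables):

```python
# Sketch: linear SVM with a 4:1 split and ROC analysis, on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score, confusion_matrix

# Stand-in for 575 patients with 18 input variables
X, y = make_classification(n_samples=575, n_features=18, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)  # 4:1 ratio

clf = make_pipeline(StandardScaler(), LinearSVC()).fit(X_tr, y_tr)

scores = clf.decision_function(X_te)          # margin scores for the ROC curve
print("AUC:", round(roc_auc_score(y_te, scores), 3))
tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
print("sensitivity:", round(tp / (tp + fn), 3),
      "specificity:", round(tn / (tn + fp), 3))
```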

An early fouling alarm method for a ceramic microfiltration pilot plant using machine learning (머신러닝을 활용한 세라믹 정밀여과 파일럿 플랜트의 파울링 조기 경보 방법)

  • Dohyun Tak;Dongkeon Kim;Jongmin Jeon;Suhan Kim
    • Journal of Korean Society of Water and Wastewater / v.37 no.5 / pp.271-279 / 2023
  • Fouling is an inevitable problem in membrane water treatment plants. It can be measured by trans-membrane pressure (TMP) in constant-flux operation, and chemical cleaning is carried out when TMP reaches a critical value. An early fouling alarm is defined as a warning, given in advance, that the critical TMP value is approaching. The alarm method was developed using a decision tree, one of the machine learning algorithms, and applied to a ceramic microfiltration (MF) pilot plant. First, a decision tree model that classifies the normal/abnormal state of each filtration cycle of the ceramic MF pilot plant was developed, and it was then used to build the early fouling alarm method. The accuracy of the classification model was up to 96.2%, and the early warning was triggered when abnormal cycles occurred three times in a row. The early fouling alarm can anticipate the limit TMP being reached well in advance (e.g., 15-174 hours). By adopting the TMP increase rate and backwash efficiency as machine learning variables, the model accuracy and the reliability of the early fouling alarm method were increased, respectively.
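The alarm logic described above can be sketched roughly as follows: a decision tree labels each filtration cycle normal or abnormal, and the alarm fires once three abnormal cycles occur in a row (feature choices and values here are illustrative assumptions, not the pilot plant's data):

```python
# Sketch: per-cycle normal/abnormal classification plus a 3-in-a-row alarm.
from sklearn.tree import DecisionTreeClassifier

def early_fouling_alarm(labels, run_length=3):
    """Return the index where `run_length` consecutive abnormal (1) cycles
    first occur, or None if no alarm is raised."""
    run = 0
    for i, label in enumerate(labels):
        run = run + 1 if label == 1 else 0
        if run >= run_length:
            return i
    return None

# Hypothetical per-cycle features: [TMP increase rate, backwash efficiency]
X_train = [[0.2, 0.95], [0.3, 0.93], [1.5, 0.70], [1.8, 0.65]]
y_train = [0, 0, 1, 1]                       # 0 = normal, 1 = abnormal
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

new_cycles = [[0.25, 0.94], [1.6, 0.68], [1.7, 0.66], [1.9, 0.64]]
print(early_fouling_alarm(tree.predict(new_cycles)))  # alarm at index 3
```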

Increasing trend of endoscopic drainage utilization for the management of pancreatic pseudocyst: insights from a nationwide database

  • Khaled Elfert;Salomon Chamay;Lamin Dos Santos;Mouhand Mohamed;Azizullah Beran;Fouad Jaber;Hazem Abosheaishaa;Suresh Nayudu;Sammy Ho
    • Clinical Endoscopy / v.57 no.1 / pp.105-111 / 2024
  • Background/Aims: The pancreatic pseudocyst (PP) is a type of fluid collection that typically develops as a delayed complication of acute pancreatitis. Drainage is indicated for symptomatic patients and/or associated complications, such as infection and bleeding. Drainage modalities include percutaneous, endoscopic, laparoscopic, and open drainage. This study aimed to assess trends in the utilization of different drainage modalities for treating PP from 2016 to 2020. The trends in mortality, mean length of hospital stay, and mean hospitalization costs were also assessed. Methods: The National Inpatient Sample database was used to obtain data. The variables were generated using International Classification of Diseases-10 diagnostic and procedural codes. Results: Endoscopic drainage was the most commonly used drainage modality in 2018-2020, with an increasing trend over time (385 procedures in 2018 to 515 in 2020; p=0.003). This was associated with a decrease in the use of the other drainage modalities. A decrease in the hospitalization cost for PP requiring drainage was also noted (29,318 United States dollars [USD] in 2016 to 18,087 USD in 2020; p<0.001). Conclusions: Endoscopic drainage is becoming the most commonly used modality for the treatment of PP in hospitals located in the US. This new trend is associated with decreasing hospitalization costs.
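For illustration only, a yearly-count trend like the one reported (385 endoscopic drainages in 2018 rising to 515 in 2020) can be checked with a simple linear regression over years; the 2016, 2017, and 2019 counts below are invented placeholders, not values from the study:

```python
# Sketch: test for a linear trend in yearly procedure counts.
from scipy.stats import linregress

years = [2016, 2017, 2018, 2019, 2020]
counts = [310, 345, 385, 450, 515]   # only 2018 and 2020 come from the abstract

res = linregress(years, counts)
print(f"slope = {res.slope:.1f} procedures/year, p = {res.pvalue:.4f}")
```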

Research on Mining Technology for Explainable Decision Making (설명가능한 의사결정을 위한 마이닝 기술)

  • Kyungyong Chung
    • Journal of the Institute of Convergence Signal Processing / v.24 no.4 / pp.186-191 / 2023
  • Data processing techniques play a critical role in decision-making, including the handling of missing and outlier data, prediction, and recommendation models. This requires a clear explanation of the validity, reliability, and accuracy of all processes and results. In addition, it is necessary to solve data problems through explainable models using decision trees, inference, and similar techniques, and to pursue model lightweighting by considering various types of learning. The multi-layer mining classification method that applies the sixth principle is a method that, after data preprocessing, discovers multidimensional relationships between variables and attributes that occur frequently in transactions. The paper explains how to discover significant relationships by mining transactions and how to model the data through regression analysis. It develops scalable models and logistic regression models, and proposes mining techniques that generate class labels through data cleansing, relevance analysis, data transformation, and data augmentation to support explainable decision making.
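As a loose sketch of the cleansing, relevance analysis, transformation, and class-label generation pipeline the abstract outlines (every component choice below is an assumption for illustration, not the paper's method):

```python
# Sketch: preprocessing steps feeding an explainable logistic regression.
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

pipe = Pipeline([
    ("cleanse", SimpleImputer(strategy="median")),         # missing-value handling
    ("transform", StandardScaler()),                       # data transformation
    ("relevance", SelectKBest(mutual_info_classif, k=5)),  # relevance analysis
    ("classify", LogisticRegression(max_iter=1000)),       # explainable labels
])
pipe.fit(X, y)
print(pipe.predict(X[:5]), round(pipe.score(X, y), 3))
```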

Variation of Hospital Costs and Product Heterogeneity

  • Shin, Young-Soo
    • Journal of Preventive Medicine and Public Health / v.11 no.1 / pp.123-127 / 1978
  • The major objective of this research is to identify those hospital characteristics that best explain cost variation among hospitals and to formulate linear models that can predict hospital costs. Specific emphasis is placed on hospital output, that is, the identification of diagnosis related patient groups (DRGs) which are medically meaningful and demonstrate similar patterns of hospital resource consumption. A casemix index is developed based on the DRGs identified. Considering the common problems encountered in previous hospital cost research, the following study requirements are established for fulfilling the objectives of this research: 1. Selection of hospitals that exercise similar medical and fiscal practices. 2. Identification of an appropriate data collection mechanism in which demographic and medical characteristics of individual patients as well as accurate and comparable cost information can be derived. 3. Development of a patient classification system in which all the patients treated in hospitals are able to be split into mutually exclusive categories with consistent and stable patterns of resource consumption. 4. Development of a cost finding mechanism through which patient groups' costs can be made comparable across hospitals. A data set of Medicare patients prepared by the Social Security Administration was selected for the study analysis. The data set contained 27,229 record abstracts of Medicare patients discharged from all but one short-term general hospital in Connecticut during the period from January 1, 1971, to December 31, 1972. Each record abstract contained demographic and diagnostic information, as well as charges for specific medical services received. The 'AUTOGRP System' was used to generate 198 DRGs in which the entire range of Medicare patients was split into mutually exclusive categories, each of which shows a consistent and stable pattern of resource consumption. The 'Departmental Method' was used to generate cost information for the groups of Medicare patients that would be comparable across hospitals. To fulfill the study objectives, an extensive analysis was conducted in the following areas: 1. Analysis of DRGs, in which the level of resource use of each DRG was determined, the length of stay or death rate of each DRG in relation to resource use was characterized, and underlying patterns of the relationships among DRG costs were explained. 2. Exploration of resource use profiles of hospitals, in which the magnitude of differences in the resource use or death rates incurred in the treatment of Medicare patients among the study hospitals was explored. 3. Casemix analysis, in which four types of casemix-related indices were generated, and the significance of these indices in the explanation of hospital costs was examined. 4. Formulation of linear models to predict hospital costs of Medicare patients, in which nine independent variables (i.e., casemix index, hospital size, complexity of service, teaching activity, location, casemix-adjusted death rate index, occupancy rate, and casemix-adjusted length of stay index) were used for determining factors in hospital costs. Results from the study analysis indicated that: 1. The system of 198 DRGs for Medicare patient classification was demonstrated not only as a strong tool for determining the pattern of hospital resource utilization of Medicare patients, but also for categorizing patients by their severity of illness. 2. The weighted mean total case cost (TOTC) of the study hospitals for Medicare patients during the study years was $1,127.02 with a standard deviation of $117.20. The hospital with the highest average TOTC ($1,538.15) was 2.08 times more expensive than the hospital with the lowest average TOTC ($743.45). The weighted mean per diem total cost (DTOC) of the study hospitals for Medicare patients during the study years was $107.98 with a standard deviation of $15.18. The hospital with the highest average DTOC ($147.23) was 1.87 times more expensive than the hospital with the lowest average DTOC ($78.49). 3. The linear models for each of the six types of hospital costs were formulated using the casemix index and the eight other hospital variables as the determinants. These models explained variance to the extent of 68.7 percent of total case cost (TOTC), 63.5 percent of room and board cost (RMC), 66.2 percent of total ancillary service cost (TANC), 66.3 percent of per diem total cost (DTOC), 56.9 percent of per diem room and board cost (DRMC), and 65.5 percent of per diem ancillary service cost (DTANC). The casemix index alone explained approximately one half of interhospital cost variation: 59.1 percent for TOTC and 44.3 percent for DTOC. These results demonstrate that the casemix index is the most important determinant of interhospital cost variation. Future research and policy implications of the results of this study are envisioned in the following three areas: 1. Utilization of casemix-related indices in the Medicare data systems. 2. Refinement of data for hospital cost evaluation. 3. Development of a system for reimbursement and cost control in hospitals.
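A toy sketch of the kind of linear cost model described above: regress total case cost on a casemix index plus other hospital variables and compare the variance explained by the full model with that of the casemix index alone (all numbers below are synthetic, not the Connecticut data):

```python
# Sketch: linear hospital-cost model and R^2 comparison, on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 35                                       # hypothetical number of hospitals
casemix = rng.normal(1.0, 0.1, n)            # casemix index
beds = rng.integers(100, 600, n)             # hospital size
occupancy = rng.uniform(0.6, 0.95, n)        # occupancy rate
X = np.column_stack([casemix, beds, occupancy])
totc = 1000 * casemix + 0.2 * beds + 50 * occupancy + rng.normal(0, 40, n)

full = LinearRegression().fit(X, totc)
casemix_only = LinearRegression().fit(casemix.reshape(-1, 1), totc)
print("R^2, full model   :", round(full.score(X, totc), 3))
print("R^2, casemix alone:", round(casemix_only.score(casemix.reshape(-1, 1), totc), 3))
```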

Technical Efficiency in Korea: Interindustry Determinants and Dynamic Stability (기술적(技術的) 효율성(效率性)의 결정요인(決定要因)과 동태적(動態的) 변화(變化))

  • Yoo, Seong-min
    • KDI Journal of Economic Policy / v.12 no.4 / pp.21-46 / 1990
  • This paper, a sequel to Yoo and Lee (1990), attempts to investigate the interindustry determinants of technical efficiency in Korea's manufacturing industries and to conduct an exploratory analysis of the stability of technical efficiency over time. The hypotheses set forth in this paper are mostly drawn from the existing literature on technical efficiency. They are, however, revised and shed new light upon, whenever possible, to accommodate Korea-specific conditions. The set of regressors used in the cross-sectional analysis is chosen, and the hypotheses are posed, in such a way that our results can be made comparable to those of similar studies conducted for the U.S. and Japan by Caves and Barton (1990) and Uekusa and Torii (1987), respectively. It is interesting to observe a certain degree of similarity as well as differentiation between the cross-section evidence on Korea's manufacturing industries and that on the U.S. and Japanese industries. As for the similarities, we find positive and significant effects on technical efficiency of the relative size of production and the extent of specialization in production, and a negative and significant effect of variation in the capital-labor ratio within industries. The curvilinear influence of the concentration ratio on technical efficiency is also confirmed in the Korean case. There are differences, too. We cannot find any significant effects of capital vintage, R&D, or foreign competition on technical efficiency, all of which were shown to be robust determinants of technical efficiency in the U.S. case. We note, however, that the variables measuring the capital vintage effect, R&D, and the degree of foreign competition in Korean markets are suspected to suffer from serious measurement errors incurred in data collection and/or in the conversion of the industrial classification system into the KSIC (Korea Standard Industrial Classification) system. Thus, we are reluctant to accept the findings on the effects of these variables as definitive conclusions on Korea's industrial organization. Another finding that interests us is that the cross-industry evidence becomes consistently stronger when we use the efficiency estimates based on gross output instead of value added, which provides us with an ex post empirical criterion for choosing between the two output measures in estimating the production frontier. We also conduct exploratory analyses of the stability of the estimates of technical efficiency in Korea's manufacturing industries. Though the method of testing stability employed in this paper is by no means complete, we cannot find strong evidence that our efficiency estimates are stable over time. The outcome is both surprising and disappointing. We can also show that the instability of technical efficiency over time is partly explained by the way we constructed our measures of technical efficiency. To the extent that our efficiency estimates depend on the shape of the empirical distribution of plants in the input-output space, any movements of the production frontier over time are not reflected in the estimates, and it is possible, for example, to associate a higher level of technical efficiency with a downward movement of the production frontier over time. Thus, we find that efficiency measures that take into account not only the distributional changes but also the shifts of the production frontier over time increase the extent of stability and are more appropriate for use in a dynamic context.
The remaining portion of the instability of technical efficiency over time is not explained satisfactorily in this paper, and future research should address this question.
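For readers unfamiliar with frontier-based efficiency measures, here is a toy illustration of one common deterministic-frontier approach (corrected OLS); the paper's actual estimation method may differ, and all data below are synthetic:

```python
# Sketch: corrected-OLS technical efficiency from a log-linear production
# function; efficiency is the gap to the best residual, on synthetic plants.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 200                                      # hypothetical plants
log_k = rng.normal(5, 1, n)                  # log capital
log_l = rng.normal(4, 1, n)                  # log labor
ineff = rng.exponential(0.2, n)              # one-sided inefficiency term
log_y = 0.4 * log_k + 0.6 * log_l - ineff + rng.normal(0, 0.05, n)

X = np.column_stack([log_k, log_l])
resid = log_y - LinearRegression().fit(X, log_y).predict(X)
efficiency = np.exp(resid - resid.max())     # 1.0 = on the estimated frontier
print("mean technical efficiency:", round(float(efficiency.mean()), 3))
```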

Anomaly Detection for User Action with Generative Adversarial Networks (적대적 생성 모델을 활용한 사용자 행위 이상 탐지 방법)

  • Choi, Nam woong;Kim, Wooju
    • Journal of Intelligence and Information Systems / v.25 no.3 / pp.43-62 / 2019
  • At one time, the anomaly detection field relied on deciding whether an observation was abnormal based on statistics derived from specific data. This methodology was workable because data dimensionality was low in the past, so classical statistical methods were effective. However, as the characteristics of data have become more complex in the era of big data, it has become difficult to accurately analyze and predict the data generated throughout industry in the conventional way. Supervised learning algorithms based on SVMs and decision trees were therefore adopted. However, a supervised model predicts test data accurately only when the class distribution of the training data is balanced, and most data generated in industry have imbalanced classes, so the predictions of a supervised model are not always valid. To overcome these drawbacks, many studies now use unsupervised models that are not influenced by the class distribution, such as autoencoders or generative adversarial networks. In this paper, we propose a method to detect anomalies using generative adversarial networks. AnoGAN, introduced by Schlegl et al. (2017), is a classification model that performs anomaly detection on medical images; it is built from convolutional neural networks and has been used in the detection field. By contrast, anomaly detection for sequence data using generative adversarial networks has received little attention compared with image data. Li et al. (2018) proposed a model based on LSTM, a type of recurrent neural network, to classify anomalies in numerical sequence data, but it was not applied to categorical sequence data, nor did it use the feature matching method of Salimans et al. (2016). This suggests that much remains to be explored in the anomaly classification of sequence data with generative adversarial networks. To learn the sequence data, the generative adversarial network is built from LSTMs: the generator is a 2-stacked LSTM composed of a 32-dim and a 64-dim hidden unit layer, and the discriminator is an LSTM with a 64-dim hidden unit layer. Whereas existing work on anomaly detection for sequence data derives anomaly scores from the entropy of the probability assigned to the actual data, this paper, as mentioned earlier, derives anomaly scores using the feature matching technique. In addition, the process of optimizing the latent variables was designed with an LSTM to improve model performance. The modified generative adversarial model was superior to the autoencoder in precision in all experiments and was approximately 7% higher in accuracy. In terms of robustness, the generative adversarial network also performed better than the autoencoder: because it learns the data distribution from real categorical sequence data, it is not dominated by a single normal pattern, whereas the autoencoder is. In the robustness test, the accuracy of the autoencoder was 92% versus 96% for the generative adversarial network, and sensitivity was 40% for the autoencoder versus 51% for the generative adversarial network. Experiments were also conducted to show how much performance changes with differences in the optimization structure of the latent variables; as a result, sensitivity improved by about 1%. These results offer a new perspective on optimizing latent variables, which had previously received relatively little attention.
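A highly simplified PyTorch sketch of the feature-matching anomaly score described above, using an LSTM discriminator with a 64-dim hidden layer as in the abstract; the GAN training loop and the LSTM generator are omitted, and a random tensor stands in for the generator's reconstruction:

```python
# Sketch: feature-matching anomaly score from an (untrained) LSTM discriminator.
import torch
import torch.nn as nn

class LSTMDiscriminator(nn.Module):
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def features(self, x):
        out, _ = self.lstm(x)
        return out[:, -1]                    # last hidden state as the feature

    def forward(self, x):
        return torch.sigmoid(self.head(self.features(x)))

def feature_matching_score(disc, real_seq, generated_seq):
    """Anomaly score: distance between discriminator features of the test
    sequence and of its generated reconstruction."""
    with torch.no_grad():
        return torch.norm(disc.features(real_seq) - disc.features(generated_seq),
                          dim=1)

disc = LSTMDiscriminator(n_features=8)
real = torch.randn(4, 20, 8)                 # 4 sequences of length 20
fake = torch.randn(4, 20, 8)                 # stand-in for generator output
print(feature_matching_score(disc, real, fake))
```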

Clinicoradiologic Characteristics of Intradural Extramedullary Conventional Spinal Ependymoma (경막내 척수외 뇌실막세포종의 임상 영상의학적 특징)

  • Seung Hyun Lee;Yoon Jin Cha;Yong Eun Cho;Mina Park;Bio Joo;Sang Hyun Suh;Sung Jun Ahn
    • Journal of the Korean Society of Radiology / v.84 no.5 / pp.1066-1079 / 2023
  • Purpose Distinguishing intradural extramedullary (IDEM) spinal ependymoma from myxopapillary ependymoma is challenging due to the location of IDEM spinal ependymoma. This study aimed to investigate the utility of clinical and MR imaging features for differentiating between IDEM spinal and myxopapillary ependymomas. Materials and Methods We compared tumor size, longitudinal/axial location, enhancement degree/pattern, tumor margin, signal intensity (SI) of the tumor on T2-weighted images and T1-weighted images (T1WI), increased cerebrospinal fluid (CSF) SI caudal to the tumor on T1WI, and CSF dissemination in 12 pathologically confirmed IDEM spinal ependymomas and 10 myxopapillary ependymomas. Furthermore, classification and regression tree (CART) analysis was performed to identify the clinical and MR features for differentiating between IDEM spinal and myxopapillary ependymomas. Results Patients with IDEM spinal ependymomas were older than those with myxopapillary ependymomas (48 years vs. 29.5 years, p < 0.05). A high SI of the tumor on T1WI was more frequently observed in IDEM spinal ependymomas than in myxopapillary ependymomas (p = 0.02). Conversely, CSF dissemination was seen in myxopapillary ependymomas, and increased CSF SI caudal to the tumor on T1WI was observed more frequently in myxopapillary ependymomas than in IDEM spinal ependymomas (p < 0.05). Dissemination to the CSF space and increased CSF SI caudal to the tumor on T1WI were the most important variables in the CART analysis. Conclusion Clinical and radiological variables may help differentiate between IDEM spinal and myxopapillary ependymomas.
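The CART analysis above can be pictured with a small decision tree over the clinical/MR variables; the toy values below are invented for illustration and do not reproduce the study's cases:

```python
# Sketch: CART-style decision tree separating the two tumor types.
from sklearn.tree import DecisionTreeClassifier, export_text

# columns: [age, high T1 SI (0/1), CSF dissemination (0/1), increased caudal CSF SI (0/1)]
X = [[48, 1, 0, 0], [52, 1, 0, 0], [55, 0, 0, 0],
     [29, 0, 1, 1], [31, 0, 1, 1], [27, 0, 0, 1]]
y = [0, 0, 0, 1, 1, 1]            # 0 = IDEM spinal, 1 = myxopapillary

cart = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(cart, feature_names=[
    "age", "high_T1_SI", "CSF_dissemination", "caudal_CSF_SI"]))
```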

Development of Systematic Process for Estimating Commercialization Duration and Cost of R&D Performance (기술가치 평가를 위한 기술사업화 기간 및 비용 추정체계 개발)

  • Jun, Seoung-Pyo;Choi, Daeheon;Park, Hyun-Woo;Seo, Bong-Goon;Park, Do-Hyung
    • Journal of Intelligence and Information Systems / v.23 no.2 / pp.139-160 / 2017
  • Technology commercialization creates effective economic value by linking a company's R&D processes and outputs to the market. Technology commercialization is important in that it enables a company to acquire and maintain a sustained competitive advantage. For a specific technology to be commercialized, it goes through the stages of technology planning, technology research and development, and commercialization, a process that involves a great deal of time and money. Therefore, the duration and cost of technology commercialization are important decision information for determining a market entry strategy, and even more important information for a technology investor seeking to rationally evaluate the value of a technology. It is thus very important to estimate the duration and cost of technology commercialization scientifically. However, research on technology commercialization is insufficient and related methodologies are lacking. In this study, we propose an evaluation model that can estimate the duration and cost of commercializing R&D technology for small and medium-sized enterprises. To accomplish this, this study collected the public data of the National Science & Technology Information Service (NTIS) and the survey data provided by the Small and Medium Business Administration, and developed an estimation model for the commercialization duration and cost of R&D performance using these data, based on the market approach, one of the technology valuation methods. Specifically, this study defined the process of commercialization as consisting of development planning, development progress, and commercialization. We collected the data from the NTIS database and the SME technical statistics survey of the Small and Medium Business Administration, and derived the key variables, such as stage-wise R&D costs and duration, factors of the technology itself, factors of the technology development, and environmental factors. First, given the data, we estimated the costs and duration at each technology readiness level (basic research, applied research, development research, prototype production, and commercialization) for each industry classification. Then, we developed and verified a research model for each industry classification. The results of this study can be summarized as follows. First, the results can be reflected in a technology valuation model and used to estimate the objective economic value of a technology. The duration and cost from the technology development stage to the commercialization stage are critical factors that strongly influence the amount by which future sales from the technology are discounted. The results of this study can contribute to more reliable technology valuation because they estimate the commercialization duration and cost scientifically, based on past data. Second, we verified models of various kinds, including statistical models and data mining models. The statistical models help us find the important factors for estimating the duration and cost of technology commercialization, and the data mining models give us rules or algorithms that can be applied to an advanced technology valuation system. Finally, this study reaffirms the importance of commercialization costs and durations, which had not been actively studied previously. The results confirm the significant factors that affect commercialization costs and duration, and furthermore show that these factors differ by industry classification. Practically, the results of this study can be reflected in a technology valuation system, which national research institutes can provide to R&D staff for sophisticated technology valuation. The relevant logic or algorithms from the research results can be implemented independently and directly reflected in such a system, so practitioners can use them immediately. In conclusion, this study makes not only theoretical but also practical contributions.
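A simplified sketch of the estimation idea described above: group past projects by industry classification and regress commercialization duration on stage-wise cost and other factors (the table, column names, and values below are hypothetical, not NTIS or SMBA data):

```python
# Sketch: per-industry regression of commercialization duration on
# hypothetical technology and development factors.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "industry":  ["A", "A", "A", "B", "B", "B"],
    "rnd_cost":  [1.2, 3.5, 5.0, 0.8, 2.1, 4.2],  # hypothetical units
    "team_size": [3, 6, 8, 2, 5, 7],
    "duration":  [10, 18, 24, 8, 14, 22],         # months
})

for industry, g in df.groupby("industry"):
    model = LinearRegression().fit(g[["rnd_cost", "team_size"]], g["duration"])
    print(industry, "coef:", model.coef_.round(2),
          "intercept:", round(float(model.intercept_), 1))
```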