• Title/Summary/Keyword: Kaggle (캐글)


Improvement Method of Classification Rate in ML Antivirus systems using Kaggle Datasets (캐글 데이터셋을 이용한 머신러닝 악성코드 분류시스템에서 분류정확도 향상방법)

  • Kim, Kyungshin
    • Proceedings of the Korean Society of Computer Information Conference / 2019.07a / pp.49-52 / 2019
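  • Most machine-learning-based malware classification systems measure classification accuracy using the 10,868 samples of the Kaggle dataset. The virus bytecode in this dataset contains an excessive number of so-called undefined fields. For certain labels in the Kaggle dataset, the proportion of undefined fields exceeds 75%. In such cases, how the undefined fields are handled has the greatest impact on system performance. This study proposes a method for handling the undefined fields in the Kaggle dataset and examines the resulting classification accuracy. The validity of the proposed method is demonstrated by measuring accuracy across various handling methods.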

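The abstract above does not say which handling methods were compared, so the following is only a minimal sketch of two plausible options, not the author's method. It assumes that undefined bytes appear as the token `??` among hexadecimal byte tokens; `byte_histogram` and its `mode` parameter are hypothetical names introduced for illustration.

```python
def byte_histogram(tokens, undefined="??", mode="drop"):
    """Build a normalized byte-frequency feature vector from hex tokens.

    mode="drop": ignore undefined tokens (256 bins).
    mode="bin":  count undefined tokens in an extra 257th bin.
    Both the token format and the two modes are assumptions for this sketch.
    """
    if mode not in ("drop", "bin"):
        raise ValueError("mode must be 'drop' or 'bin'")
    size = 257 if mode == "bin" else 256
    counts = [0] * size
    for tok in tokens:
        if tok == undefined:
            if mode == "bin":
                counts[256] += 1  # dedicated bin for undefined bytes
            continue
        counts[int(tok, 16)] += 1  # ordinary byte value 0x00..0xFF
    total = sum(counts)
    # Normalize so that files of different lengths are comparable.
    return [c / total for c in counts] if total else [0.0] * size


# Toy input: half of the tokens are undefined.
sample = "00 FF ?? ?? 00 ??".split()
dropped = byte_histogram(sample, mode="drop")
binned = byte_histogram(sample, mode="bin")
```

When undefined tokens dominate a sample, the two modes give very different feature vectors, which is one way the handling choice could affect a downstream classifier.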

On Building the Solar Dataset Form using the Kaggle Platform: The applicability of Machine Learning (캐글 플랫폼 활용한 태양광 데이터셋 형태 구축: 머신 러닝의 적용 가능성)

  • Ko, Ju-won;Park, Jung-jin;Park, Jin-woo;Oh, Do-hee;Kim, Mincheol
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2022.05a / pp.255-258 / 2022
  • As environmental pollution continues, attention to renewable energy has been rising steadily. Although various kinds of renewable energy such as solar, wind, and biomass are generated in Jeju, the release and analysis of related data appear insufficient. This study therefore aims to identify the variables most strongly related to solar panel output and to understand which machine learning methods can be applied to solar power generation data, using the Kaggle platform, which is actively used by a number of scientists. It then plans to propose a form for a solar power generation dataset based on the machine learning methods that could be applied to the data. Specifically, by analyzing solar power generation data on the Kaggle platform, the study will suggest improvements to how solar power data is gathered in Jeju. The results are expected to support data analysis for developing Jeju's solar power industry and to reveal where existing open datasets in Jeju could be improved, so that they can be built in a form suitable for machine learning and AI analytics. Through this process, the study anticipates preparing a method to increase the efficiency of solar power generation.


A Comparative Analysis of the Pre-Processing in the Kaggle Titanic Competition

  • Tai-Sung, Hur;Suyoung, Bang
    • Journal of the Korea Society of Computer and Information / v.28 no.3 / pp.17-24 / 2023
  • Based on 'Titanic - Machine Learning from Disaster', a representative competition on Kaggle, a platform that poses and solves data science challenges, we examine how data preprocessing and model construction affect prediction accuracy and score. We select seven top-ranked, high-scoring solutions, excluding those that use redundant models or ensemble techniques, and compare and analyze their features. We confirmed that most of the preprocessing approaches have unique, differentiated characteristics, and that even where the preprocessing was almost the same, scores differed depending on the type of model. The comparative analysis in this paper is expected to help Kaggle competition participants and data science beginners understand the characteristics and analysis flow of the preprocessing methods used by top-scoring participants.

Design of MBTI Job Recommendation Algorithm Based on Deep Learning (딥러닝 기반의 MBTI 직업 추천 알고리즘 설계)

  • June-Gyeom Kim;Young-Bok Cho
    • Proceedings of the Korean Society of Computer Information Conference / 2023.01a / pp.13-15 / 2023
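  • This paper proposes an algorithm that recommends jobs by identifying a user's disposition in advance, based on a dataset that classifies satisfaction with jobs and majors according to personality and disposition. In addition to a personality type test, recently posted SNS text is applied to the pre-trained dataset to raise the accuracy of the personality type result. Beyond the pre-built dataset, federated learning is carried out on information entered by the subjects (job, major, and satisfaction with job and major) to improve the accuracy of the dataset. To improve training and classification accuracy, the SVM, NB, KNN, and SGD algorithms were compared, yielding accuracies of 67%, 21%, 28%, and 69%, respectively. The dataset was obtained from Kaggle.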


Development of a Data Analysis Program Using a Data Analysis Competition for Primary School Students (초등학생을 위한 데이터 분석대회를 활용한 데이터 분석 프로그램 개발)

  • HakNeung Go;JaeRi Jeong;Youngjun Lee
    • Proceedings of the Korean Society of Computer Information Conference / 2024.01a / pp.471-472 / 2024
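  • This paper develops a data analysis program for primary school students that makes use of a data analysis competition. The program was developed following the ADDIE model. The analysis phase found that students at G Primary School had little experience with spreadsheets, a data analysis tool, and no motivation to learn them, even though the curriculum specifies their use as a technology tool. On this basis, the design phase produced a program through which students can learn spreadsheets and a data analysis competition in which they can practice. In the development phase, an LMS was used to give students data to learn with, while the competition supplied only the data they had studied and a set of problems, giving them an opportunity to practice by taking part. A data literacy assessment tool was selected as the evaluation instrument.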


Developing of New a Tensorflow Tutorial Model on Machine Learning : Focusing on the Kaggle Titanic Dataset (텐서플로우 튜토리얼 방식의 머신러닝 신규 모델 개발 : 캐글 타이타닉 데이터 셋을 중심으로)

  • Kim, Dong Gil;Park, Yong-Soon;Park, Lae-Jeong;Chung, Tae-Yun
    • IEMEK Journal of Embedded Systems and Applications / v.14 no.4 / pp.207-218 / 2019
  • The purpose of this study is to develop a model with which the whole machine learning process can be studied systematically. Whereas the existing model describes the learning process with minimal coding, the new model allows the progress of machine learning to be learned step by step and each process to be visualized using TensorFlow. The new model used all of the existing model's algorithms and confirmed the importance of the variables that affect the target variable, survival. The training data was split into training and validation sets, and the model's performance was evaluated with test data. In the final analysis, ensemble techniques showed high performance across the tutorial models, and the maximum performance improved by up to 5.2% compared with the existing model. Future research needs to build an environment in which machine learning can be studied regardless of data preprocessing method and OS, and in which models that outperform existing ones can be trained.

Detecting Fake Job Recruitment with a Machine Learning Approach (머신 러닝 접근 방식을 통한 가짜 채용 탐지)

  • Taghiyev Ilkin;Jae Heung Lee
    • Smart Media Journal / v.12 no.2 / pp.36-41 / 2023
  • With the advent of applicant tracking systems, online recruitment has become more popular, and recruitment fraud has become a serious problem. This research aims to develop a reliable model to detect recruitment fraud in online recruitment environments in order to reduce financial losses and enhance privacy. The main contribution of this paper is an automated methodology that leverages insights gained from exploratory data analysis to distinguish fraudulent job postings from legitimate ones. Using EMSCAD, a recruitment fraud dataset provided by Kaggle, we trained and evaluated various single-classifier and ensemble-classifier-based machine learning models, and found that the random forest classifier, an ensemble classifier, performed best with an accuracy of 98.67% and an F1 score of 0.81.
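An accuracy near 99% alongside an F1 score near 0.81 is typical when the positive class (here, fraudulent postings) is a small minority, which is an assumption about the data rather than something the abstract states. The sketch below shows how the two metrics diverge from a confusion matrix. The counts are invented for illustration and are not the paper's results, and `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Invented counts: 40 fraudulent postings out of 1,000 (4% positive class).
metrics = classification_metrics(tp=30, fp=4, fn=10, tn=956)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the 956 true negatives dominate the total, accuracy stays high even though a quarter of the fraudulent postings are missed; F1 ignores true negatives and so reflects that miss rate more directly.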

UCI Sensor Data Analysis based on Data Visualization (데이터 시각화 기반의 UCI Sensor Data 분석)

  • Chang, Il-Sik;Choi, Hee-jo;Park, Goo-man
    • Proceedings of the Korean Society of Broadcast Engineers Conference / 2020.11a / pp.21-24 / 2020
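  • Interest in data visualization, which uses visual elements to make large volumes of data visible, is steadily increasing. After preprocessing and dimensionality reduction, data visualization lets the distribution of the data be checked visually. Many public datasets are available, including Kaggle, Amazon AWS datasets, and the UC Irvine Machine Learning Repository. This paper uses the UCI chemical gas dataset and deep learning to analyze data through training under various environments and conditions, and visualizes the feature vectors of the last layer for cases where training results are good and where they are not, so that the results can be grasped intuitively. It also visualizes the multidimensional input data to check whether the visualized results are related to the deep learning training results.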


Detection of Bacteria in Blood in Darkfield Microscopy Image (암시야 현미경 영상에서 혈액 내 박테리아 검출 방법)

  • Park, Hyun-jun
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2021.10a / pp.183-185 / 2021
  • Detecting bacteria in blood could be an important research area in medicine and computer vision. In this paper, we propose a method for detecting bacteria in blood from 366 darkfield microscopy images obtained from Kaggle. We generate a training dataset through preprocessing and data augmentation using image processing techniques, and define a deep learning model to learn from it. The experiments confirmed that the proposed deep learning model effectively detects red blood cells and bacteria in darkfield microscopy images. Although we trained a relatively simple model in this paper, more accurate results can likely be obtained with a deeper model.


Preliminary Test of Google Vertex Artificial Intelligence in Root Dental X-ray Imaging Diagnosis (구글 버텍스 AI을 이용한 치과 X선 영상진단 유용성 평가)

  • Hyun-Ja Jeong
    • Journal of the Korean Society of Radiology / v.18 no.3 / pp.267-273 / 2024
  • Using Vertex AI, a cloud-based platform on which an artificial intelligence learning model can be developed without coding, this study developed an artificial intelligence learning model in a way that is easy for non-experts and confirmed its clinical applicability. Nine dental diseases and 2,999 root disease X-ray images released on the Kaggle site were used as the training data, and the images were randomly divided into training, validation, and test sets. Image classification and multi-label learning were performed with hyperparameter tuning, using a training pipeline in Vertex AI's basic model-training workflow. With AutoML (Automated Machine Learning), the AUC (Area Under the Curve) was 0.967, precision was 95.6%, and recall was 95.2%. The trained artificial intelligence model was confirmed to be sufficient for clinical diagnosis.