Association measure of doubly interval censored data using a Kendall's 𝜏 estimator

  • Kang, Seo-Hyun;Kim, Yang-Jin
    • Communications for Statistical Applications and Methods / v.28 no.2 / pp.151-159 / 2021
  • In this article, our interest is to estimate the association between consecutive gap times which are subject to interval censoring. Such data are referred to as doubly interval censored data (Sun, 2006). In the context of serial events, induced dependent censoring frequently occurs, resulting in biased estimates. In this study, our goal is to propose a Kendall's 𝜏 based association measure for doubly interval censored data. To adjust for the impact of induced dependent censoring, the inverse probability censoring weighting (IPCW) technique is implemented. Furthermore, a multiple imputation technique is applied to recover failure times that are unknown owing to interval censoring. Simulation studies demonstrate that the suggested association estimator performs well with moderate sample sizes. The proposed method is applied to a dataset of children's dental records.
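As a rough illustration of the kind of estimator this abstract describes, the sketch below computes a weighted Kendall's 𝜏 over pairs of fully observed gap times. The function name `weighted_kendall_tau`, the per-subject weights, and the choice to weight each pair by the product of its two subject weights are assumptions made for illustration; the paper's actual IPCW weights and its multiple-imputation step for interval censoring are not reproduced here.

```python
from itertools import combinations


def weighted_kendall_tau(x, y, weights=None):
    """Weighted Kendall's tau over all pairs (i, j).

    x, y    : paired observations (e.g. first and second gap times)
    weights : hypothetical per-subject weights (e.g. inverse censoring
              probabilities); defaults to 1.0 for every subject.
    Tied pairs contribute zero to the numerator.
    """
    n = len(x)
    if weights is None:
        weights = [1.0] * n
    numerator = 0.0
    denominator = 0.0
    for i, j in combinations(range(n), 2):
        pair_weight = weights[i] * weights[j]  # assumed pair weighting
        product = (x[i] - x[j]) * (y[i] - y[j])
        sign = (product > 0) - (product < 0)   # +1 concordant, -1 discordant
        numerator += pair_weight * sign
        denominator += pair_weight
    return numerator / denominator if denominator else float("nan")
```

With unit weights this reduces to the ordinary Kendall's 𝜏 (the tau-a form, which does not correct for ties).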

Iowa Liquor Sales Data Predictive Analysis Using Spark

  • Ankita Paul;Shuvadeep Kundu;Jongwook Woo
    • Asia pacific journal of information systems / v.31 no.2 / pp.185-196 / 2021
  • The paper aims to analyze and predict sales of liquor in the state of Iowa by applying machine learning algorithms to models built for prediction. We use Azure ML and Spark ML for our predictive analysis, which are a legacy machine learning (ML) system and a Big Data ML system, respectively. We worked on the Iowa liquor sales dataset, which comprises records from 2012 to 2019 in 24 columns and approximately 1.8 million rows. We conclude by comparing the models built with different algorithms and their accuracy in predicting sales using both Azure ML and Spark ML. We find that the Linear Regression model has the highest precision and Decision Forest Regression has the fastest computing time with the sample data set using the legacy Azure ML systems. The Decision Tree Regression model in Spark ML has the highest accuracy with the quickest computing time for the entire data set using the Big Data Spark systems.

Effect of Experience, Education, Record Keeping, Labor and Decision Making on Monthly Milk Yield and Revenue of Dairy Farms Supported by a Private Organization in Central Thailand

  • Yeamkong, S.;Koonawootrittriron, S.;Elzo, M.A.;Suwanasopee, T.
    • Asian-Australasian Journal of Animal Sciences / v.23 no.6 / pp.814-824 / 2010
  • The objective of this research was to assess the effect of experience, education, record keeping, labor, and decision making on monthly milk yield per farm (MYF), monthly milk yield per cow (MYC), monthly milk revenue per farm (MRF), and monthly revenue per cow (MRC) of dairy farms supported by a private organization in Central Thailand. The dataset contained 34,082 monthly milk yield and revenue records collected from January 2004 to December 2008 on 497 farms, and information on individual farmer experience and education, record keeping, and decision making obtained with a questionnaire. Farmer experience categories were i) no experience, ii) one year, iii) two to five years, iv) six to ten years, v) eleven to fifteen years, vi) sixteen to twenty years, and vii) more than twenty years. Farmer education categories were i) no education or primary school, ii) high school, and iii) bachelor's degree or higher. Record keeping categories were i) no records and ii) kept records. Labor categories were i) family, ii) hired people, and iii) family and hired people. Decision making categories were i) decisions made by farmers themselves, ii) decisions made with help from government officials, and iii) decisions made with help from organization staff. The mixed linear model contained the fixed effects of year-season, farm location-farm size subclass, experience, education, record keeping, labor, and decision making on sire selection, and the random effects of farm and residual. Results showed that longer experience increased (p<0.05) monthly milk yield (MYF and MYC) and revenue (MRF and MRC). Farms that hired people produced the highest (p<0.05) monthly milk yield (MYF and MYC) and revenue (MRF and MRC), followed by farms that used family, and the lowest values were for farms that used both family and hired people. Better educated farmers produced more MYC and MRC (p<0.05) than less educated farmers. Farms that kept records had higher MYF and MRF (p<0.05) than those without records. Although differences among farms were non-significant, farms that received help from the organization staff had higher monthly milk yield (MYF and MYC) and revenue (MRF and MRC) than those that decided by themselves or with help from government officials. These findings suggested that dairy farmers needed systematic training and continuous support to improve farm milk production and revenues in a sustainable manner.

Estimation of Genetic Trend on Racing Time of Thoroughbred Racehorses (더러브렛 경주마의 주파기록에 대한 유전적 개량량의 추정)

  • Park, K.D.;Son, S.K.;Rho, S.H.;Cho, K.H.;Lee, Z.H.;Cho, B.W.
    • Journal of Animal Science and Technology / v.50 no.1 / pp.27-32 / 2008
  • The objective of this study was to estimate the genetic trend in racing time of Thoroughbred racehorses in Korea, using a total of 209,725 racing records of 9,934 racehorses collected from January 1990 to December 2006. Phenotypic trends for all distances were negative, at rates of -0.148, -0.137, -0.137 and -0.139 second per race year for distances of 1,000 m, 1,400 m or less, 1,700 m or more, and the overall dataset, respectively. Environmental trends were similar to phenotypic ones at all distances, and no trends in permanent environmental and jockey effects by race year were found. Average genetic improvements for racing time were -0.037 and -0.030 second per race year for 1,000 m and the overall dataset, respectively, which is low. However, genetic trends decreased consistently. There is a need to establish a genetic improvement program for the quality of racehorses.

Combining Ego-centric Network Analysis and Dynamic Citation Network Analysis to Topic Modeling for Characterizing Research Trends (자아 중심 네트워크 분석과 동적 인용 네트워크를 활용한 토픽모델링 기반 연구동향 분석에 관한 연구)

  • Yu, So-Young
    • Journal of the Korean Society for information Management / v.32 no.1 / pp.153-169 / 2015
  • This study suggested and examined a combined approach that uses ego-centric network analysis and dynamic citation network analysis to refine the result of LDA-based topic modeling. Two datasets were constructed by collecting Web of Science bibliographic records on White LED, and topic modeling was performed by setting a different number of topics on each dataset. The multi-assigned top keywords of each topic were re-assigned to one specific topic by applying an ego-centric network analysis algorithm. It was found that topical cohesion was strongest, with the best values of internal clustering evaluation indices, when topic modeling was run on the dataset extracted by SPLC network analysis with the number of topics corresponding to the lowest perplexity. Furthermore, the result demonstrates the possibility of developing the suggested approach as a method of multi-faceted research trend detection.

A Study on Establishment Method of Smart Factory Dataset for Artificial Intelligence (인공지능형 스마트공장 데이터셋 구축 방법에 관한 연구)

  • Park, Youn-Soo;Lee, Sang-Deok;Choi, Jeong-Hun
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.21 no.5 / pp.203-208 / 2021
  • At the manufacturing site, workers have been feeding materials into the manufacturing process and leaving input records according to the work instructions, but product LOT tracking has not been possible due to many omissions. Recently, this has been handled by a system that automatically records material input using RFID tags. In particular, the initial automatic recognition rate was good at 97 percent, achieved by automatically generating input information through analysis of RACK (TAG) ID and RACK input time, but the automatic recognition rate continues to decrease due to multi-material RACK, TAG loss, and new product input issues. Establishing artificial intelligence smart factory datasets is expected to contribute to increasing speed and yield (normal product ratio) in the overall production process by improving the automatic recognition rate and real-time monitoring.

CNN based data anomaly detection using multi-channel imagery for structural health monitoring

  • Shajihan, Shaik Althaf V.;Wang, Shuo;Zhai, Guanghao;Spencer, Billie F. Jr.
    • Smart Structures and Systems / v.29 no.1 / pp.181-193 / 2022
  • Data-driven structural health monitoring (SHM) of civil infrastructure can be used to continuously assess the state of a structure, allowing preemptive safety measures to be carried out. Long-term monitoring of large-scale civil infrastructure often involves data-collection using a network of numerous sensors of various types. Malfunctioning sensors in the network are common, which can disrupt the condition assessment and even lead to false-negative indications of damage. The overwhelming size of the data collected renders manual approaches to ensure data quality intractable. The task of detecting and classifying an anomaly in the raw data is non-trivial. We propose an approach to automate this task, improving upon the previously developed technique of image-based pre-processing on one-dimensional (1D) data by enriching the features of the neural network input data with multiple channels. In particular, feature engineering is employed to convert the measured time histories into a 3-channel image comprised of (i) the time history, (ii) the spectrogram, and (iii) the probability density function representation of the signal. To demonstrate this approach, a CNN model is designed and trained on a dataset consisting of acceleration records of sensors installed on a long-span bridge, with the goal of fault detection and classification. The effect of imbalance in anomaly patterns observed is studied to better account for unseen test cases. The proposed framework achieves high overall accuracy and recall even when tested on an unseen dataset that is much larger than the samples used for training, offering a viable solution for implementation on full-scale structures where limited labeled-training data is available.

Exploring indicators of genetic selection using the sniffer method to reduce methane emissions from Holstein cows

  • Yoshinobu Uemoto;Tomohisa Tomaru;Masahiro Masuda;Kota Uchisawa;Kenji Hashiba;Yuki Nishikawa;Kohei Suzuki;Takatoshi Kojima;Tomoyuki Suzuki;Fuminori Terada
    • Animal Bioscience / v.37 no.2 / pp.173-183 / 2024
  • Objective: This study aimed to evaluate whether the methane (CH4) to carbon dioxide (CO2) ratio (CH4/CO2) and methane-related traits obtained by the sniffer method can be used as indicators for genetic selection of Holstein cows with lower CH4 emissions. Methods: The sniffer method was used to simultaneously measure the concentrations of CH4 and CO2 during milking in each milking box of the automatic milking system to obtain CH4/CO2. Methane-related traits, which included CH4 emissions, CH4 per energy-corrected milk, methane conversion factor (MCF), and residual CH4, were calculated. First, we investigated the impact of the model with and without body weight (BW) on the lactation stage and parity for predicting methane-related traits using a first on-farm dataset (Farm 1; 400 records for 74 Holstein cows). Second, we estimated the genetic parameters for CH4/CO2 and methane-related traits using a second on-farm dataset (Farm 2; 520 records for 182 Holstein cows). Third, we compared the repeatability and environmental effects on these traits in both farm datasets. Results: The data from Farm 1 revealed that MCF can be reliably evaluated during the lactation stage and parity, even when BW is excluded from the model. Farm 2 data revealed low heritability and moderate repeatability for CH4/CO2 (0.12 and 0.46, respectively) and MCF (0.13 and 0.38, respectively). In addition, the estimated genetic correlation of milk yield with CH4/CO2 was low (0.07) and that with MCF was moderate (-0.53). The on-farm data indicated that CH4/CO2 and MCF could be evaluated consistently during the lactation stage and parity with moderate repeatability on both farms. Conclusion: This study demonstrated the on-farm applicability of the sniffer method for selecting cows with low CH4 emissions.
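The heritability and repeatability figures quoted in this abstract are ratios of variance components. A minimal sketch, assuming a standard repeatability animal model with additive genetic, permanent environmental, and residual variances; the function name and the numeric inputs are hypothetical and are not values from the study:

```python
def heritability_and_repeatability(var_additive, var_permanent_env, var_residual):
    """Return (heritability, repeatability) from variance components.

    Assumes phenotypic variance = additive + permanent environmental + residual.
    heritability  = additive / phenotypic
    repeatability = (additive + permanent environmental) / phenotypic
    """
    var_phenotypic = var_additive + var_permanent_env + var_residual
    if var_phenotypic <= 0:
        raise ValueError("total phenotypic variance must be positive")
    heritability = var_additive / var_phenotypic
    repeatability = (var_additive + var_permanent_env) / var_phenotypic
    return heritability, repeatability


# Hypothetical variance components, chosen only to show the arithmetic.
h2, rep = heritability_and_repeatability(2.0, 3.0, 5.0)
print(h2, rep)
```

Under these definitions repeatability can never be smaller than heritability, which agrees with the pattern of the reported estimates (0.12 versus 0.46 for CH4/CO2).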

Construction of a Standard Dataset for Liver Tumors for Testing the Performance and Safety of Artificial Intelligence-Based Clinical Decision Support Systems (인공지능 기반 임상의학 결정 지원 시스템 의료기기의 성능 및 안전성 검증을 위한 간 종양 표준 데이터셋 구축)

  • Seung-seob Kim;Dong Ho Lee;Min Woo Lee;So Yeon Kim;Jaeseung Shin;Jin‑Young Choi;Byoung Wook Choi
    • Journal of the Korean Society of Radiology / v.82 no.5 / pp.1196-1206 / 2021
  • Purpose: To construct a standard dataset of contrast-enhanced CT images of liver tumors to test the performance and safety of artificial intelligence (AI)-based algorithms for clinical decision support systems (CDSSs). Materials and Methods: A consensus group of medical experts in gastrointestinal radiology from four national tertiary institutions discussed the conditions to be included in a standard dataset. Seventy-five cases of hepatocellular carcinoma, 75 cases of metastasis, and 30-50 cases of benign lesions were retrieved from each institution, and the final dataset consisted of 300 cases of hepatocellular carcinoma, 300 cases of metastasis, and 183 cases of benign lesions. Only pathologically confirmed cases of hepatocellular carcinomas and metastases were enrolled. The medical experts retrieved the medical records of the patients and manually labeled the CT images. The CT images were saved as Digital Imaging and Communications in Medicine (DICOM) files. Results: The medical experts in gastrointestinal radiology constructed the standard dataset of contrast-enhanced CT images for 783 cases of liver tumors. The performance and safety of the AI algorithm can be evaluated by calculating the sensitivity and specificity for detecting and characterizing the lesions. Conclusion: The constructed standard dataset can be utilized for evaluating the machine-learning-based AI algorithm for CDSS.
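The evaluation step mentioned in this abstract, calculating sensitivity and specificity, can be sketched as follows. This is a generic illustration that assumes binary labels (1 = lesion present, 0 = absent); the function name and the toy labels are hypothetical and do not come from the dataset described above.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 positive, 0 negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels for illustration only.
truth = [1, 1, 1, 0, 0]
predicted = [1, 1, 0, 0, 1]
print(sensitivity_specificity(truth, predicted))
```

Sensitivity is the fraction of true lesions the algorithm finds, and specificity is the fraction of lesion-free cases it correctly leaves unflagged.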

Modelling of Wind Wave Pressure and Free-surface Elevation using System Identification (시스템 식별기법을 활용한 파압과 해수면 모델링)

  • Cieslikiewicz, Witold;Badur, Jordan
    • Journal of Korean Society of Coastal and Ocean Engineers / v.25 no.6 / pp.422-432 / 2013
  • A System Identification method to develop parametric models linking free-surface elevation and wave pressure is presented, and two models are built, allowing for simulation of either wave pressure or free-surface elevation. Linear, time-invariant model structures with static nonlinearities are assumed, and solutions are sought in the form of an autoregressive model with extra input (ARX). An arbitrarily chosen free-surface elevation and wave pressure dataset is used for estimation of the models, which are subsequently verified against datasets with similar pressure gauge depth but different free-surface elevation spectra due to different meteorological conditions. It is shown that free-surface simulation using System Identification methods can perform better than the traditional linear transfer function derived from linear wave theory (LTF), while the quality of wave pressure simulation using the presented methods is generally similar to that obtained with corrected LTF.
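To make the ARX structure named in this abstract concrete, the sketch below simulates a noise-free ARX model. It assumes the common sign convention y[t] + a1*y[t-1] + ... = b1*u[t-1] + ..., zero initial conditions, and no static nonlinearity; the function name and coefficients are hypothetical, and the paper's estimated models are not reproduced.

```python
def simulate_arx(a, b, u):
    """Simulate y[t] = -sum(a[i] * y[t-i]) + sum(b[j] * u[t-j]) for i, j >= 1.

    a : autoregressive coefficients a1..an (assumed sign convention)
    b : input coefficients b1..bm
    u : input sequence (e.g. wave pressure driving free-surface elevation)
    Samples before t = 0 are treated as zero.
    """
    y = []
    for t in range(len(u)):
        value = 0.0
        for i, a_i in enumerate(a, start=1):
            if t - i >= 0:
                value -= a_i * y[t - i]
        for j, b_j in enumerate(b, start=1):
            if t - j >= 0:
                value += b_j * u[t - j]
        y.append(value)
    return y


# Impulse response of a first-order model with hypothetical coefficients.
print(simulate_arx(a=[-0.5], b=[1.0], u=[1.0, 0.0, 0.0, 0.0]))
# → [0.0, 1.0, 0.5, 0.25]
```

In an identification setting the coefficients `a` and `b` would be estimated from measured input and output records, typically by least squares, before a simulation like this is run against independent datasets.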