• Title/Summary/Keyword: Information Repository

A Technique to Detect Change-Coupled Files Using the Similarity of Change Types and Commit Time (변경 유형의 유사도 및 커밋 시간을 이용한 파일 변경 결합도)

  • Kim, Jung Il;Lee, Eun Joo
    • KIPS Transactions on Software and Data Engineering, v.3 no.2, pp.65-72, 2014
  • Change coupling is a measure of how strongly two entities are related through change. When two source files have frequently been changed together, they are regarded as change-coupled and will probably be changed together again in the near future. In previous studies, the change coupling between two files was defined by the number of commits in which both files changed together. However, this frequency-based technique has limitations because of 'tangled changes', which occur frequently in development environments based on version control systems. A tangled change means that several code hunks are changed at the same time even though they have no relation to each other. In this paper, the change types of the code hunks are also used to define change coupling, in addition to the common commit times of the target files. First, a frequency vector over the extracted change types is defined for each file, and then the similarity of change patterns is calculated using the cosine similarity measure. We conducted case studies on the open source projects Eclipse JDT and CDT. The results show the applicability of the proposed method compared with previous studies.
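
  The abstract does not spell out the vector construction, but a minimal sketch of the core step, comparing two files' change-type frequency vectors with cosine similarity, could look like the following; the change-type labels and the `Counter` representation are illustrative assumptions, not the paper's actual encoding:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse frequency vectors (dicts)."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[k] * vec_b[k] for k in shared)
    norm_a = sqrt(sum(v * v for v in vec_a.values()))
    norm_b = sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical change-type histories of two files over their common commits.
file_a_changes = Counter({"STATEMENT_INSERT": 5, "METHOD_RENAMING": 2, "CONDITION_CHANGE": 1})
file_b_changes = Counter({"STATEMENT_INSERT": 4, "METHOD_RENAMING": 1, "DOC_UPDATE": 3})

print(cosine_similarity(file_a_changes, file_b_changes))
```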

High-performance computing for SARS-CoV-2 RNAs clustering: a data science-based genomics approach

  • Oujja, Anas;Abid, Mohamed Riduan;Boumhidi, Jaouad;Bourhnane, Safae;Mourhir, Asmaa;Merchant, Fatima;Benhaddou, Driss
    • Genomics & Informatics
    • /
    • v.19 no.4
    • /
    • pp.49.1-49.11
    • /
    • 2021
  • Nowadays, genomic data constitute one of the fastest-growing datasets in the world. By 2025, genomics is expected to become the fourth largest source of Big Data, thus mandating adequate high-performance computing (HPC) platforms for processing. With the latest unprecedented and unpredictable mutations in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the research community is in crucial need of ICT tools to process SARS-CoV-2 RNA data, e.g., by grouping it (i.e., clustering) and thus assisting in tracking virus mutations and predicting future ones. In this paper, we present an HPC-based SARS-CoV-2 RNA clustering tool. We adopt a data science approach, from data collection, through analysis, to visualization. In the analysis step, we show how our clustering approach leverages HPC and the longest common subsequence (LCS) algorithm. The approach uses the Hadoop MapReduce programming paradigm and adapts the LCS algorithm to efficiently compute the length of the LCS for each pair of SARS-CoV-2 RNA sequences, extracted from the U.S. National Center for Biotechnology Information (NCBI) Virus repository. The computed LCS lengths are used to measure the dissimilarities between RNA sequences in order to work out existing clusters. In addition, we present a comparative study of LCS algorithm performance under variable workloads and different numbers of Hadoop worker nodes.
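
  A compact sketch of the per-pair kernel, the LCS-length dynamic program that the paper distributes with Hadoop MapReduce, is shown below; the MapReduce wiring itself and the paper's specific adaptation are omitted:

```python
def lcs_length(seq_a: str, seq_b: str) -> int:
    """Length of the longest common subsequence in O(len(a)*len(b)) time,
    keeping only a rolling row of the DP table to bound memory."""
    if len(seq_a) < len(seq_b):
        seq_a, seq_b = seq_b, seq_a
    prev = [0] * (len(seq_b) + 1)
    for ch_a in seq_a:
        curr = [0]
        for j, ch_b in enumerate(seq_b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

# Toy RNA fragments; real SARS-CoV-2 sequences run ~30,000 bases, which is
# why the pairwise computation is distributed across Hadoop worker nodes.
print(lcs_length("AGGUAGC", "AGCUAGG"))  # -> 5
```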

Adding AGC Case Studies to the Educator's Tool Chest

  • Schaufelberger, John;Rybkowski, Zofia K.;Clevenger, Caroline
    • International conference on construction engineering and project management, 2022.06a, pp.1226-1236, 2022
  • Because students majoring in construction-related fields must develop a broad repository of knowledge and skills, effective transfer of these is the primary focus of most academic programs. While inculcation of this body of knowledge is certainly critical, actual construction projects are complicated ventures that involve levels of risk and uncertainty, such as resistant neighboring communities, unforeseen weather conditions, escalating material costs, labor shortages and strikes, accidents on jobsites, challenges with emerging forms of technology, etc. Learning how to develop a level of discernment about potential ways to handle such uncertainty often takes years of costly trial and error in the proverbial "school of hard knocks." There is therefore a need to proactively expedite the development of a sharpened intuition for such decisions. The AGC Education and Research Foundation case study committee was formed to address this need. Since its inception in 2011, 14 freely downloadable case studies have been jointly developed by academics and industry practitioners to help educators elicit varied responses from students about potential ways to respond when facing an actual project dilemma. AGC case studies are typically designed to focus on a particular concern; topics have thus far included ethics, site logistics planning, financial management, prefabrication and modularization, safety, lean practices, preconstruction planning, subcontractor management, collaborative teamwork, sustainable construction, mobile technology, and building information modeling (BIM). This session will include an overview of the history and intent of the AGC case study program, as well as lively interactive demonstrations and discussions of how case studies can be used both by educators within a typical academic setting and by industry practitioners seeking a novel tool for their in-house training programs.

Wine Quality Prediction by Using Backward Elimination Based on XGBoosting Algorithm

  • Umer Zukaib;Mir Hassan;Tariq Khan;Shoaib Ali
    • International Journal of Computer Science & Network Security, v.24 no.2, pp.31-42, 2024
  • Many industries rely on quality certification to promote their products or brands, yet obtaining quality certification, especially from human experts, is a difficult job. Machine learning plays a vital role in many aspects of life; with respect to quality certification, it has many applications in assigning and assessing quality certifications for different products at a macro level. Like other products, wine comes in different brands, and machine learning can play an important role in ensuring its quality. In this research, we use two datasets that are publicly available in the UC Irvine Machine Learning Repository to predict wine quality: a red wine dataset with 1,599 records and a white wine dataset with 4,898 records. The study was twofold. First, we used a technique called backward elimination to determine the dependency of the dependent variable on the independent variables and to identify which independent variables have the greatest influence on wine quality. Second, we used the robust machine learning algorithm XGBoost for efficient prediction of wine quality. We evaluated our model using the error measures root mean square error, mean absolute error, R2, and mean square error, and compared the results generated by XGBoost with other state-of-the-art machine learning techniques; the experimental results showed that XGBoost outperformed them.
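
  A minimal sketch of the XGBoost half of the study might look as follows; the UCI download URL is the repository's published location for the red wine dataset, while the hyperparameters, train/test split, and regression framing are our assumptions rather than the paper's reported setup:

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Red wine quality data from the UCI repository (semicolon-separated CSV).
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
df = pd.read_csv(url, sep=";")
X, y = df.drop(columns="quality"), df["quality"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = xgb.XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
pred = model.predict(X_test)

mse = mean_squared_error(y_test, pred)
print("MAE :", mean_absolute_error(y_test, pred))
print("MSE :", mse)
print("RMSE:", mse ** 0.5)
print("R2  :", r2_score(y_test, pred))
```

  Backward elimination would sit upstream of this: repeatedly refit, drop the least significant feature, and stop when all remaining features pass the chosen threshold.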

A Decision Model for BRE Introduction (BRE 도입을 위한 의사결정 모델)

  • Ju, Jung-Eun;Koo, Sang-Hoe
    • Journal of Intelligence and Information Systems, v.11 no.3, pp.103-115, 2005
  • For today's enterprises to survive in rapidly changing business environments, it is imperative to respond to challenges with quick and successful decisions. If enterprises utilize business rules and knowledge properly and promptly when making important business decisions, they can effectively reduce the chance of failure. However, in most of today's information systems these rules and knowledge are not managed in a centralized, systematic manner: they are dispersed across the enterprise's information systems and sometimes reside only in the heads or memos of employees. A BRE (Business Rule Engine) is a solution that systematically and centrally manages the business knowledge and rules of an enterprise. With a BRE, any business user can store, edit, retrieve, and utilize business rules and knowledge in a centralized repository without IT development skills, and enterprises can improve business intelligence and attain strategic advantages over competitors. However, since there are no clear criteria for BRE introduction, it is not easy to decide whether to introduce an expensive BRE solution. In this research we propose a decision model for BRE introduction, with which business analysts considering a BRE can readily make that decision.
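
  As a toy illustration of the core idea, business rules held as editable data in a central repository rather than hard-coded in applications, the following sketch is purely illustrative and does not correspond to any specific commercial BRE:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]  # evaluated against a dictionary of facts
    action: str

class RuleRepository:
    """Central store: rules live as data, editable without redeploying code."""
    def __init__(self):
        self._rules: list[Rule] = []

    def add(self, rule: Rule) -> None:
        self._rules.append(rule)

    def evaluate(self, facts: dict) -> list[str]:
        return [r.action for r in self._rules if r.condition(facts)]

repo = RuleRepository()
repo.add(Rule("vip_discount", lambda f: f.get("annual_spend", 0) > 10_000, "apply 10% discount"))
repo.add(Rule("credit_hold", lambda f: f.get("overdue_days", 0) > 30, "hold new orders"))

print(repo.evaluate({"annual_spend": 12_000, "overdue_days": 5}))  # -> ['apply 10% discount']
```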

Empirical Research on Search model of Web Service Repository (웹서비스 저장소의 검색기법에 관한 실증적 연구)

  • Hwang, You-Sub
    • Journal of Intelligence and Information Systems, v.16 no.4, pp.173-193, 2010
  • The World Wide Web is transitioning from being a mere collection of documents that contain useful information toward providing a collection of services that perform useful tasks. The emerging Web service technology has been envisioned as the next technological wave and is expected to play an important role in this recent transformation of the Web. By providing interoperable interface standards for application-to-application communication, Web services can be combined with component-based software development to promote application interaction and integration within and across enterprises. To make Web services for service-oriented computing operational, it is important that Web services repositories not only be well-structured but also provide efficient tools for an environment supporting reusable software components for both service providers and consumers. As the potential of Web services for service-oriented computing is becoming widely recognized, the demand for an integrated framework that facilitates service discovery and publishing is concomitantly growing. In our research, we propose a framework that facilitates Web service discovery and publishing by combining clustering techniques and leveraging the semantics of the XML-based service specification in WSDL files. We believe that this is one of the first attempts at applying unsupervised artificial neural network-based machine-learning techniques in the Web service domain. We have developed a Web service discovery tool based on the proposed approach using an unsupervised artificial neural network and empirically evaluated the proposed approach and tool using real Web service descriptions drawn from operational Web services repositories. We believe that both service providers and consumers in a service-oriented computing environment can benefit from our Web service discovery approach.
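
  The paper clusters services using an unsupervised artificial neural network over the semantics of WSDL files; as a rough, hedged stand-in (the TF-IDF vectorization, the toy term streams, and the use of k-means in place of the paper's neural approach are all our assumptions), term-based clustering of service descriptions might look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy stand-ins for term streams extracted from WSDL files (operation names,
# message parts, documentation); real input would come from parsed WSDL XML.
wsdl_docs = [
    "get stock quote symbol price ticker",
    "retrieve share price market ticker quote",
    "send sms message phone text notify",
    "deliver text message mobile notification",
]

vectors = TfidfVectorizer().fit_transform(wsdl_docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # services with similar vocabularies share a cluster id
```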

Study on the Openness of International Academic Papers by Researchers in Library and Information Science Using POI (Practical Openness Index) (POI(Practical Openness Index)를 활용한 문헌정보학 연구자 국제학술논문의 개방성 연구)

  • Cho, Jane
    • Journal of Korean Library and Information Science Society, v.52 no.2, pp.25-44, 2021
  • In a situation where OA papers are increasing, the POI, which indexes how open the research activities of individual researchers are, is drawing attention. This study investigated the OA status and OA methods of papers published in international academic journals by domestic LIS researchers, and derived each researcher's POI from them. In addition, by examining the relationship between the POI and a researcher's publication volume, research subfield, and foreign co-authors, it analyzed whether these factors are related to the researcher's POI. As a result, first, there were 492 papers by 82 researchers whose OA status and method could be identified through Unpaywall. Second, only 20.7% of the papers published in international journals were open access, almost all via the gold and green routes. Third, many of the OA papers were text-mining papers in medical journals, and papers opened via the green route were deposited in the institutional repositories of foreign co-authors or in transnational subject repositories such as PMC. Fourth, the POI was relatively higher for researchers in fields such as informetrics and machine learning than in other fields. In addition, the presence or absence of overseas co-authors was found to be related to OA.
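
  The abstract does not give the POI formula; on the natural reading, the share of a researcher's papers that are open access, a minimal sketch would be:

```python
def practical_openness_index(n_oa_papers: int, n_total_papers: int) -> float:
    """Hedged reading of POI: the fraction of a researcher's papers that are OA."""
    return n_oa_papers / n_total_papers if n_total_papers else 0.0

# Illustrative researcher record; the numbers are made up.
print(practical_openness_index(3, 12))  # -> 0.25
```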

Analysis of Overseas Data Management Systems for High Level Radioactive Waste Disposal (고준위방사성폐기물 처분 관련 자료 관리 해외사례 분석)

  • MinJeong Kim;SunJu Park;HyeRim Kim;WoonSang Yoon;JungHoon Park;JeongHwan Lee
    • The Journal of Engineering Geology, v.33 no.2, pp.323-334, 2023
  • The vast volumes of data that are generated during site characterization and associated research for the disposal of high-level radioactive waste require effective data management to properly chronicle and archive this information. The Swedish Nuclear Fuel and Waste Management Company, SKB, established the SICADA database for site selection, evaluation, analysis, and modeling. The German Federal Company for Radioactive Waste Disposal, BGE, established ArbeitsDB, a database and document management system, and the ELO data system to manage data collected according to the Repository Site Selection Act. The U.K. Nuclear Waste Services established the Data Management System to manage any research and survey data pertaining to nuclear waste storage and disposal. The U.S. Department of Energy and Office of Civilian Radioactive Waste Management established the Technical Data Management System for data management and subsequent licensing procedures during site characterization surveys. The presented cases undertaken by these national agencies highlight the importance of data quality management and the scalability of data utilization to ensure effective data management. Korea should also pursue the establishment of both a data management concept for radioactive waste disposal that considers data quality management and scalability from a long-term perspective and an associated data management system.

The Effect of Meta-Features of Multiclass Datasets on the Performance of Classification Algorithms (다중 클래스 데이터셋의 메타특징이 판별 알고리즘의 성능에 미치는 영향 연구)

  • Kim, Jeonghun;Kim, Min Yong;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems, v.26 no.1, pp.23-45, 2020
  • Big data is being created in a wide variety of fields such as medical care, manufacturing, logistics, sales, and SNS, and dataset characteristics are equally diverse. To secure competitiveness, companies need to improve their decision-making capacity using classification algorithms, yet most do not have sufficient knowledge of which classification algorithm is appropriate for a specific problem area. In other words, determining which classification algorithm suits the characteristics of a dataset has been a task requiring expertise and effort, because the relationship between dataset characteristics (called meta-features) and the performance of classification algorithms has not been fully understood. Moreover, there has been little research on meta-features reflecting the characteristics of multi-class data. Therefore, the purpose of this study is to empirically analyze whether meta-features of multi-class datasets have a significant effect on the performance of classification algorithms. Meta-features of multi-class datasets were grouped into two factors (data structure and data complexity), and seven representative meta-features were selected. Among those, we included the Herfindahl-Hirschman Index (HHI), originally a market-concentration index, to replace IR (Imbalanced Ratio), and we developed a new index, the Reverse ReLU Silhouette Score, for the meta-feature set. From the UCI Machine Learning Repository, six representative datasets (Balance Scale, PageBlocks, Car Evaluation, User Knowledge-Modeling, Wine Quality (red), Contraceptive Method Choice) were selected. Each dataset was classified using the algorithms selected for the study (KNN, Logistic Regression, Naïve Bayes, Random Forest, and SVM) under 10-fold cross validation; oversampling from 10% to 100% was applied to each fold and the meta-features of the dataset were measured. The selected meta-features are HHI, Number of Classes, Number of Features, Entropy, Reverse ReLU Silhouette Score, Nonlinearity of Linear Classifier, and Hub Score; F1-score was the dependent variable. The results showed that six meta-features, including the Reverse ReLU Silhouette Score and HHI proposed in this study, have a significant effect on classification performance: (1) the HHI meta-feature proposed in this study was significant for classification performance; (2) the number of variables has a significant positive effect on classification performance, unlike the number of classes; (3) the number of classes has a negative effect on classification performance; (4) entropy has a significant effect on classification performance; (5) the Reverse ReLU Silhouette Score also significantly affects classification performance at the 0.01 significance level; and (6) the nonlinearity of linear classifiers has a significant negative effect on classification performance. The per-algorithm analyses were consistent with these results, except that in the regression analysis the number of variables was not significant for the Naïve Bayes algorithm, unlike the other algorithms.
This study makes two theoretical contributions: (1) two new meta-features (HHI and the Reverse ReLU Silhouette Score) were shown to be significant; (2) the effects of data characteristics on classification performance were investigated using meta-features. Practically, (1) the findings can be utilized in developing a system that recommends classification algorithms according to dataset characteristics; (2) because data characteristics differ, many data scientists search for the optimal algorithm by repeatedly adjusting algorithm parameters, a process that wastes hardware, cost, time, and manpower, and which this study can help shorten. This study is expected to be useful for machine learning and data mining researchers, practitioners, and developers of machine-learning-based systems. The study consists of an introduction, related research, the research model, experiments, and a conclusion and discussion.
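
  As a worked illustration of two of the meta-features named above, the sketch below computes HHI over class shares and Shannon entropy of the class distribution; treating class proportions as the "market shares" in HHI is our reading of the abstract, and the authors' Reverse ReLU Silhouette Score is their own construction and is not reproduced here:

```python
import numpy as np

def hhi(labels) -> float:
    """Herfindahl-Hirschman Index over class shares: the sum of squared class
    proportions. Equals 1/k for a perfectly balanced k-class set, 1.0 for one class."""
    _, counts = np.unique(labels, return_counts=True)
    shares = counts / counts.sum()
    return float((shares ** 2).sum())

def class_entropy(labels) -> float:
    """Shannon entropy of the class distribution (another selected meta-feature)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

y = ["a"] * 70 + ["b"] * 20 + ["c"] * 10
print(hhi(y), class_entropy(y))  # imbalanced 3-class set: HHI 0.54, entropy ~1.16
```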

Development of Cyber R&D Platform on Total System Performance Assessment for a Potential HLW Repository ; Application for Development of Scenario through QA Procedures (고준위 방사성폐기물 처분 종합 성능 평가 (TSPA)를 위한 Cyber R&D Platform 개발 ; 시나리오 도출 과정에서의 품질보증 적용 사례)

  • Seo Eun-Jin;Hwang Yong-soo;Kang Chul-Hyung
    • Proceedings of the Korean Radioactive Waste Society Conference, 2005.06a, pp.311-318, 2005
  • Transparency of the Total System Performance Assessment (TSPA) is the key issue in enhancing public acceptance of a permanent high-level radioactive waste repository. To ensure this, all TSPA work must be performed under Quality Assurance. The integrated Cyber R&D Platform was developed by KAERI using the T2R3 principles applicable to the five major steps of R&D. The proposed system is implemented as a web-based system so that all participants in the TSPA can access it. It is composed of FEAS (FEp to Assessment through Scenario development), which presents the systematic flow from FEPs to assessment methods as a flow chart; PAID (Performance Assessment Input Databases), which presents the PA (Performance Assessment) input data sets on the web; and a QA system recording those data. All information is integrated into the Cyber R&D Platform so that every item in the system can be checked whenever necessary. To make the system more user-friendly, an upgrade including an input data and documentation package is under development. In the next R&D phase, the Cyber R&D Platform will be connected with the TSPA assessment tool, so that all information is expected to be searchable in one unified system.
