• Title/Summary/Keyword: Data Scientists

An Efficient Large Graph Clustering Technique based on Min-Hash (Min-Hash를 이용한 효율적인 대용량 그래프 클러스터링 기법)

  • Lee, Seok-Joo;Min, Jun-Ki
    • Journal of KIISE
    • /
    • v.43 no.3
    • /
    • pp.380-388
    • /
    • 2016
  • Graph clustering is widely used to analyze a graph and identify its properties by generating clusters of similar vertices. Recently, large graph data has been generated in diverse applications such as Social Network Services (SNS), the World Wide Web (WWW), and telephone networks. Consequently, graph clustering algorithms that process large graph data efficiently have become increasingly important. In this paper, we propose a clustering algorithm that generates clusters for large graph data efficiently. The proposed algorithm estimates similarities between clusters in graph data using Min-Hash and constructs clusters according to the estimated similarities. In experiments with real-world data sets, we demonstrate the efficiency of the proposed algorithm by comparing it with existing algorithms.
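To make the Min-Hash idea in this abstract concrete, the following is a minimal sketch of how Min-Hash signatures can estimate set similarity. It is not the authors' algorithm: the abstract does not say which similarity measure or hash family is used, so Jaccard similarity over sets of integer vertex ids and a linear hash family `(a*x + b) mod p` are assumptions, and all function names are hypothetical.

```python
import random

PRIME = 2_147_483_647  # 2^31 - 1; item ids are assumed to lie in [0, PRIME)

def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]

def minhash_signature(items, params):
    """Signature = minimum hash value of the (non-empty) set under each hash."""
    return [min((a * x + b) % PRIME for x in items) for a, b in params]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)

if __name__ == "__main__":
    # Hypothetical neighbor sets of two vertices (or members of two clusters).
    params = make_hash_params(128)
    u = {1, 2, 3, 4, 5, 6}
    v = {4, 5, 6, 7, 8, 9}
    print("exact:   ", exact_jaccard(u, v))
    print("estimate:", estimate_jaccard(minhash_signature(u, params),
                                        minhash_signature(v, params)))
```

The appeal for large graphs is that signatures have a fixed length, so two large sets can be compared in time proportional to the number of hash functions rather than to the set sizes.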

Mutational Data Loading Routines for Human Genome Databases: the BRCA1 Case

  • Van Der Kroon, Matthijs;Ramirez, Ignacio Lereu;Levin, Ana M.;Pastor, Oscar;Brinkkemper, Sjaak
    • Journal of Computing Science and Engineering
    • /
    • v.4 no.4
    • /
    • pp.291-312
    • /
    • 2010
  • Over the last decades, a large amount of research has been done in the genomics domain, which has generated and is still generating terabytes, if not exabytes, of information stored globally in a very fragmented way. Different databases store the same data in different ways, resulting in undesired redundancy and restricted information transfer. In addition, keeping the existing databases consistent and maintaining data integrity is mainly left to human intervention, which is costly in both time and money and is also error-prone. Identifying a fixed conceptual dictionary in the form of a conceptual model thus seems crucial. This paper presents an effort to integrate the mutational data from the established genomic data source HGMD into a conceptual-model-driven database, HGDB, thereby providing useful lessons for improving the existing conceptual model of the human genome.

Equivalence Heuristics for Malleability-Aware Skylines

  • Lofi, Christoph;Balke, Wolf-Tilo;Guntzer, Ulrich
    • Journal of Computing Science and Engineering
    • /
    • v.6 no.3
    • /
    • pp.207-218
    • /
    • 2012
  • In recent years, the skyline query paradigm has been established as a reliable method for database query personalization. While early efficiency problems have been solved by sophisticated algorithms and advanced indexing, new challenges in skyline retrieval effectiveness continuously arise. In particular, the rise of the Semantic Web and linked open data leads to personalization issues where skyline queries cannot be applied easily. We addressed the special challenges presented by linked open data in previous work, and now extend this work with a heuristic workflow to boost efficiency. This is necessary because the new view on linked open data dominance has serious implications for the efficiency of the actual skyline computation, since transitivity of the dominance relationships is no longer guaranteed. Our contributions in this paper can be summarized as follows: we present an intuitive skyline query paradigm to deal with linked open data; we provide an effective dominance definition and establish its theoretical properties; we develop innovative skyline algorithms to deal with the resulting challenges; and we design efficient heuristics for the case of predicate equivalences, which may occur often in linked open data. We extensively evaluate our new algorithms with respect to performance and the enriched skyline semantics.
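As background for the skyline terminology in this abstract, here is a minimal sketch of a classic skyline query under standard Pareto dominance. It does not implement the paper's malleability-aware dominance, which the abstract does not define; the tuple representation, the "larger is better" convention, and the function names are assumptions made for illustration.

```python
def dominates(a, b):
    """a dominates b if a is at least as good in every dimension
    and strictly better in at least one (larger values are better here)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def skyline(points):
    """Naive all-pairs skyline: keep every point that no other point dominates.

    Every point is compared against every other point, so this version does
    not rely on dominance being transitive. Faster algorithms typically prune
    using transitivity, which is the property the abstract says no longer holds."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

if __name__ == "__main__":
    # Hypothetical two-attribute items, e.g. (quality, cheapness).
    items = [(1, 5), (3, 3), (5, 1), (2, 2), (1, 1)]
    print(skyline(items))
```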

Visualization and Analysis of Public Bicycle Rental Data in Daejeon(Tashu) (대전시 공공 자전거(타슈) 공개 데이터 시각화 및 분석)

  • Mun, Hyunsu;Lee, Youngseok
    • KIISE Transactions on Computing Practices
    • /
    • v.22 no.6
    • /
    • pp.253-267
    • /
    • 2016
  • Major cities around the world operate public bicycle rental systems to compensate for the shortcomings of existing public transport. The disclosure of rental history data in Daejeon has opened new analytical possibilities. In this paper, we propose a method to analyze the data using visualization. We found positional features of the stations according to bicycle usage. In addition, we examined bicycle usage patterns by time, day, and month. Furthermore, the usage patterns between bicycle stations were identified through a path analysis, and specific usage purposes were identified through a destination ratio analysis for each station. Based on these data, we suggest a direction for the development of the Daejeon public bicycle rental system.

An Efficient Cache Coherence Protocol for Multi-Core Processors with Ring Interconnects (링 연결구조 기반의 멀티코어 프로세서를 위한 캐시 일관성 유지 기법)

  • Park, Jin-Young;Choi, Lynn
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.14 no.8
    • /
    • pp.768-772
    • /
    • 2008
  • Today's microprocessors normally include several processing cores to reduce energy consumption without losing performance. In this paper, we show that a data transfer ordering mechanism can be used efficiently as a cache coherence solution for a unidirectional ring interconnect. RING-DATA ORDER combines the simplicity of GREEDY-ORDER and the performance of RING-ORDER. RING-DATA ORDER can be easily applied to multicore processors with a unidirectional ring interconnect.

A Data-Consistency Scheme for the Distributed-Cache Storage of the Memcached System

  • Liao, Jianwei;Peng, Xiaoning
    • Journal of Computing Science and Engineering
    • /
    • v.11 no.3
    • /
    • pp.92-99
    • /
    • 2017
  • Memcached, commonly used to speed up data access in big-data and Internet-web applications, is system software that implements a distributed-cache mechanism. However, it faces the severe challenge of losing recent uncommitted updates when Memcached servers crash. Although the replica scheme and the disk-log-based replay mechanism have been proposed to overcome this problem, they incur either the overhead of replica synchronization or the persistent-storage overhead caused by flushing the related logs. This paper proposes a scheme that backs up the write requests (i.e., set and add) on the Memcached client side, to reduce the overhead resulting from making disk-log records or maintaining replica consistency. If the Memcached server fails, a timestamp-based recovery mechanism replays the write requests (buffered by the relevant clients) to regain the lost data updates on the rebooted Memcached server, thereby meeting the data-consistency requirement. More importantly, compared with the mechanism of logging the write requests to the persistent storage of the master server and with the server-replication scheme, the newly proposed approach of backing up the logs on the client side can greatly decrease the time overhead, by up to 116.8%, when processing write workloads.
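The client-side backup and timestamp-based replay described in this abstract can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' implementation: `FakeServer` is a hypothetical in-memory stand-in rather than a real Memcached client or server, the shared logical clock stands in for whatever timestamps the paper uses, and log truncation is not modeled.

```python
import itertools

class FakeServer:
    """Hypothetical in-memory stand-in for a cache server (not real Memcached)."""
    def __init__(self):
        self.store = {}

    def set(self, key, value):
        self.store[key] = value          # unconditional write
        return True

    def add(self, key, value):
        if key in self.store:            # add only succeeds for a new key
            return False
        self.store[key] = value
        return True

    def crash_and_reboot(self):
        self.store = {}                  # all cached data is lost

class LoggingClient:
    """Client that buffers every write request together with a timestamp."""
    _clock = itertools.count()           # assumption: one shared logical clock

    def __init__(self, server):
        self.server = server
        self.log = []                    # entries: (timestamp, op, key, value)

    def _write(self, op, key, value):
        ts = next(LoggingClient._clock)
        self.log.append((ts, op, key, value))
        return getattr(self.server, op)(key, value)

    def set(self, key, value):
        return self._write("set", key, value)

    def add(self, key, value):
        return self._write("add", key, value)

def recover(server, clients):
    """Replay all buffered writes on the rebooted server in timestamp order."""
    entries = sorted(entry for client in clients for entry in client.log)
    for _ts, op, key, value in entries:
        getattr(server, op)(key, value)

if __name__ == "__main__":
    server = FakeServer()
    c1, c2 = LoggingClient(server), LoggingClient(server)
    c1.set("user:1", "alice")
    c2.add("user:1", "bob")              # rejected: key already exists
    c2.set("user:2", "carol")
    server.crash_and_reboot()
    recover(server, [c1, c2])
    print(server.store)
```

Replaying in timestamp order matters because `add` depends on whether the key already exists: merging the per-client logs in their original order reproduces the same accept or reject decisions as before the crash.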

A Multivariate Decision Tree using Support Vector Machines (지지 벡터 머신을 이용한 다변수 결정 트리)

  • Kang, Sung-Gu;Lee, B.W.;Na, Y.C.;Jo, H.S.;Yoon, C.M.;Yang, Ji-Hoon
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2006.10b
    • /
    • pp.278-283
    • /
    • 2006
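  • Decision trees have a large hypothesis space and can therefore deliver flexible and robust performance. However, decision trees tend to overfit the training data. Several pruning algorithms have been developed to eliminate this tendency. However, these methods cannot solve the problem of wasted space that arises when the data are not parallel to the attribute axes. This paper therefore presents a method that resolves this problem using linear classifiers with multivariate nodes, and introduces support vector machines to improve the performance of the decision tree (SVMDT). The proposed algorithm consists of three parts: first, a part that selects the attributes to be used at each node; second, a version of ID3 adapted for this purpose; and finally, a basic pruning algorithm. In a comparison with OC1, C4.5, and SVM on UCI data sets, SVMDT showed improved results.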

Read-only Transaction Processing in Wireless Data Broadcast Environments (무선 데이타 방송 환경에서 읽기-전용 트랜잭션 처리 기법)

  • Lee, Sang-Geun;Kim, Seong-Seok;Hwang, Jong-Seon
    • Journal of KIISE:Databases
    • /
    • v.29 no.5
    • /
    • pp.404-415
    • /
    • 2002
  • In this paper, we address the issue of ensuring the consistency of multiple data items requested in a certain order by read-only transactions in a wireless data broadcast environment. To handle the inherent property of a data broadcast environment that users can only access data strictly sequentially, we explore a predeclaration-based query optimization and devise two practical transaction processing methods in the context of local caching. We also evaluate the performance of the proposed methods through an analytical study. Evaluation results show that the predeclaration technique we introduce reduces response time significantly and adapts to dynamic changes in workload.

Rapid Data Allocation Technique for Multiple Memory Bank Architectures (다중 메모리 뱅크 구조를 위한 고속의 자료 할당 기법)

  • 조정훈;백윤홍;최준식
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2003.10a
    • /
    • pp.196-198
    • /
    • 2003
  • Virtually every digital signal processor (DSP) supports on-chip multi-memory banks that allow the processor to access multiple words of data from memory in a single instruction cycle. Also, all existing fixed-point DSPs have irregular architectures with heterogeneous registers, containing multiple register files that are distributed and dedicated to different sets of instructions. Although several studies have been conducted on efficiently assigning data to multi-memory banks, most of them assumed processors with relatively simple, homogeneous general-purpose registers. As a result, several vendor-provided compilers for DSPs were unable to efficiently assign data to multiple data memory banks, thereby often failing to generate highly optimized code for their machines. This paper presents an algorithm that helps the compiler efficiently assign data to multi-memory banks. Our algorithm differs from previous work in that it assigns variables to memory banks in separate, decoupled code generation phases, instead of a single, tightly coupled phase. The experimental results show that our decoupled algorithm greatly simplifies the code generation process; thus our compiler runs extremely fast, yet generates target code that is comparable in quality to the code generated by a coupled approach.

Automatic UML-based Test Data Generating Tool: AUTEG (UML기반의 테스트 데이타 자동생성 도구 : AUTEG)

  • Kim, Cheong-Ah;Choi, Byoung-Ju
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.8 no.3
    • /
    • pp.268-276
    • /
    • 2002
  • In this paper, we suggest a method for automatically producing test data using UML development diagrams, and analytically describe the application of a tool, Automatic UML-based Test Data Generation (AUTEG), developed using XML technology, to an example insurance system. AUTEG automatically generates test diagrams that make it possible to detect errors at the interfaces between the modules composing the whole system, along with test data obtained by applying an existing white-box test technique to the test diagrams. AUTEG can be applied to integration testing as well as system testing, and with the tool, users may organize the unit modules of the integration test into several groups.