• Title/Summary/Keyword: Data Sequence

A Protein Sequence Prediction Method by Mining Sequence Data (서열 데이타마이닝을 통한 단백질 서열 예측기법)

  • Cho, Sun-I;Lee, Do-Heon;Cho, Kwang-Hwi;Won, Yong-Gwan;Kim, Byoung-Ki
    • The KIPS Transactions:PartD / v.10D no.2 / pp.261-266 / 2003
  • A protein, which is a linear polymer of amino acids, is one of the most important biomolecules composing biological structures and regulating biochemical reactions. Since the characteristics and functions of proteins are determined in principle by their amino acid sequences, protein sequence determination is the starting point of protein function study. This paper proposes a protein sequence prediction method based on data mining techniques, which can overcome the limitations of previous biochemical sequencing methods. After applying multiple proteases to acquire overlapping protein fragments, we identify candidate fragment sequences by comparing fragment mass values with peptide databases. We propose a method that constructs a multi-partite graph and searches for maximal paths to determine the protein sequence by assembling the proper candidate sequences. In addition, experimental results based on the SWISS-PROT database are presented to show the validity of the proposed method.

Physical Database Design for DFT-Based Multidimensional Indexes in Time-Series Databases (시계열 데이터베이스에서 DFT-기반 다차원 인덱스를 위한 물리적 데이터베이스 설계)

  • Kim, Sang-Wook;Kim, Jin-Ho;Han, Byung-Il
    • Journal of Korea Multimedia Society / v.7 no.11 / pp.1505-1514 / 2004
  • Sequence matching in time-series databases is an operation that finds the data sequences whose changing patterns are similar to that of a query sequence. Typically, sequence matching employs a multi-dimensional index for efficient processing. To alleviate the dimensionality curse of the multi-dimensional index in high-dimensional cases, previous methods for sequence matching apply the Discrete Fourier Transform (DFT) to data sequences and take only the first two or three DFT coefficients as organizing attributes of the multi-dimensional index. This paper first points out the problems with such simple methods that take the first two or three coefficients, and proposes a novel solution for constructing the optimal multi-dimensional index. The proposed method analyzes the characteristics of a target database and, based on the analysis, identifies the organizing attributes with the best discrimination power. It also determines the optimal number of organizing attributes for efficient sequence matching by using a cost model. To show the effectiveness of the proposed method, we perform a series of experiments. The results show that the proposed method significantly outperforms the previous ones.
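
The baseline that this abstract criticizes (keeping only the first few DFT coefficients as index attributes) can be sketched as follows. This is a minimal illustration, not the paper's method: the function names, the sample sequences, and the choice of k = 3 are assumptions made for the example. It uses a unitary DFT so that, by Parseval's theorem, the distance computed on the truncated coefficients never exceeds the true Euclidean distance, which is what makes such an index safe for filtering.

```python
import cmath
import math

def dft(seq):
    """Unitary DFT: scaling by 1/sqrt(n) preserves Euclidean distances."""
    n = len(seq)
    return [
        sum(x * cmath.exp(-2j * math.pi * f * t / n) for t, x in enumerate(seq))
        / math.sqrt(n)
        for f in range(n)
    ]

def features(seq, k):
    """Keep only the first k DFT coefficients (hypothetical index attributes)."""
    return dft(seq)[:k]

def dist(a, b):
    """Euclidean distance; works for real or complex vectors."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))

# Illustrative sequences, not taken from the paper.
query = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
data = [1.5, 2.5, 2.5, 4.5, 2.0, 2.0, 0.5, 0.5]

true_d = dist(query, data)                             # distance in the time domain
feat_d = dist(features(query, 3), features(data, 3))   # distance in the index space
```

Because `feat_d` is a lower bound on `true_d`, a range query on the index can discard a sequence whenever `feat_d` already exceeds the tolerance, without producing false dismissals. Which coefficients to keep, and how many, is exactly the choice the paper proposes to optimize.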

Parallelization of Genome Sequence Data Pre-Processing on Big Data and HPC Framework (빅데이터 및 고성능컴퓨팅 프레임워크를 활용한 유전체 데이터 전처리 과정의 병렬화)

  • Byun, Eun-Kyu;Kwak, Jae-Hyuck;Mun, Jihyeob
    • KIPS Transactions on Computer and Communication Systems / v.8 no.10 / pp.231-238 / 2019
  • Analyzing next-generation genome sequencing data in the conventional way on a single server may take several tens of hours, depending on the data size. However, to cope with emergency situations where the results must be known within a few hours, the performance of a single genome analysis must be improved. In this paper, we propose a parallelized method for pre-processing genome sequence data that reduces the analysis time by utilizing big data technology and a high-performance computing cluster connected to a high-speed network and sharing a parallel file system. For the reliability of the analytical data, we chose a strategy of parallelizing the existing analytical tools and algorithms in the new environment. Parallelized processing, data distribution, and parallel merging techniques were developed, and the performance improvements were confirmed through experiments.

A Comparison of Three Fixed-Length Sequence Generators of Synthetic Self-Similar Network Traffic (Synthetic Self-Similar 네트워크 Traffic의 세 가지 고정길이 Sequence 생성기에 대한 비교)

  • Jeong, Hae-Duck J.;Lee, Jong-Suk R.
    • The KIPS Transactions:PartC / v.10C no.7 / pp.899-914 / 2003
  • It is generally accepted that self-similar (or fractal) processes may provide better models for teletraffic in modern telecommunication networks than Poisson processes. If this is not taken into account, it can lead to inaccurate conclusions about the performance of telecommunication networks. Thus, an important requirement for conducting simulation studies of telecommunication networks is the ability to generate long synthetic stochastic self-similar sequences. Three generators of pseudo-random self-similar sequences, based on the FFT〔20〕, RMD〔12〕 and SRA methods〔5, 10〕, are compared and analysed in this paper. Properties of these generators were studied experimentally in terms of their statistical accuracy and the time required to produce sequences of a given (long) length. While all three generators show similar levels of accuracy of the output data (in the sense of relative accuracy of the Hurst parameter), the RMD- and SRA-based generators appear to be much faster than the generator based on FFT. Our results also show that a robust method for comparative studies of self-similarity in pseudo-random sequences is needed.
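
As a rough illustration of one of the three generator families named above, the following sketch implements the textbook random midpoint displacement (RMD) construction of an approximate fractional Brownian motion path, whose increments form an approximately self-similar sequence. It is an assumption that the paper's RMD generator follows this classical form; the function names, parameters, and seed handling are hypothetical.

```python
import random

def rmd_fbm(levels, hurst, sigma=1.0, seed=None):
    """Approximate fBm on 2**levels + 1 points via random midpoint displacement.

    hurst must lie in (0, 1); sigma is the standard deviation of the endpoint.
    """
    rng = random.Random(seed)
    n = 2 ** levels
    x = [0.0] * (n + 1)
    x[n] = rng.gauss(0.0, sigma)          # fix the two endpoints first
    step = n
    for level in range(1, levels + 1):
        half = step // 2
        # Displacement std at this level: sigma * 2^(-H*level) * sqrt(1 - 2^(2H-2))
        std = sigma * 0.5 ** (level * hurst) * (1.0 - 2.0 ** (2.0 * hurst - 2.0)) ** 0.5
        for mid in range(half, n, step):
            # Midpoint = average of its neighbours plus a Gaussian offset
            x[mid] = 0.5 * (x[mid - half] + x[mid + half]) + rng.gauss(0.0, std)
        step = half
    return x

def increments(path):
    """Differences of the path: an approximate fractional Gaussian noise sequence."""
    return [b - a for a, b in zip(path, path[1:])]

path = rmd_fbm(levels=10, hurst=0.8, seed=42)   # 1025 points
noise = increments(path)                        # 1024 increments
```

The construction needs only one Gaussian draw per output point, which is consistent with the abstract's observation that RMD-based generation is fast; the price is that the output only approximates true fractional Brownian motion.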

A Method for Mining Interval Event Association Rules from a Set of Events Having Time Property (시간 속성을 갖는 이벤트 집합에서 인터벌 연관 규칙 마이닝 기법)

  • Han, Dae-Young;Kim, Dae-In;Kim, Jae-In;Na, Chol-Su;Hwang, Bu-Hyun
    • The KIPS Transactions:PartD / v.16D no.2 / pp.185-190 / 2009
  • A sequence of events of the same type, drawn from a set of events having a time property, can be summarized as one event. But if the event sequence contains an interval, it is reasonable to summarize it as more than one sub-event sequence, each independent of the others. In this paper, we suggest a temporal data mining method that summarizes interval events based on Allen's interval algebra and finds interval event association rules from the interval events. It provides better knowledge than other methods by using the concept of an independent sub-sequence and by finding interval event association rules.
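
The two ingredients mentioned in this abstract can be sketched in a few lines: splitting a same-type event sequence into independent interval events, and classifying a pair of intervals with Allen's thirteen relations. The splitting rule used here (a hypothetical `max_gap` threshold between consecutive timestamps) is an assumption for illustration and is not necessarily the criterion used in the paper.

```python
def summarize(timestamps, max_gap):
    """Group timestamps of one event type into (start, end) intervals.

    A new interval starts whenever the gap to the previous timestamp
    exceeds max_gap (an assumed, illustrative splitting rule).
    """
    intervals = []
    for t in sorted(timestamps):
        if intervals and t - intervals[-1][1] <= max_gap:
            intervals[-1] = (intervals[-1][0], t)   # extend the current interval
        else:
            intervals.append((t, t))                # open a new interval
    return intervals

def allen_relation(a, b):
    """Allen relation of interval a to interval b; each is (start, end), start < end."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:
        return "before"
    if e2 < s1:
        return "after"
    if e1 == s2:
        return "meets"
    if e2 == s1:
        return "met-by"
    if s1 == s2 and e1 == e2:
        return "equals"
    if s1 == s2:
        return "starts" if e1 < e2 else "started-by"
    if e1 == e2:
        return "finishes" if s1 > s2 else "finished-by"
    if s2 < s1 and e1 < e2:
        return "during"
    if s1 < s2 and e2 < e1:
        return "contains"
    return "overlaps" if s1 < s2 else "overlapped-by"

# Illustrative timestamps for two event types.
rain = summarize([1, 2, 3, 10, 11, 12], max_gap=2)   # two independent intervals
flood = summarize([2, 3, 4, 5], max_gap=2)           # one interval
relation = allen_relation(rain[0], flood[0])
```

Counting how often a given relation holds between interval events of two types is one plausible basis for an interval association rule of the kind the abstract describes.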

Development of the Recommender System of Arabic Books Based on the Content Similarity

  • Alotaibi, Shaykhah Hajed;Khan, Muhammad Badruddin
    • International Journal of Computer Science & Network Security / v.22 no.8 / pp.175-186 / 2022
  • This research article develops a content-based recommendation system for Arabic books that helps users search for the right book and predicts books suited to their literary style. The system directs its users toward books that can meet their needs from a large dataset of information. It makes its predictions from data gathered from different books, which it converts to vectors using TF-IDF. Recommendation algorithms based on cosine similarity, sequence matcher similarity, and semantic similarity then aggregate the data to produce an efficient and effective recommendation. This approach is advantageous in recommending previously unrated books to users with unique interests. The results show that the cosine similarity of the full content of the books, the sequence matcher similarity of the Arabic titles, and the semantic similarity of the English titles give the best results, and that these are extremely close to the average of the human-assigned (annotated) similarity. A Flask web application with a simple interface is developed to show the recommended Arabic books using the cosine similarity, sequence matcher similarity, and semantic similarity algorithms, together with all the experiments conducted.
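
Two of the similarity measures named above can be illustrated with the Python standard library alone. This is a sketch under stated assumptions: the whitespace tokenizer, the smoothed IDF formula, and the English placeholder texts are illustrative choices, not details taken from the article, and `difflib.SequenceMatcher` is assumed to be what "sequence matcher similarity" refers to.

```python
import math
from collections import Counter
from difflib import SequenceMatcher

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Smoothed IDF variant (assumed): log((1 + n) / (1 + df)) + 1
            t: (c / len(tokens)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def title_similarity(a, b):
    """Character-level similarity ratio in [0, 1] from difflib."""
    return SequenceMatcher(None, a, b).ratio()

# Placeholder book descriptions, invented for the example.
docs = [
    "history of arabic poetry",
    "arabic poetry and its history",
    "introduction to linear algebra",
]
vecs = tfidf_vectors(docs)
close = cosine(vecs[0], vecs[1])   # shares several terms with docs[0]
far = cosine(vecs[0], vecs[2])     # shares no terms with docs[0]
```

A recommender along these lines would rank the remaining books by their similarity to the one a user selected and return the top few; real Arabic text would also need language-appropriate tokenization and normalization, which this sketch omits.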

Discrete HMM Training Algorithm for Incomplete Time Series Data (불완전 시계열 데이터를 위한 이산 HMM 학습 알고리듬)

  • Sin, Bong-Kee
    • Journal of Korea Multimedia Society / v.19 no.1 / pp.22-29 / 2016
  • The Hidden Markov Model (HMM) is one of the most successful and popular tools for modeling real-world sequential data. Real-world signals come in a variety of shapes and variabilities, among which the temporal and spectral ones are the prime targets of the HMM. A new problem that is gaining increasing attention is characterizing missing observations in incomplete data sequences. They are incomplete in that there are holes or omitted measurements. The standard HMM algorithms have been developed for complete data with a measurement at each regular point in time. This paper presents a modified algorithm for a discrete HMM that allows a substantial amount of omissions in the input sequence. Basically, it is a variant of Baum-Welch that explicitly considers the case of isolated omissions or a number of omissions in succession. The algorithm has been tested on online handwriting samples expressed in direction codes. An extensive set of experiments shows that the HMMs so modeled are highly flexible, showing consistent and robust performance regardless of the amount of omissions.
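
One common way to handle a missing observation in a discrete HMM is to marginalize it out, which amounts to using an emission probability of 1 for every state at that time step. The sketch below applies this idea to the forward pass only; it is an assumption that the paper's Baum-Welch variant builds on the same principle, and the model parameters and names here are invented for the example.

```python
def forward_likelihood(pi, A, B, obs):
    """Likelihood of an observation sequence under a discrete HMM.

    pi[i]   : initial probability of state i
    A[i][j] : transition probability from state i to state j
    B[i][o] : probability of emitting symbol o in state i
    obs     : list of symbol indices, with None marking a missing observation
    """
    n = len(pi)

    def emit(state, o):
        # A missing observation is summed out: sum_o B[state][o] == 1
        return 1.0 if o is None else B[state][o]

    alpha = [pi[i] * emit(i, obs[0]) for i in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * emit(j, o)
            for j in range(n)
        ]
    return sum(alpha)

# Toy two-state, two-symbol model (illustrative values only).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]

with_gap = forward_likelihood(pi, A, B, [0, None, 1])
filled = sum(forward_likelihood(pi, A, B, [0, s, 1]) for s in (0, 1))
```

Here `with_gap` equals `filled` up to rounding, confirming that skipping the emission term is the same as summing over every symbol the omitted measurement could have taken. The hidden state still advances through the transition matrix at the missing step, so runs of consecutive omissions are handled in the same way.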

345/154 kV Transmission Line model choice and calculation using TMLC (TMLC용 345, 154kV 송전선로 모델 작성 및 계산)

  • Choi, H.K.;Moon, Y.H.;Yoon, J.Y.;Choo, J.B.;Yun, Y.B.;Kim, Y.H.
    • Proceedings of the KIEE Conference / 2001.07a / pp.336-339 / 2001
  • Transmission line data are very important for load flow studies. Short-circuit data (positive sequence, zero sequence) of 345 kV and 154 kV lines were calculated and compared with KEPCO's line characteristics data. This paper presents a method for verifying and complementing the line data in PSS/E load flow data using the TMLC (Transmission Line Characteristics) program.

Graphical exploratory data analysis for ball games in sports

  • Yi, Seongbaek;Jang, Dae-Heung
    • Journal of the Korean Data and Information Science Society / v.27 no.5 / pp.1413-1421 / 2016
  • In this paper, graphical exploratory data analyses are proposed for ball games in sports. The plot of the sequence of points scored by each team can be used to see how the game progressed up to the end of each set or quarter. With the plot of sequential score differences through all the games, we can see the dominance of each team and the times of score changes, i.e., turnovers. The ternary plots show the contours of scoring compositions for each player and enable us to compare the scoring patterns of the teams. Using the score sequence plot, we can also see the score pattern distribution of players. For demonstration, we use the results of the gold medal matches between Russia and Brazil in men's volleyball and between USA and Spain in men's basketball at the London 2012 Summer Olympics.

X3D Based Web Visualization by Data Fusion of 3D Spatial Information and Video Sequence (3D 공간정보와 비디오 융합에 의한 X3D기반 웹 가시화)

  • Sohn, Hong-Gyoo;Kim, Seong-Sam;Yoo, Byoung-Hyun;Kim, Sang-Min
    • Journal of Korean Society for Geospatial Information Science / v.17 no.4 / pp.95-103 / 2009
  • Global interest in the construction of 3-dimensional spatial information has risen due to the development of measurement sensors and data processing technologies. In spite of criticism over the violation of personal privacy, CCTV cameras installed in outdoor public spaces of urban areas are used as a fundamental sensor for traffic management, crime prevention, and hazard monitoring. To guarantee safety in the urban environment and to prevent disasters, a surveillance system that integrates pre-constructed 3-dimensional spatial information with CCTV data or video sequences is needed for monitoring and observing emergency situations interactively in real time. In this study, we present the applicability of a prototype system for web visualization based on X3D, an international standard for real-time web visualization, that integrates 3-dimensional spatial information with video sequences.
