• Title/Summary/Keyword: Biological data mining

검색결과 68건 처리시간 0.021초

동적 로드 밸런싱을 이용한 그리드 기반의 생물학 데이터 마이닝 (Grid-based Biological Data Mining using Dynamic Load Balancing)

  • 마용범;김태영;이종식
    • 한국시뮬레이션학회논문지
    • /
    • 제19권2호
    • /
    • pp.81-89
    • /
    • 2010
  • 생물학 데이터 마이닝은 생물학 데이터의 볼륨이 급격하게 증가함에 따라 최근 주목받고 있다. 그리드 기술은 계산 자원과 데이터 공유와 활용을 가능하게 한다. 이 논문에서는 생물학 데이터 마이닝과 그리드 기술을 결합한 혼합형 시스템을 제안한다. 특히, 생물학 데이터 마이닝의 처리 효율성을 위해 결정 범위 조정 알고리즘을 사용한다. 우리는 이 알고리즘을 통해 빠르고 자동으로 신뢰할 만한 데이터 마이닝 인식률을 얻는다. 게다가 그리드 환경에서는 지리적으로 분산된 자원들을 연동하기 때문에 통신량과 자원 할당이 이슈가 된다. 우리는 동적 로드 밸런싱을 제안하고 그리드 기반 생물학 데이터 마이닝 기법에 적용한 다. 성능 평가를 위해 우리는 평균 처리 시간, 평균 통신 시간, 평균 자원 활용도를 측정한다. 측정 실험의 결과는 제안된 두 알고리즘을 적용한 우리의 기법이 처리 시간과 비용 측면에서 이점을 제공한다는 것을 보여준다.

PubMiner: Machine Learning-based Text Mining for Biomedical Information Analysis

  • Eom, Jae-Hong;Zhang, Byoung-Tak
    • Genomics & Informatics
    • /
    • 제2권2호
    • /
    • pp.99-106
    • /
    • 2004
  • In this paper we introduce PubMiner, an intelligent machine learning based text mining system for mining biological information from the literature. PubMiner employs natural language processing techniques and machine learning based data mining techniques for mining useful biological information such as protein­protein interaction from the massive literature. The system recognizes biological terms such as gene, protein, and enzymes and extracts their interactions described in the document through natural language processing. The extracted interactions are further analyzed with a set of features of each entity that were collected from the related public databases to infer more interactions from the original interactions. An inferred interaction from the interaction analysis and native interaction are provided to the user with the link of literature sources. The performance of entity and interaction extraction was tested with selected MEDLINE abstracts. The evaluation of inference proceeded using the protein interaction data of S. cerevisiae (bakers yeast) from MIPS and SGD.

Mining Maximal Frequent Contiguous Sequences in Biological Data Sequences

  • Kang, Tae-Ho;Yoo, Jae-Soo;Kim, Hak-Yong;Lee, Byoung-Yup
    • International Journal of Contents
    • /
    • 제3권2호
    • /
    • pp.18-24
    • /
    • 2007
  • Biological sequences such as DNA and amino acid sequences typically contain a large number of items. They have contiguous sequences that ordinarily consist of more than hundreds of frequent items. In biological sequences analysis(BSA), a frequent contiguous sequence search is one of the most important operations. Many studies have been done for mining sequential patterns efficiently. Most of the existing methods for mining sequential patterns are based on the Apriori algorithm. In particular, the prefixSpan algorithm is one of the most efficient sequential pattern mining schemes based on the Apriori algorithm. However, since the algorithm expands the sequential patterns from frequent patterns with length-1, it is not suitable for biological datasets with long frequent contiguous sequences. In recent years, the MacosVSpan algorithm was proposed based on the idea of the prefixSpan algorithm to significantly reduce its recursive process. However, the algorithm is still inefficient for mining frequent contiguous sequences from long biological data sequences. In this paper, we propose an efficient method to mine maximal frequent contiguous sequences in large biological data sequences by constructing the spanning tree with a fixed length. To verify the superiority of the proposed method, we perform experiments in various environments. The experiments show that the proposed method is much more efficient than MacosVSpan in terms of retrieval performance.

Q-omics: Smart Software for Assisting Oncology and Cancer Research

  • Lee, Jieun;Kim, Youngju;Jin, Seonghee;Yoo, Heeseung;Jeong, Sumin;Jeong, Euna;Yoon, Sukjoon
    • Molecules and Cells
    • /
    • 제44권11호
    • /
    • pp.843-850
    • /
    • 2021
  • The rapid increase in collateral omics and phenotypic data has enabled data-driven studies for the fast discovery of cancer targets and biomarkers. Thus, it is necessary to develop convenient tools for general oncologists and cancer scientists to carry out customized data mining without computational expertise. For this purpose, we developed innovative software that enables user-driven analyses assisted by knowledge-based smart systems. Publicly available data on mutations, gene expression, patient survival, immune score, drug screening and RNAi screening were integrated from the TCGA, GDSC, CCLE, NCI, and DepMap databases. The optimal selection of samples and other filtering options were guided by the smart function of the software for data mining and visualization on Kaplan-Meier plots, box plots and scatter plots of publication quality. We implemented unique algorithms for both data mining and visualization, thus simplifying and accelerating user-driven discovery activities on large multiomics datasets. The present Q-omics software program (v0.95) is available at http://qomics.sookmyung.ac.kr.

A bio-text mining system using keywords and patterns in a grid environment

  • Kwon, Hyuk-Ryul;Jung, Tae-Sung;Kim, Kyoung-Ran;Jahng, Hye-Kyoung;Cho, Wan-Sup;Yoo, Jae-Soo
    • 한국산업정보학회:학술대회논문집
    • /
    • 한국산업정보학회 2007년도 춘계학술대회
    • /
    • pp.48-52
    • /
    • 2007
  • As huge amount of literature including biological data is being generated after post genome era, it becomes difficult for researcher to find useful knowledge from the biological databases. Bio-text mining and related natural language processing technique are the key issues in the intelligent knowledge retrieval from the biological databases. We propose a bio-text mining technique for the biologists who find Knowledge from the huge literature. At first, web robot is used to extract and transform related literature from remote databases. To improve retrieval speed, we generate an inverted file for keywords in the literature. Then, text mining system is used for extracting given knowledge patterns and keywords. Finally, we construct a grid computing environment to guarantee processing speed in the text mining even for huge literature databases. In the real experiment for 10,000 bio-literatures, the system shows 95% precision and 98% recall.

  • PDF

생물학적 데이터 서열들에서 빈번한 최대길이 연속 서열 마이닝 (Mining Maximal Frequent Contiguous Sequences in Biological Data Sequences)

  • 강태호;유재수
    • 정보처리학회논문지D
    • /
    • 제15D권2호
    • /
    • pp.155-162
    • /
    • 2008
  • DNA 염기 서열이나 단백질 아미노산 서열과 같은 생물학적 서열 데이터들은 일반적으로 많은 수의 항목들을 가지고 있다. 생물학적 데이터 서열들에는 보통 빈번하게 발생하는 수 백개의 항목으로 이루어진 연속된 서열들이 존재한다. 이들 서열들에서 빈번하게 발생하는 연속 서열을 검색하는 것은 생물학적 서열 분석에서 중요한 부분을 차지하고 있다. 이전에는 순차 패턴을 효과적으로 발견하고자 하는 많은 연구들이 수행되었으며 대부분의 기존 순차패턴 마이닝 기법들은 Apriori 알고리즘을 기반으로 한다. PrefixSpan 알고리즘은 Apriori 기반의 가장 효율적인 순차패턴 마이닝 기법이다. 하지만 이 알고리즘은 길이-1인 빈발 패턴들로 부터 서열 패턴을 확장해나가는 방식이다. 따라서 길이가 긴 연속 서열을 포함하는 생물학적 데이터서열들에 대한 검색방법으로는 적합하지 않다. 최근에는 기존의 PrefixSpan방식을 이용하면서도 반복적인 처리과정을 줄인 MacosVSpan이 제안되었다. 하지만 이 알고리즘 또한 길이가 긴 생물학적 데이터 서열들로부터 빈번하게 발생하는 연속 서열들을 검색하기에는 효율적이지 않다. 본 논문에서는 많은 양의 생물학적 데이터 서열들로부터 빈번한 연속서열을 고정길이 확장 트리를 이용하여 효과적으로 찾아내는 방법을 제안한다. 그리고 다양한 환경에서 실험을 통해 제안하는 방식이 MacosVSpan알고리즘에 비해 검색성능이 보다 우수함을 보인다.

Genome data mining for everyone

  • Lee, Gir-Won;Kim, Sang-Soo
    • BMB Reports
    • /
    • 제41권11호
    • /
    • pp.757-764
    • /
    • 2008
  • The genomic sequences of a huge number of species have been determined. Typically, these genome sequences and the associated annotation data are accessed through Internet-based genome browsers that offer a user-friendly interface. Intelligent use of the data should expedite biological knowledge discovery. Such activity is collectively called data mining and involves queries that can be simple, complex, and even combinational. Various tools have been developed to make genome data mining available to computational and experimental biologists alike. In this mini-review, some tools that have proven successful will be introduced along with examples taken from published reports.

A Survey of Transfer and Multitask Learning in Bioinformatics

  • Xu, Qian;Yang, Qiang
    • Journal of Computing Science and Engineering
    • /
    • 제5권3호
    • /
    • pp.257-268
    • /
    • 2011
  • Machine learning and data mining have found many applications in biological domains, where we look to build predictive models based on labeled training data. However, in practice, high quality labeled data is scarce, and to label new data incurs high costs. Transfer and multitask learning offer an attractive alternative, by allowing useful knowledge to be extracted and transferred from data in auxiliary domains helps counter the lack of data problem in the target domain. In this article, we survey recent advances in transfer and multitask learning for bioinformatics applications. In particular, we survey several key bioinformatics application areas, including sequence classification, gene expression data analysis, biological network reconstruction and biomedical applications.

BINGO: Biological Interpretation Through Statistically and Graph-theoretically Navigating Gene $Ontology^{TM}$

  • Lee, Sung-Geun;Yang, Jae-Seong;Chung, Il-Kyung;Kim, Yang-Seok
    • Molecular & Cellular Toxicology
    • /
    • 제1권4호
    • /
    • pp.281-283
    • /
    • 2005
  • Extraction of biologically meaningful data and their validation are very important for toxicogenomics study because it deals with huge amount of heterogeneous data. BINGO is an annotation mining tool for biological interpretation of gene groups. Several statistical modeling approaches using Gene Ontology (GO) have been employed in many programs for that purpose. The statistical methodologies are useful in investigating the most significant GO attributes in a gene group, but the coherence of the resultant GO attributes over the entire group is rarely assessed. BINGO complements the statistical methods with graph-theoretic measures using the GO directed acyclic graph (DAG) structure. In addition, BINGO visualizes the consistency of a gene group more intuitively with a group-based GO subgraph. The input group can be any interesting list of genes or gene products regardless of its generation process if the group is built under a functional congruency hypothesis such as gene clusters from DNA microarray analysis.

PubMine: An Ontology-Based Text Mining System for Deducing Relationships among Biological Entities

  • Kim, Tae-Kyung;Oh, Jeong-Su;Ko, Gun-Hwan;Cho, Wan-Sup;Hou, Bo-Kyeng;Lee, Sang-Hyuk
    • Interdisciplinary Bio Central
    • /
    • 제3권2호
    • /
    • pp.7.1-7.6
    • /
    • 2011
  • Background: Published manuscripts are the main source of biological knowledge. Since the manual examination is almost impossible due to the huge volume of literature data (approximately 19 million abstracts in PubMed), intelligent text mining systems are of great utility for knowledge discovery. However, most of current text mining tools have limited applicability because of i) providing abstract-based search rather than sentence-based search, ii) improper use or lack of ontology terms, iii) the design to be used for specific subjects, or iv) slow response time that hampers web services and real time applications. Results: We introduce an advanced text mining system called PubMine that supports intelligent knowledge discovery based on diverse bio-ontologies. PubMine improves query accuracy and flexibility with advanced search capabilities of fuzzy search, wildcard search, proximity search, range search, and the Boolean combinations. Furthermore, PubMine allows users to extract multi-dimensional relationships between genes, diseases, and chemical compounds by using OLAP (On-Line Analytical Processing) techniques. The HUGO gene symbols and the MeSH ontology for diseases, chemical compounds, and anatomy have been included in the current version of PubMine, which is freely available at http://pubmine.kobic.re.kr. Conclusions: PubMine is a unique bio-text mining system that provides flexible searches and analysis of biological entity relationships. We believe that PubMine would serve as a key bioinformatics utility due to its rapid response to enable web services for community and to the flexibility to accommodate general ontology.