Probabilistic filtering for a biological knowledge discovery system with text mining and automatic inference


  • Received : 2011.11.08
  • Accepted : 2011.11.22
  • Published : 2012.02.29


In this paper, we discuss the structure of a biological knowledge discovery system based on text mining and automatic inference. Given a set of biology documents, the system produces a new hypothesis in an integrated manner: the text mining module first extracts 'event' information of predefined types from the documents, and the inference module then derives a new hypothesis from the extracted results. Such an integrated system can draw on more up-to-date and diverse information than other automatic knowledge discovery systems. Its success, however, depends critically on the precision of the text mining module, because any hypothesis built on even a single false positive is highly likely to be erroneous. We therefore propose a probabilistic filtering method that removes false positives from the extraction results. The proposed method outperforms an occurrence-based baseline method.
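
The contrast between an occurrence-based baseline and a probabilistic filter can be illustrated with a minimal sketch. The abstract does not specify the probabilistic model, so everything below is an assumption made for illustration only: the per-mention confidence scores, the noisy-OR combination rule, the thresholds, and the function names are hypothetical and should not be read as the authors' method.

```python
from collections import defaultdict
from math import prod

# Hypothetical extraction output: (event, per-mention confidence) pairs.
# The events and scores are invented for illustration.
mentions = [
    (("phosphorylation", "TP53"), 0.9),
    (("phosphorylation", "TP53"), 0.8),
    (("binding", "BRCA1", "RAD51"), 0.3),
    (("binding", "BRCA1", "RAD51"), 0.3),
    (("binding", "BRCA1", "RAD51"), 0.2),
    (("expression", "MYC"), 0.95),
]


def group_by_event(mentions):
    """Collect the confidence of every mention under its event."""
    grouped = defaultdict(list)
    for event, confidence in mentions:
        grouped[event].append(confidence)
    return grouped


def occurrence_filter(mentions, min_count=2):
    """Baseline: keep an event if it was extracted at least min_count times."""
    grouped = group_by_event(mentions)
    return {event for event, confs in grouped.items() if len(confs) >= min_count}


def noisy_or_filter(mentions, threshold=0.9):
    """Assumed probabilistic filter: keep an event if the noisy-OR of its
    mention confidences, 1 - prod(1 - c), reaches the threshold."""
    grouped = group_by_event(mentions)
    return {
        event
        for event, confs in grouped.items()
        if 1 - prod(1 - c for c in confs) >= threshold
    }


kept_by_count = occurrence_filter(mentions)
kept_by_probability = noisy_or_filter(mentions)
```

On this toy input the two filters disagree in both directions: the count-based baseline keeps the thrice-extracted but low-confidence binding event and drops the single high-confidence expression event, while the noisy-OR filter does the opposite.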


Supported by: National Research Foundation of Korea, Korea Research Foundation

