Benchmarking of BioPerl, Perl, BioJava, Java, BioPython, and Python for Primitive Bioinformatics Tasks and Choosing a Suitable Language

  • Ryu, Tae-Wan (Dept of Computer Science, California State University)
  • Published : 2009.06.30

Abstract

Many programming languages have recently emerged for developing bioinformatics applications. In addition to traditional languages, languages from open source projects such as BioPerl, BioPython, and BioJava have become popular because they provide specialized tools for biological data processing and are easy to use. However, it has not been well studied which of these languages is most suitable for a given bioinformatics task, or which factors should be considered when choosing a language for a project. Like many other application projects, bioinformatics projects involve various types of tasks, so characterizing every aspect of a project in order to choose a language is a challenge. Most projects, however, require some common, primitive tasks such as file I/O, text processing, and basic computation for counting, translation, statistics, etc. This paper presents benchmarking results for six popular languages, Perl, BioPerl, Python, BioPython, Java, and BioJava, on several common and simple bioinformatics tasks. The experimental results for each language are compared using quantitative evaluation metrics: execution time, memory usage, and source code size. Qualitative factors that affect the success of a project, including writeability, readability, portability, scalability, and maintainability, are also discussed. The results of this research can help developers choose an appropriate language for developing bioinformatics applications.
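As a rough illustration of the kind of primitive task and quantitative metrics described in the abstract, the following Python sketch counts nucleotides in a DNA string and measures execution time and peak memory for that task. It is not the paper's benchmark code: the function names (`count_bases`, `gc_content`, `benchmark`) and the synthetic input are assumptions made for this example, and a real comparison would repeat each measurement many times and use realistic sequence files.

```python
import time
import tracemalloc
from collections import Counter


def count_bases(sequence):
    """Count occurrences of each nucleotide in a DNA string."""
    return Counter(sequence.upper())


def gc_content(sequence):
    """Return the fraction of G and C bases (0.0 for an empty string)."""
    counts = count_bases(sequence)
    total = sum(counts.values())
    return (counts["G"] + counts["C"]) / total if total else 0.0


def benchmark(task, *args):
    """Return (result, elapsed seconds, peak traced bytes) for one task.

    The task is run twice: once untraced for timing, and once under
    tracemalloc for memory, because tracing slows execution.
    """
    start = time.perf_counter()
    result = task(*args)
    elapsed = time.perf_counter() - start

    tracemalloc.start()
    task(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


if __name__ == "__main__":
    # Synthetic input for illustration only, not data from the paper.
    dna = "ATGCGC" * 100_000
    counts, seconds, peak_bytes = benchmark(count_bases, dna)
    print(dict(counts))
    print(f"time: {seconds:.4f} s, peak memory: {peak_bytes} bytes")
```

Source code size, the third quantitative metric mentioned, could be approximated by counting the non-blank lines of each implementation. Note that `tracemalloc` only reports memory allocated by Python itself, so its figures are not directly comparable to process-level measurements in Perl or Java.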
