BioProjects

The Bio-projects

(editor PjotrPrins)

Introduction

BioRuby, Bioperl, Biopython and Bioconductor (for R) are all very popular and active open source projects for interpreted languages. BioRuby, Bioperl and Biopython are member projects of the Open Bioinformatics Foundation; Bioconductor is run as a separate project.

In this section we aim to provide a quick overview of these libraries, which may prove useful when selecting tools for a new project. In this book we make a strong case for using Ruby as a programming language. Nevertheless, the choice of library may matter more for final productivity than the choice of language, so it should not be underestimated.

Pjotr writes:

I have used Perl for a long time, but now use Ruby, Python and R when appropriate. If I were to look purely at the language I would prefer Ruby every time, as it allows the best level of expression while delivering easy to maintain code. R is my second choice: the language is not as nice, but it is extremely well geared to statistics and matrix data. The R language is conceptually different from Ruby, Python and Perl, and combined with its huge statistical functionality it allows for rapid data analysis. It is a cool toolbox to use in conjunction with one of the others. I use Python when it has tools that Ruby lacks, in particular ZOPE, a component framework for web development. Python is close kin to Ruby and productivity should be of the same order. Nevertheless Ruby is just a bit nicer and better designed as a language. Last is Perl, which I used extensively but dropped in the end (apart from the brilliant Mail::Box library). Programming Perl is harder, in my opinion, and leads to code that is harder to maintain. Before this turns into a language war, I would like to add that all these languages are easy to learn and apply. You should learn them all to appreciate their relative strengths and weaknesses. In the end it will make you a better programmer anyway.

BioRuby

BioRuby, while listed first here, is actually one of the newer kids on the block. The libraries are not as extensive as the others, but a lot of the fundamental material is there and, because Ruby encourages it, it is crafted in an object oriented way.

Bioperl

The Bioperl project is the granddaddy of libraries in biology and still something of a trend setter. It is rather large and, for many bioinformaticians, a first port of call. It was used in the Human Genome Project.

Biopython

Biopython is also a large and serious library, and it is catching up with Bioperl. Many scientists who don't want to use Perl opt for Biopython.

R and Bioconductor

Bioconductor is another large project with very strong statistical tools. It is a popular library for microarray analysis since it supports most formats and comes with many graphical tools.

Programming R differs from programming the other languages. First and foremost, R's libraries have a limited number of interfaces, but each is rich in 'configuration'. So there are few functions to learn, but each requires a lot of parameter tweaking. R's variable parameter lists encourage that; read the help page on reshape, for example.

Fortunately most interfaces are rather well documented, but a lot of functionality is packed behind subtle differences in parameters. With write.table, row.names=FALSE does something quite different from row.names=NA. This contrasts quite clearly with Ruby's object oriented approach, which aims to provide clean and simple interfaces.
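
To make the contrast a little more concrete, here is a minimal Ruby sketch of the object oriented style described above. The TableWriter class and its method names are hypothetical, made up for this illustration; they are not part of BioRuby or any other library, and the sketch does not try to reproduce what write.table does. The idea is only that each variant of the output gets its own small, named method, rather than being selected through the value of a parameter.

```ruby
# Hypothetical sketch: TableWriter is an illustrative class, not part of
# BioRuby or any other library.
class TableWriter
  def initialize(header, rows)
    @header = header
    @rows = rows
  end

  # Header line followed by the data rows, tab separated.
  def to_text
    format_lines(@header, @rows)
  end

  # Same table with a row name in front of each row; the header gets an
  # empty first cell so the columns still line up.
  def to_text_with_row_names(names)
    named_rows = names.zip(@rows).map { |name, row| [name] + row }
    format_lines([""] + @header, named_rows)
  end

  private

  # Join cells with tabs and lines with newlines.
  def format_lines(header, rows)
    ([header] + rows).map { |cells| cells.join("\t") }.join("\n")
  end
end

writer = TableWriter.new(%w[gene expression], [["BRCA1", 2.5], ["TP53", 1.0]])
puts writer.to_text
puts writer.to_text_with_row_names(%w[r1 r2])
```

Whether this is nicer than one function with many parameters is partly a matter of taste; the trade-off is more methods to name against fewer parameter combinations to remember.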

R's strength comes mostly from its rich and well tested libraries. A problem can often be solved in a few lines of R code, and that can be a big gain. On the other hand, R's interpreter is slow. In one case a loop iterating over 1 million records took 36 hours in plain R. Rewriting the routine to use built-in functions brought that down to a minute. But working out how to use the built-in functions cleverly took a working day; it would have been faster to program in Ruby. Ruby also has much better support for writing maintainable code and unit tests. Finally, the OOP implementation of R is thin; it hardly deserves the name.

Rule of thumb for bioinformaticians: R is great for quick one-offs. If you need to calculate something and maybe make figures, use R first. Remember that for almost every task there already exists an R package. If you program something that takes more than a day, consider programming in Ruby instead. And maybe, once you have it debugged and want to publish it, rewrite it in R and make an R package, so others can benefit from your work.

For more information on using Ruby with R see also the section on using RubyWithRlang.

NOTE: All projects are in a state of flux and things change over time. This comparison is probably (already) out of date. If you have amendments/additions please mail them to PjotrPrins.
NOTE: This is not a Biopython vs. Bioperl comparison, nor a BioRuby vs. Bioconductor comparison. In many ways the functionalities are hard to compare because library design, object representation and usage differ. Implementation details such as efficiency, or storage in memory or on disk, are not shown either. Note also that some items may not be available as part of the 'Bio' package, but come as separate modules instead.

The point of this table is to help someone who needs to solve a problem in biology get started.

Conclusion


Question

Q: What facilities do BioPerl, BioRuby and BioPython have for "literate programming"? I view that as one of BioConductor's great strengths: they have the ability to produce what they call a "Compendium", which is driven by "Noweb" "tangle" and "weave" capabilities.

A: All these languages allow for inline documentation, which in turn allows the use of noweb or nuweb based literate programming. A literate programming example for Ruby can be found here. But I don't know anyone who uses it. In my opinion, well written code documents itself. But I realise that is not exactly what literate programming tries to address.

Ed Borasky adds:

1. I have a collection of Literate Programming links, mostly applied to LP in R. Here they are:

http://gentleman.fhcrc.org/Fld-talks/RGRepRes.pdf
http://sepwww.stanford.edu/research/redoc/
http://webpages.charter.net/edreamleo/front.html
http://www-cs-faculty.stanford.edu/~knuth/lp.html
http://www-stat.stanford.edu/~donoho/Reports/1995/wavelab.pdf
http://www.ad-astra.ro/journal/2/vlad_reproducibility.pdf
http://www.bepress.com/bioconductor/paper2/
http://www.bioconductor.org/docs/papers/2003/Compendium
http://www.bioconductor.org/docs/papers/2003/Compendium/Golub.pdf
http://www.biostat.harvard.edu/~rgentlem/Pdf/RR.pdf
http://www.ci.tuwien.ac.at/Conferences/DSC-2001/Proceedings/NeuwirthBaier.pdf
http://www.ci.tuwien.ac.at/Conferences/DSC-2001/Proceedings/Rossini.pdf
http://www.econ.uiuc.edu/~roger/repro.html
http://www.econ.uiuc.edu/~roger/research/repro/repro.ps
http://www.literateprogramming.com/
http://www.stat.umn.edu/~charlie/Sweave/
http://www.stat.washington.edu/jaw/jaw.research.reproducible.html
http://www.vivtek.com/lpml/language.html

2. Specific to Python, there is a wonderful tool called Leo. See

http://webpages.charter.net/edreamleo/front.html

In theory, Leo works with all programming languages, but it is written in Python and scripted in Python. You can use it without knowing Python, but I've found it a challenge being Python-illiterate. I am plugging away at it in R and Ruby, though, because it looks like a "better way" to do literate programming.