Why Biologists Want to Program Computers
by James Tisdall, author of Beginning Perl for
Bioinformatics
As part of my work in bioinformatics over the last ten years, I've been
teaching biologists to program, and I've given courses and workshops and
offered one-on-one advice towards that end.
The students (from vice presidents to principal investigators to junior lab
assistants) who attend these courses do so to learn about programming for
biology research. I've often been asked to give my perspective on the
benefits of learning programming, considering the expenditure of time and
effort that is required to learn this new and important laboratory skill.
Over the last decade there has been an accelerating interest in acquiring
programming skills on the part of biologists. My new book Beginning Perl
for Bioinformatics from O'Reilly & Associates is designed to
address the need for training in this area by teaching programming in the
context of biologically relevant data and results.
This article will examine why a biologist would want to learn to
program. There are two main reasons: scientific and economic. I hope that
the discussion will also be of some use to programmers thinking of entering
the bioinformatics field. But first, I'll take a short tour of some history,
define some terms, and make some general comments about how programming
fits into biology research.
Definitions and History
Bioinformatics is the somewhat new and rather unfortunate term that
is commonly applied to the use of computers in
biological research, especially in such fields as genomics, sequencing, and
genetics. Oh well, we're stuck with it. Informatics comes from a
European term for computer science.
Computational biology is an alternative term for the use of
computers in biology research. While we could quibble over an exact
definition of these and other terms, I tend to just take the word
"bioinformatics" as all-inclusive, and use it to refer in general to the
use of computers in biology research.
There is a long history of the use of computers in biology research, dating
back to the early days of digital computers over 50 years ago. About ten
years ago the large-scale international human genome project began to
generate data of a volume and of a kind that required a carefully planned
and significant computer programming effort. Since then, the field of
bioinformatics has been growing at an increasing rate.
This growth can be measured in several ways. There are now many academic
positions filled by bioinformaticians (also called bioinformaticists or
computational biologists). Training programs have become fairly common, from
workshops to Ph.D. programs to post-doctoral positions. Industrial concerns are staffing many bioinformatics positions, especially in the pharmaceutical,
agricultural, and biotechnology industries. Programming staff positions in
biology research organizations have increased. Coverage of bioinformatics in
scientific journals; scientific conferences dedicated to or incorporating
bioinformatics; bioinformatics Web sites; published books on the
subject--all have seen significant and even accelerating growth.
Two Pursuits: Biology and Computer Science
There's a bit of a cultural divide between biologists and computer
scientists. This is only natural; the two disciplines are really fairly
different, and have their own literatures, techniques, and fundamental
principles.
To combine both disciplines is necessary, but the fit is sometimes a little
tight across the shoulders. It is relatively rare to find an individual who
has solid academic training in both fields; most folks are trained in one,
and find themselves wandering into the other field.
It is also important for researchers to understand enough of each other's discipline to successfully collaborate with one another. Biologists
who know enough computer science to be able to explain a problem in terms a
programmer can find useful--and programmers who know enough biology to
understand what their biologist colleagues need in a program--are valuable.
Biologists learning programming can sometimes encounter significant snags.
Some very talented biologists just don't "take to" programming, while
others find they can absorb it without undue difficulty. For example, if
you went into biology because you perceived that it was a science that
didn't require too much math, then computer science may be an uncomfortable
thing to learn. I must point out, though, that you can do a lot of
programming with only a bare minimum of math. (Also, lest it be thought I
am slurring biologists, it is certainly true that many are well trained in
mathematics, especially statistics, and use it extensively in their work.)
Finally, it's important for biologists to realize that there's a lot more
to computer science than simply learning a programming language or two.
That's the bad news. The good news is that most biologists aren't
interested in learning the computer science concepts used in compiler design,
structural complexity theory, advanced algorithms, or such; they just want
to learn enough of a programming language to do some practical, useful
tasks that will advance their research. And that is possible to do without
a graduate degree in computer science. It will come as no surprise that I
recommend the Perl programming language as a good place to begin for such
practical, results-oriented programming skills. And even those biologists who
are interested in pursuing the deeper results of computer science have to start at the beginning with learning a programming language.
In my opinion, programmers going into biology often have the harder time of
it. I should mention a common pitfall that programmers entering biology
research encounter. Biology is subtle, and it can take lots of work to begin to get a handle on the variety of living organisms. Programmers new to the field
sometimes write a perfectly good program for what turns out to be the wrong
problem! I recommend to programmers that they at least study a book like
Recombinant DNA, 2nd edition, by Watson, Gilman, Witkowski, and
Zoller; and that they ask a lot of
questions of the biologists for whom they're writing their programs.
Use Programs, Don't Write Them
You can do bioinformatics without learning how to program. It is not
uncommon for bioinformatics specialists to become adept at using existing
bioinformatics programs, without ever learning the programming skills
necessary to actually build such tools. There are now many programs
available on Web sites or elsewhere that give convenient access to a
significant amount of biological data.
More and more biology researchers in all fields are finding that the use of
bioinformatics computer tools has become a regular part of their research.
Many, if not most, of these researchers do not have programming
skills, and are getting along quite well without them. For this group, the
response to the title of this article is "They don't want to program
computers!"
Write Programs When There's Nothing to Use
However, many biologists do want to learn how to program. They believe that
programming skills can significantly help them to achieve their research
goals. In research, questions often arise that could be answered or
facilitated by a computer program--but the program doesn't yet exist. So a
programmer is needed.
These research questions can range from straightforward ones easily
programmed, all the way to complex questions requiring the invention of new
algorithms. Advanced programming skills are not usually necessary, however.
A basic skill set will do nicely to advance the work of most labs.
The question then becomes not "shall we program?" but rather "who's going
to do the programming?" Perhaps the PI has the interest to learn and
practice this new skill; this happens fairly often, despite the demands of
research and grant writing, since the payoffs can be considerable. Or a
staff scientist, postdoc, or student may leap into the breach. Many
times a department, or institution, will maintain staff bioinformatics
specialists who divide their time between the various labs that need their
labor.
Programming Takes Time
There's an important fact that is sometimes underappreciated by biologists
who are new to programming. That fact is that programming is
labor-intensive. Writing a substantial program still takes skilled
people a nontrivial amount of time. Think of the time it can take to work
the kinks out of a new experimental protocol; writing a significant program
can often take a similar amount of work. Although powerful computer
hardware is now quite inexpensive, writing the programs that make the
hardware useful can still require considerable resources.
One of the reasons Perl has become a popular bioinformatics programming
language lies in its suitability for rapid prototyping, that is, the
ability to quickly write a working program, thus saving precious time.
Despite all the attention that the computer world focuses on the clock
speed in megahertz of various computer models, in today's research the most
important speed to look at is, usually, the speed of programming.
Choosing the right language for a programming job is an important part of
this, and often (not always) Perl is the right language in a biology
research setting.
Before I get pulled further into this digression on software
engineering, or the study of the art of programming, let me now return to
my main point, that many biologists want to learn programming.
What are the scientific reasons for this?
Scientific Reasons for Learning Bioinformatics Programming
I'll touch on these representative (not comprehensive) reasons for
learning bioinformatics programming:
- Quantity of existing data
- Dealing with new data
- Automating the automation
- Evaluating many targets
Quantity of Existing Data
There is now a huge amount of basic biological data, and without computers
to store and search this data, we'd be severely constrained. The
best-known biological data is the map and sequence of the human genome; and
there's a lot of other biological data as well, for humans as well as for
other organisms. Just to use this data (and its use is an essential part of
many research programs) requires computer storage, as well as programs to
make it convenient to search, retrieve, and study the data. Imagine
searching for a motif in the human genome (3 billion base pairs) without
a computer. This point is obvious and well known, so I won't belabor it.
But I will add that the specific data you want to retrieve for your research
may exist in several different databases. Depending on how you want to
select and compare that data, there may not be an already existing program
that you can use. For all but the most common tasks, you may need to write
your own programs to handle the selection and comparison of data from the
database or databases of interest to you. I'll give an example of this
shortly.
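As a rough illustration of the motif search mentioned above, here is a minimal sketch. It is written in Python purely for illustration; the regular-expression approach carries over directly to Perl. The sequence, the motif, and the function name find_motif are all invented for this example, and a real genome would be read from a file rather than typed in as a string.

```python
import re

def find_motif(sequence, motif):
    """Return the 0-based start positions of every match, overlaps included."""
    # Wrapping the motif in a lookahead lets the regex engine report
    # occurrences that overlap one another.
    pattern = re.compile(r"(?=(" + motif + r"))", re.IGNORECASE)
    return [match.start() for match in pattern.finditer(sequence)]

# Toy data, invented for illustration only.
dna = "ACGTTATAAGGCTATAATGCGTATAAT"

# The motif is an ordinary regular expression: TATAA followed by A or T.
positions = find_motif(dna, "TATAA[AT]")
print(positions)
```

The motif is passed as a regular expression, so a character class such as [AT] can stand for "either base here" without any change to the function.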
Dealing with New Data
Many biologists find that their laboratory notebooks still work just fine
for recording their results.
But for an increasing number of researchers, new laboratory techniques
(microarrays, dHPLC, gene chips, high-throughput sequencing, and so on) are
generating a volume of data that requires computer technology. This can run
the gamut from fairly simple tools such as spreadsheets, all the way to
complex relational databases, and beyond. Designing and implementing
complex databases is a specialized skill, and a career, in itself; but most
programmers have at least a working knowledge of these skills.
Some biology research areas have long had a need for computing with large
amounts of experimental data. For instance, the determination of a protein
structure by X-ray crystallography generates large amounts of data, and
requires sophisticated algorithms to determine the structure from that
data.
The point is that handling the experimental data of a lab may require
computer programming to make the data readily accessible, to perform
statistical analyses of the data, and to share the data with colleagues.
Many research grants now include provisions to make the data and results of
a project available for public inspection. This is often accomplished by
storing the data in a database, and providing a Web page and interactive
programs to enable visitors to explore the results.
A very common approach to this task is to use the no-cost open
source Perl language for the programming (using the ever popular
CGI.pm module), perhaps combined with the no-cost Apache Web server and a
no-cost database such as MySQL or PostgreSQL and a no-cost platform such as
Linux or BSD. In other words, apart from the hardware, all the software is
free (and high quality), which does wonders for the lab budget. For those
with Macintosh or Windows computers, the same approach will also work, as
Perl, Apache, and the databases are also available on those platforms.
Automating the Automation
One of the most valuable, time-saving programming skills that a biologist
can learn is how to automate the automation. This means writing
programs that run, and collect the output from, other programs, thereby
eliminating the need to run the other programs in person while sitting at
the computer.
Let's give a simple example of a case in which you have some large number
of items that need to be examined closely by a range of programs.
Evaluating Many Targets
Say you have two hundred candidate targets for some biological experiment.
The experiment is lengthy and costly, and so you need to evaluate the
targets in order to select the most promising ones. You have several
programs whose combined results allow you to make a reasonable selection. If
running the needed programs, and examining and prioritizing the results,
takes an hour per target, then you've got about a month's worth of work
sitting in front of a computer screen ahead of you.
So instead, you write a program. This program takes each target one at a
time, then runs the auxiliary programs and collects and collates the
results. After it's done, it prioritizes the targets according to the
criteria you have programmed, and then it presents you with a "top ten" list
of targets. The program may take a couple of days, or a week, to write; it
takes 30 minutes to run. You've saved yourself the rest of the month for
doing "real" biology, actually performing the experiments on the targets;
and you're on track to get to the finish line three weeks sooner.
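The kind of program described here can be sketched as follows, in Python for illustration only. Everything specific is an assumption made for the sake of a runnable example: the two "tools" are tiny stand-in commands rather than real analysis programs, each is assumed to print a single numeric score, and the ranking criterion (the sum of the scores) is hypothetical. In a real lab the command lines and the prioritizing rule would be whatever your own programs and judgment call for.

```python
import subprocess
import sys

# Stand-ins for real analysis tools (hypothetical). Each one is a small
# command that prints a numeric score for the target given as its argument.
TOOLS = {
    "length_score": [sys.executable, "-c",
                     "import sys; print(len(sys.argv[1]))"],
    "gc_score": [sys.executable, "-c",
                 "import sys; s = sys.argv[1]; "
                 "print(s.count('G') + s.count('C'))"],
}

def run_tool(command, target):
    """Run one external program on one target and return its printed score."""
    result = subprocess.run(command + [target], capture_output=True,
                            text=True, check=True)
    return float(result.stdout.strip())

def rank_targets(targets, top_n=10):
    """Score every target with every tool; return the top_n best targets."""
    scored = []
    for target in targets:
        # Assumed criterion: a target's priority is the sum of its scores.
        total = sum(run_tool(command, target) for command in TOOLS.values())
        scored.append((total, target))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [target for _, target in scored[:top_n]]

# Toy targets, invented for illustration only.
candidates = ["AT", "GGCC", "GCA"]
print(rank_targets(candidates, top_n=2))
```

The loop replaces the hour per target spent at the keyboard: each auxiliary program is started, its output is captured, and the results are collated without anyone watching. At an hour per target, two hundred targets come to two hundred hours by hand, which is where the month of work in the example comes from.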
Economic Reasons for Learning Bioinformatics Programming
Bioinformatics skills are commanding a premium in the marketplace, with a
lot of the demand coming from the private sector.
For many biologists now getting their training in graduate school, or doing
their postdocs, it is an unpleasant fact that an oversupply of trained
people, compared to the demand, may result in a relatively low rate of pay,
depending of course on their area of specialization. Reports in leading
journals have decried the overproduction of biology Ph.D.s relative to the
level of funding for biology in general. For some of these biologists,
bioinformatics skills can significantly enhance their job prospects and
their salaries, because there is a shortage of trained bioinformatics people
relative to the demand.
And that fact of supply and demand in the labor market for biology
researchers is the economic reason that biologists want to learn
programming. I've even seen young Ph.D.s in my classes who, despairing of
finding a decent position, are learning programming with the intention of
leaving biology research altogether. Of course, we could all lobby for an
increase in funding for biology research. (This is the golden age of
biological research, when such funding can be expected to yield great
results.) But the competition for jobs and grants is likely to remain quite
heated for some time to come.
It is not hard to find (say, by searching Google for the words
"bioinformatics salary") lots of reports on the premium being paid for
trained bioinformaticians, and especially for experienced bioinformaticians.
The field is simply growing faster than the number of qualified individuals
who can fill the need for bioinformatics skills.
Trends and Predictions
The upward trend of the use of computers in biology research has been going
on for several years now. In the time-honored tradition of futurists
everywhere, I predict that the current trend will continue!
Actually, there are solid reasons to suppose that the prediction is true. If
a simple appeal to authority is acceptable, then you should note that
continuing growth in bioinformatics has been predicted by many scientific
leaders, commissions, universities, businesses, and granting agencies. Nor,
come to think of it, can I recall anyone making a prediction of no growth,
or decline, in the demand for bioinformatics. I'm not an economist, so I'll
defer on the details of such predictions to those who are.
If your work is in biology research, then you must decide for yourself
whether learning programming is a practical and effective way to advance
your research. That is the bottom line, after all. For many bench
experimentalists, programming is this kind of useful and productive research
skill.
James Tisdall has worked as a musician, as a programmer and Member
of Technical Staff at Bell Labs (where he programmed for speech research
and discovered a formal language for musical rhythm), as a programmer and
systems manager at the Human Genome Project in the Computational Biology and
Informatics Laboratory (where he began using Perl for bioinformatics in 1991
with his program DNA WorkBench), as a computational biologist at Mercator
Genetics in Menlo Park, California (where his Perl programs helped discover the gene involved in the common hereditary disease hemochromatosis), as manager of
bioinformatics at the Fox Chase Cancer Center in Philadelphia, and most
recently as a consultant for Biocomputing Associates of Kimberton,
Pennsylvania, and the Burke Medical Research Institute affiliated with Cornell
University, working on neurodegenerative diseases such as Alzheimer's and
Parkinson's.
O'Reilly & Associates will soon release (October 2001) Beginning Perl for
Bioinformatics.