Why Biologists Want to Program Computers
by James Tisdall, author of Beginning Perl for
Bioinformatics
As part of my work in bioinformatics over the last ten years,
I've been teaching biologists to program, and I've given courses
and workshops and offered one-on-one advice towards that end.
The students (from vice presidents to principal investigators
to junior lab assistants) who attend these courses do so to learn
about programming for biology research. I've often been asked to
give my perspective on the benefits of learning programming,
considering the expenditure of time and effort that is required
to learn this new and important laboratory skill.
Over the last decade there has been an accelerating interest
in acquiring programming skills on the part of biologists. My new
book Beginning Perl for Bioinformatics from O'Reilly &
Associates is designed to address the need for training in this
area by teaching programming in the context of biologically
relevant data and results.
This article will examine why a biologist would want to learn
to program. There are two main reasons: scientific and economic.
I hope that the discussion will also be of some use to
programmers thinking of entering the bioinformatics field. But
first, I'll take a short tour of some history, define some terms,
and make some general comments about how programming fits into
biology research.
Definitions and History
Bioinformatics is the somewhat new and rather
unfortunate term that is commonly employed for referring to the
use of computers in biological research, especially in such
fields as genomics, sequencing, and genetics. Oh well, we're
stuck with it. Informatics comes from a European term for
computer science.
Computational biology is an alternative term for the
use of computers in biology research. While we could quibble over
an exact definition of these and other terms, I tend to just take
the word "bioinformatics" as all-inclusive, and use it to refer
in general to the use of computers in biology research.
There is a long history of the use of computers in biology
research, dating back to the early days of digital computers over
50 years ago. About ten years ago the large-scale international
human genome project began to generate data of a volume and of a
kind that required a carefully planned and significant computer
programming effort. Since then, the field of bioinformatics has
been growing at an increasing rate.
This growth can be measured in several ways. There are now
many academic positions filled by bioinformaticians (also called
bioinformaticists or computational biologists). Training programs
have become fairly common, from workshops to Ph.D. programs to
post-doctoral positions. Industrial concerns are staffing many
bioinformatics positions, especially in the pharmaceutical,
agricultural, and biotechnology industries. Programming staff
positions in biology research organizations have increased.
Coverage of bioinformatics in scientific journals; scientific
conferences dedicated to or incorporating bioinformatics;
bioinformatics Web sites; published books on the subject--all
have seen significant and even accelerating growth.
Two Pursuits: Biology and Computer Science
There's a bit of a cultural divide between biologists and
computer scientists. This is only natural; the two disciplines
are really fairly different, and have their own literatures,
techniques, and fundamental principles.
To combine both disciplines is necessary, but the fit is
sometimes a little tight across the shoulders. It is relatively
rare to find an individual who has solid academic training in
both fields; most folks are trained in one, and find themselves
wandering into the other field.
It is also important for researchers to understand enough of
each other's discipline to successfully collaborate with one
another. Biologists who know enough computer science to be able
to explain a problem in terms a programmer can find useful--and
programmers who know enough biology to understand what their
biologist colleagues need in a program--are valuable.
Biologists learning programming can sometimes encounter
significant snags. Some very talented biologists just don't "take
to" programming, while others find they can absorb it without
undue difficulty. For example, if you went into biology because
you perceived that it was a science that didn't require too much
math, then computer science may be an uncomfortable thing to
learn. I must point out, though, that you can do a lot of
programming with only a bare minimum of math. (Also, lest it be
thought I am slurring biologists, it is certainly true that many
are well trained in mathematics, especially statistics, and use
it extensively in their work.) Finally, it's important for
biologists to realize that there's a lot more to computer science
than simply learning a programming language or two.
That's the bad news. The good news is that most biologists
aren't interested in learning the computer science concepts used
in compiler design, structural complexity theory, advanced
algorithms, or such; they just want to learn enough of a
programming language to do some practical, useful tasks that will
advance their research. And that is possible to do without a
graduate degree in computer science. It will come as no surprise
that I recommend the Perl programming language as a good place to
begin for such practical, results-oriented programming skills.
And even those biologists who are interested in pursuing the
deeper results of computer science have to start at the beginning
with learning a programming language.
In my opinion, programmers going into biology often have the
harder time of it. I should mention a common pitfall that
programmers entering biology research encounter. Biology is
subtle, and it can take lots of work to begin to get a handle on
the variety of living organisms. Programmers new to the field
sometimes write a perfectly good program for what turns out to be
the wrong problem! I recommend to programmers that they at least
study a book like Recombinant DNA, 2nd edition by Watson, Gilman,
Witkowski, and Zoller; and that they ask a lot of
questions of the biologists for whom they're writing their
programs.
Use Programs, Don't Write Them
You can do bioinformatics without learning how to program. It
is not uncommon for bioinformatics specialists to become adept at
using existing bioinformatics programs, without ever learning the
programming skills necessary to actually build such tools. There
are now many programs available on Web sites or elsewhere that
give convenient access to a significant amount of biological
data.
More and more biology researchers in all fields are finding
that the use of bioinformatics computer tools has become a
regular part of their research. Many, if not most, of these
researchers do not have programming skills, and are getting along
quite well without them. For this group, the answer to the question
in this article's title is "They don't want to program computers!"
Write Programs When There's Nothing to Use
However, many biologists do want to learn how to program. They
believe that programming skills can significantly help them to
achieve their research goals. In research, questions often arise
that could be answered or facilitated by a computer program--but
the program doesn't yet exist. So a programmer is needed.
These research questions can range from straightforward ones
easily programmed, all the way to complex questions requiring the
invention of new algorithms. Advanced programming skills are not
usually necessary, however. A basic skill set will do nicely to
advance the work of most labs.
The question then becomes not "shall we program?" but rather
"who's going to do the programming?" Perhaps the PI has the
interest to learn and practice this new skill; this happens
fairly often, despite the demands of research and grant writing,
since the payoffs can be considerable. Or a staff scientist,
postdoc, or student may leap into the breach. Many times a
department, or institution, will maintain staff bioinformatics
specialists who divide their time between the various labs that
need their labor.
Programming Takes Time
There's an important fact that is sometimes underappreciated
by biologists who are new to programming. That fact is that
programming is labor-intensive. Writing a substantial
program still takes skilled people a nontrivial amount of time.
Think of the time it can take to work the kinks out of a new
experimental protocol; writing a significant program can often
take a similar amount of work. Although powerful computer
hardware is now quite inexpensive, writing the programs that make
the hardware useful can still require considerable resources.
One of the reasons Perl has become a popular bioinformatics
programming language lies in its suitability for rapid
prototyping, that is, the ability to quickly write a working
program, thus saving precious time. Despite all the attention
that the computer world focuses on the clock speed in megahertz
of various computer models, in today's research the most
important speed to look at is, usually, the speed of
programming. Choosing the right language for a programming
job is an important part of this, and often (not always) Perl is
the right language in a biology research setting.
Before I get pulled further into this digression on software
engineering, or the study of the art of programming, let me now
return to my main point, that many biologists want to learn
programming.
What are the scientific reasons for this?
Scientific Reasons for Learning Bioinformatics Programming
I'll touch on these representative (not comprehensive) reasons
for learning bioinformatics programming:
- Quantity of existing data
- Dealing with new data
- Automating the automation
- Evaluating many targets
Quantity of Existing Data
There is now a huge amount of basic biological data, and
without computers to store and search this data, we'd be severely
constrained. The best-known biological data is the map and
sequence of the human genome; and there's a lot of other
biological data as well, for humans as well as for other
organisms. Just to use this data (and its use is an essential
part of many research programs) requires computer storage, as
well as programs to make it convenient to search, retrieve, and
study the data. Imagine searching for a motif in the human genome
(3 billion base pairs) without a computer. This point is obvious
and well known, so I won't belabor it.
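To make the motif-search point concrete, here is a minimal sketch of what such a search looks like in code. It is written in Python purely for illustration (the same few lines translate readily to Perl), and the function name find_motif, the toy sequence, and the motif are all invented for this example. A real genome would be read from a file rather than typed in as a string.

```python
import re

def find_motif(sequence, motif):
    """Return the 0-based start position of every match, including overlaps."""
    # Wrapping the motif in a lookahead makes each match zero-width,
    # so the regex engine also reports occurrences that overlap.
    pattern = re.compile("(?=" + motif + ")", re.IGNORECASE)
    return [match.start() for match in pattern.finditer(sequence)]

# Hypothetical toy data, invented for illustration only.
dna = "ACGTATATACGGATATCC"
positions = find_motif(dna, "ATAT")
print(positions)
```

Because the motif is treated as a regular expression, a pattern such as "GA[AT]TC" can stand for a small family of related sequences.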
But I will add that the specific data you want to retrieve for
your research may exist in several different databases. Depending
on how you want to select and compare that data, there may not be
an already existing program that you can use. For all but the
most common tasks, you may need to write your own programs to
handle the selection and comparison of data from the database or
databases of interest to you. I'll give an example of this
shortly.
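As a small illustration of selecting and comparing data from more than one source, the following Python sketch joins two tab-separated tables on a shared gene identifier. Everything in it is an assumption made for the example: the column names, the gene names, the values, and the helper functions read_table and join_on_gene. In practice the two tables would be exports downloaded from the databases of interest.

```python
import csv
import io

# Hypothetical exports from two different databases (invented data).
EXPRESSION_TSV = "gene_id\texpression\ngeneA\t12.5\ngeneB\t3.1\ngeneC\t8.0\n"
ANNOTATION_TSV = "gene_id\tchromosome\ngeneA\t1\ngeneC\t7\ngeneD\tX\n"

def read_table(text):
    """Parse tab-separated text into a dict of rows keyed by gene_id."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["gene_id"]: row for row in reader}

def join_on_gene(expression, annotation, min_expression):
    """Keep genes found in both tables whose expression meets the cutoff."""
    selected = []
    for gene_id, row in expression.items():
        level = float(row["expression"])
        if gene_id in annotation and level >= min_expression:
            selected.append((gene_id, level, annotation[gene_id]["chromosome"]))
    return selected

result = join_on_gene(read_table(EXPRESSION_TSV), read_table(ANNOTATION_TSV), 5.0)
for gene_id, level, chromosome in result:
    print(gene_id, level, chromosome)
```

The selection rule (present in both tables, expression above a cutoff) is exactly the kind of lab-specific criterion that an off-the-shelf tool is unlikely to offer.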
Dealing with New Data
Many biologists find that their laboratory notebooks still
work just fine for recording their results.
But for an increasing number of researchers, new laboratory
techniques (microarrays, dHPLC, gene chips, high-throughput
sequencing, and so on) are generating a volume of data that
requires computer technology. This can run the gamut from fairly
simple tools such as spreadsheets, all the way to complex
relational databases, and beyond. Designing and implementing
complex databases is a specialized skill, and a career, in
itself; but most programmers have at least a working knowledge of
these skills.
Some biology research areas have long had a need for computing
with large amounts of experimental data. For instance, the
determination of a protein structure by X-ray crystallography
generates large amounts of data, and requires sophisticated
algorithms to determine the structure from that data.
The point is that handling the experimental data of a lab may
require computer programming to make the data readily accessible,
to perform statistical analyses of the data, and to share the
data with colleagues. Many research grants now include provisions
to make the data and results of a project available for public
inspection. This is often accomplished by storing the data in a
database, and providing a Web page and interactive programs to
enable visitors to explore the results.
A very common approach to this task is to use the no-cost open
source Perl language for the programming (using the ever popular
CGI.pm module), perhaps combined with the no-cost Apache Web
server and a no-cost database such as MySQL or PostgreSQL and a
no-cost platform such as Linux or BSD. In other words, apart from
the hardware, all the software is free (and high quality), which
does wonders for the lab budget. For those with Macintosh or
Windows computers, the same approach will also work, as Perl,
Apache, and the databases are also available on those
platforms.
Automating the Automation
One of the most valuable, time-saving programming skills that
a biologist can learn is how to automate the automation. This
means writing programs that run, and collect the output from,
other programs, thereby eliminating the need to run the other
programs in person while sitting at the computer.
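The idea of a program that runs other programs and collects their output can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "external tool" here is a stand-in one-line command that prints the length of a sequence, and the names run_and_capture and length_command are invented for the example. A real script would call an actual analysis program in its place.

```python
import subprocess
import sys

def run_and_capture(command):
    """Run an external command and return its standard output as text."""
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout

def length_command(sequence):
    """Build the command line for a stand-in tool (hypothetical):
    a tiny one-liner that prints the length of its argument."""
    return [sys.executable, "-c", "import sys; print(len(sys.argv[1]))", sequence]

# Run the stand-in tool once per input and collect what it prints.
sequences = ["ACGT", "GGGCCCTTT", "AT"]
outputs = {seq: run_and_capture(length_command(seq)).strip() for seq in sequences}
print(outputs)
```

Passing check=True makes the script stop with an error if the external program fails, which matters when nobody is watching the screen.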
Let's give a simple example of a case in which you have some
large number of items that need to be examined closely by a range
of programs.
Evaluating Many Targets
Say you have two hundred candidate targets for some biological
experiment. The experiment is lengthy and costly, and so you need
to evaluate the targets in order to select the most promising
ones. You have several programs whose combined results allow you
to make a reasonable selection. If running the needed programs,
and examining and prioritizing the results, takes an hour per
target, then you've got about a month's worth of work sitting in
front of a computer screen ahead of you.
So instead, you write a program. This program takes each
target one at a time, then runs the auxiliary programs and
collects and collates the results. After it's done, it
prioritizes the targets according to the criteria you have
programmed, and then it presents you with a "top ten" list of
targets. The program may take a couple of days, or a week, to
write; it takes 30 minutes to run. You've saved yourself the rest
of the month for doing "real" biology, actually performing the
experiments on the targets; and you're on track to get to the
finish line three weeks sooner.
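The collate-and-prioritize step described above might look something like the following Python sketch. The two scoring functions, the equal weighting, and the target names are all hypothetical, chosen only to show the shape of such a program. In a real project the scores would come from the auxiliary programs, and the criteria would be your own.

```python
def gc_fraction(sequence):
    """Fraction of bases in the sequence that are G or C."""
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

def length_score(sequence, ideal_length=20):
    """1.0 at the ideal length, falling off linearly, never below 0."""
    return max(0.0, 1.0 - abs(len(sequence) - ideal_length) / ideal_length)

def rank_targets(targets, top_n=10):
    """Score every target and return the names of the best top_n."""
    scored = []
    for name, sequence in targets.items():
        # Hypothetical criterion: equal weight to GC content and length.
        score = 0.5 * gc_fraction(sequence) + 0.5 * length_score(sequence)
        scored.append((score, name))
    scored.sort(reverse=True)  # highest score first
    return [name for score, name in scored[:top_n]]

# Invented candidate targets, for illustration only.
candidates = {"t1": "GC" * 10, "t2": "AT" * 10, "t3": "GCAT" * 5}
print(rank_targets(candidates, top_n=2))
```

Keeping the scoring criteria in one small function makes it easy to change your mind about priorities and rerun the whole evaluation in minutes.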
Economic Reasons for Learning Bioinformatics Programming
Bioinformatics skills are commanding a premium in the
marketplace, with a lot of the demand coming from the private
sector.
For many biologists now getting their training in graduate
school, or doing their postdocs, it is an unpleasant fact that an
oversupply of trained people, compared to the demand, may result
in a relatively low rate of pay, depending of course on their
area of specialization. Reports in leading journals have decried
the overproduction of biology PhDs relative to the level of
funding for biology in general. For some of these biologists,
bioinformatics skills can significantly enhance their job
prospects and their salaries, because there is a lack of trained
bioinformatics people relative to the demand.
And that fact of supply and demand in the labor market for
biology researchers is the economic reason that biologists want
to learn programming. I've even seen young Ph.D.s in my classes
who, despairing of finding a decent position, are learning
programming with the intention of leaving biology research
altogether. Of course, we could all lobby for an increase in
funding for biology research. (This is the golden age of
biological research, when such funding can be expected to yield
great results.) But the competition for jobs and grants is likely
to remain quite heated for some time to come.
It is not hard to find (say, by searching Google for the words
"bioinformatics salary") lots of reports on the premium
being paid for trained bioinformaticians, and especially for
experienced bioinformaticians. The field is simply growing faster
than the number of qualified individuals who can fill the need
for bioinformatics skills.
Trends and Predictions
The upward trend of the use of computers in biology research
has been going on for several years now. In the time-honored
tradition of futurists everywhere, I predict that the current
trend will continue!
Actually, there are solid reasons to suppose that the
prediction is true. If a simple appeal to authority is
acceptable, then you should note that continuing growth in
bioinformatics has been predicted by many scientific leaders,
commissions, universities, businesses, and granting agencies.
Nor, come to think of it, can I recall anyone making a prediction
of no growth, or decline, in the demand for bioinformatics. I'm
not an economist, so I'll defer on the details of such
predictions to those who are.
If your work is in biology research, then you must decide for
yourself whether learning programming is a practical and
effective way to advance your research. That is the bottom line,
after all. For many bench experimentalists, programming is this
kind of useful and productive research skill.
James Tisdall has worked as a musician, as a programmer
and Member of Technical Staff at Bell Labs (where he programmed
for speech research and discovered a formal language for musical
rhythm), as a programmer and systems manager at the Human Genome
Project in the Computational Biology and Informatics Laboratory
(where he began using Perl for bioinformatics in 1991 with his
program DNA WorkBench), as computational biologist at Mercator
Genetics in Menlo Park, California (where his Perl programs
helped discover the gene involved in the common hereditary
disease hemochromatosis), as manager of bioinformatics at the Fox
Chase Cancer Center in Philadelphia, and most recently as a
consultant for Biocomputing Associates of Kimberton,
Pennsylvania, and the Burke Medical Research Institute affiliated
with Cornell University, working on neurodegenerative diseases
such as Alzheimer's and Parkinson's.
O'Reilly & Associates will soon release (October 2001) Beginning Perl for
Bioinformatics.