Bioinformatics needs to be democratized - Arvind Narayanan's journal

Bioinformatics needs to be democratized [Jun. 15th, 2009|12:39 pm]
Arvind Narayanan

I've started working with genome data, so I've been giving myself a crash course in genetics, especially the human genome, over the past week or two. I'm almost up to speed — I can now understand paragraphs like this without needing to look anything up:
We identified a 47 Kb interval containing an Alu insertion polymorphism (DXS225) and four microsatellites in complete linkage disequilibrium in a low recombination rate region of the long arm of the human X chromosome. This haplotype block was studied in 667 males from the HGDP-CEPH Human Genome Diversity Panel.
Just two weeks ago I would have been completely lost. And that isn't your basic Mendelian genetics either; it's a research paper from 2007.

Learning a new set of concepts over a short period of time by immersing yourself in it is an intense experience, and one that I thoroughly enjoy. That's why privacy research has been so rewarding for me — it has given me a chance to read tons of papers in law, economics, sociology and now genetics, not to mention many, many areas of computer science.

My main aim was to understand the math behind meiosis. While it isn't very hard, it is still an area of active research, and our knowledge of it is incomplete. Consequently, there doesn't seem to be a way to get to it without cutting through layers of biology. Even the really basic stuff, like chromosomal crossover, is generally explained in a tedious way using observations of traits in plants and fruit flies. This is because of an accident of scientific history: until DNA sequencing became a reality, the only way to learn what happens during meiosis was to observe the results when animals mate and make inferences about the genome based on that. It's like the story of the blind men and the elephant.

More generally, bioinformatics seems to be populated by people whose background is primarily in genetics rather than in computer science. When computer scientists do study the genome, they end up spending their time on something inane like proving the NP-hardness of some computation on genomic sequences, apparently oblivious to the fact that human DNA does not consist of worst-case strings constructed by malicious adversaries! Even on basic information-theoretic questions such as the amount of entropy in the human genome, the best I could find is a marginally related, speculative blog post.

Someone needs to write "The Facts of Life, for Computer Scientists." I'm confident that any competent Python hacker with a solid knowledge of algorithms and statistics could read that document, learn the basics in a few hours, download some data from HapMap, install a library of useful tools with a single apt-get command, and start producing useful code and generating interesting hypotheses, all in the course of an evening!

Bioinformatics needs to be democratized. There is a gigantic amount of data available, but the people who are producing the data aren't necessarily the ones best equipped to play with it. On the other hand, there is a huge community of hackers who would like to do just that, but don't realize how easy it is. If you could get these two groups to talk to each other, amazing things could happen.

From: (Anonymous)
2009-06-16 02:58 am (UTC)
When you say "computer scientists", you presumably mean "theoretical computer scientists"? There are also machine learners who do computational biology, and who do not prove NP-Hardness theorems. They count as computer scientists too, you know... :)

There is indeed a gigantic amount of data available in bioinformatics, but my experience has been that people who gather this data are very, very reluctant to share it with anyone. The culture is to release a dataset only after the group which gathered it has done everything with it that it could. Unless one can figure out a proper incentive system for people to share their data, actually implementing a democratization would be really hard.
From: arvindn
2009-06-16 03:16 am (UTC)
yeah, i was generalizing too much there, sorry. i was mainly using it as an example of computer scientists being out of touch, to support the claim that "the two groups need to talk to each other more." i certainly do know computer scientists who are not theorists and do bioinformatics. still, the number of machine learners doing bioinformatics is overwhelmed by the number of geneticists doing it.

re. your latter point, i see what you mean, but i disagree somewhat. i don't think people who release data come anywhere close to doing everything with it that they could -- that would take decades.

besides, my argument was slightly different: there should be an easy way for tinkerers to download, understand and play with data that is already public, even if it doesn't directly lead to new research results.
From: iliada
2009-06-16 05:25 am (UTC)
There is a vibrant community of computational biologists at Stanford, spun off by the CS department: http://robotics.stanford.edu/~serafim/, http://bejerano.stanford.edu and others. They have data (actually, tons of it), know Python, and understand biology. If you want to get up to speed quickly, I encourage you to talk to them.
From: arvindn
2009-06-16 05:32 am (UTC)
that's exciting. thanks for the pointers!
From: emeritusl
2009-06-16 02:57 pm (UTC)
"When computer scientists do study the genome, they end up spending their time on something inane like proving the NP-hardness of some computation on genomic sequences ..."

I don't think that's true - at least over the last decade or so. In fact, when I was taking Tandy Warnow's Intro. To Computational Biology course - which was essentially a theory course - proving NP-hardness of stuff was not good enough even for a course project.
From: arvindn
2009-06-16 04:22 pm (UTC)
i dunno man, i see papers about this stuff all the time.

while i agree that the claim that every CS person in bioinformatics is doing hardness crap is an exaggeration, the claim that none of them is doing it is just as much of an exaggeration :-)
From: emeritusl
2009-06-16 04:26 pm (UTC)
Of course. I think a bigger percentage don't do it than do.
From: dannyman
2009-06-16 05:21 pm (UTC)
Yeah, it is just too bad that Python hackers can't document things either. If anyone is going to be insane enough to tackle this multiple-disciplinary meatball you'll probably see those first modules in . . . Perl.
From: patrickwonders
2009-06-17 02:58 am (UTC)
Can you give some info about what you used to learn from? I could be into both doing some hacking (Lisp, probably...) and writing "The Facts of Life."
From: arvindn
2009-06-17 03:45 am (UTC)
oh, one or two physical books, a bunch of books through google books (http://books.google.com/books?id=g4hC63UrPbUC is the only one that i really liked -- officemate had a paper copy), everything that wikipedia has to say, and finally, about 50 research papers.
From: ext_194364
2009-06-17 11:43 am (UTC)

There are already tons of data available

There are already tons of data available: Swiss-Prot, NCBI, BioGRID, IntAct, MIPS, MINT, SGD, HPRD. There are also tons of libraries available: BioPerl, Biopython, LIBSVM, etc. And that doesn't even count the many databases and online tools that do some serious computation from experimental data.

Even with all this stuff out there, I doubt there are that many hackers who are interested in doing these kinds of experiments. I think it's mostly because there is really not much you can do with the results.

Unlike making a website or writing a game, often the only result of writing a bioinformatics program is a number or a string of ATCG's. Right now we are still miles away from practical applications for all this data. Typically these algorithms are just used by biologists to help them analyze experimental data or suggest other things to investigate.

But in general, biologists do not trust the stuff that bioinformaticists make. Good prediction algorithms are very hard to find. In protein-protein interaction prediction, there is evidence that recently published algorithms have a real precision of less than 5%. There are details for how to prepare the data, how to perform the calculations, how to interpret the results that are very hard to explain.

One example is just getting information from the databases. Converting gene names to locus names to accession numbers, with multiple aliases in various databases, is not an easy task. Even if you do combine all the information from the databases correctly, there's no guarantee that there were no mistakes in the original experiment or data entry.

I think it would be very hard to be an amateur bioinformaticist. Of course, there is stuff that programmers would be able to provide that would be greatly beneficial to biologists. If people are interested in actual bioinformatics, I would suggest starting with the analysis of microarray datasets. But otherwise, I would suggest going into programs that help the experimental process, such as web 2.0 for collaboration between biologists, cs-people, chemists, etc or even a stackoverflow type system for biologists.
From: arvindn
2009-06-17 05:24 pm (UTC)

Re: There are already tons of data available

hi, thanks for the detailed comments. i'll look into the datasets and libraries you mentioned.

"If people are interested in actual bioinformatics, I would suggest starting with the analysis of microarray datasets."

that's what i've been doing, and i'm not sure it requires a whole lotta background, and i think it would be great for more people to get into it.

"But otherwise, I would suggest going into programs that help the experimental process, such as web 2.0 for collaboration between biologists, cs-people, chemists, etc or even a stackoverflow type system for biologists."

that would be great, but i'm not sure what the reward is for whoever builds it. it's too small a niche to make much money off of, and unlike hacking on datasets, i doubt many people would build a collaboration tool for the love of it.
From: ext_194364
2009-06-17 08:19 pm (UTC)

Re: There are already tons of data available

"too small a niche to make much money off of"

hmmm, that is really getting me thinking... The market is small, but the money isn't. Bio labs spend lots of money on various kinds of experiments, so I'd guess the advertising revenue could be pretty significant.

There is definitely a need. Just look around ncbi, or find the website for some people doing bioinformatics, you should be able to find lots of information and people are pretty willing to share information and programs. But right now things are all over the place, and sometimes a bit overwhelming. It would benefit greatly from an online community.

I just reread my comment, and just to clarify, I didn't mean that this isn't an interesting field, but that it might be more rewarding to collaborate with a laboratory than going about it alone. It is a pretty difficult field that we are just beginning to understand.
From: arvindn
2009-06-17 09:58 pm (UTC)

Re: There are already tons of data available

that all makes sense, thanks.
From: (Anonymous)
2009-06-26 07:02 am (UTC)
Hi Arvind:

Welcome to the world of the biology :)

The problem is as you have articulated. We, various species of biologists, have no understanding of the computer terms. And the computer people have no idea of what we are talking about. It is like we discovered there is a powerful tool out there and decided to use it. Only we have no clue how to go about it.

BTW, what are you exactly looking at when you say math of meiosis? I do not do cell division studies per se but recently we have discovered that the protein we are interested in has a role in cell division. So we are now immersing ourselves in trying to understand one of the basic problems in cell division- how do chromosomes condense.


From: arvindn
2009-06-26 07:48 am (UTC)
hi rohini,

my interest is at a slightly different level of abstraction. i need to write code to simulate a breeding population of individuals through a few tens of generations, and track identical-by-descent regions of DNA among individuals that are distantly related (e.g., individuals who share a common ancestor 5 generations ago.)

by the 'math' underlying this process, what i mean is that i need to understand how to randomly sample the chromosome sequence of a haploid daughter chromosome from that of the diploid parent one. if i can build this, then simulating population evolution is easy. i'm learning about the number and distribution of chromosomal crossovers, how to model crossover interference, estimating mutation rates, etc.
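a minimal sketch of that sampling step, under loudly stated assumptions: crossovers fall as a Poisson process along the genetic map (Haldane's model, i.e. no crossover interference, which the Broman and Weber paper shows is not quite true for humans), positions are in Morgans, and mutation is ignored. every name here (sample_gamete, shared_length, the founder labels) is made up for illustration and is not the code i'm actually writing. each haplotype is a list of (start, end, label) segments, where the label says which founder haplotype that stretch descends from, so identical-by-descent sharing is just overlap between same-label segments.

```python
import random

def poisson(rng, mean):
    """Poisson sample: count unit-rate exponential arrivals before `mean`."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count

def copy_interval(hap, lo, hi):
    """Return the pieces of haplotype `hap` that fall inside [lo, hi)."""
    out = []
    for start, end, label in hap:
        s, e = max(start, lo), min(end, hi)
        if s < e:
            out.append((s, e, label))
    return out

def merge(hap):
    """Join touching segments that carry the same founder label."""
    merged = []
    for s, e, label in hap:
        if merged and merged[-1][2] == label and merged[-1][1] == s:
            merged[-1] = (merged[-1][0], e, label)
        else:
            merged.append((s, e, label))
    return merged

def sample_gamete(hap_a, hap_b, length, rng):
    """Sample one haploid chromosome from a diploid parent (hap_a, hap_b).

    Assumption: the number of crossovers is Poisson(length in Morgans) and
    their positions are uniform, i.e. no interference.
    """
    cuts = sorted(rng.uniform(0.0, length) for _ in range(poisson(rng, length)))
    bounds = [0.0] + cuts + [length]
    parents = [hap_a, hap_b]
    current = rng.randrange(2)  # which parental strand we start copying from
    gamete = []
    for lo, hi in zip(bounds, bounds[1:]):
        gamete.extend(copy_interval(parents[current], lo, hi))
        current = 1 - current  # switch strands at every crossover
    return merge(gamete)

def shared_length(hap_x, hap_y):
    """Total length over which two haplotypes carry the same founder label."""
    total = 0.0
    for s1, e1, l1 in hap_x:
        for s2, e2, l2 in hap_y:
            if l1 == l2:
                total += max(0.0, min(e1, e2) - max(s1, s2))
    return total

# Example: one child from two unrelated founders on a 1.5 Morgan chromosome.
rng = random.Random(0)
L = 1.5
mother = ([(0.0, L, "m1")], [(0.0, L, "m2")])
father = ([(0.0, L, "f1")], [(0.0, L, "f2")])
child = (sample_gamete(*mother, L, rng), sample_gamete(*father, L, rng))
```

repeating the last line for every mating in every generation gives the population simulation, and shared_length between haplotypes of two distant relatives gives their IBD sharing. swapping the Poisson draw for a model with interference would only change how `cuts` is generated.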

to give you a better idea, here are two typical papers that i've found useful in building the mathematical models of recombination:
* Broman and Weber, 2000: Characterization of human crossover interference
* Myers et al., 2005: A Fine-Scale Map of Recombination Rates and Hotspots Across the Human Genome

Edited at 2009-06-26 07:55 am (UTC)