2009-06-16 02:58 am (UTC)
When you say "computer scientists", you presumably mean "theoretical computer scientists"? There are also machine learners who do computational biology, and who do not prove NP-Hardness theorems. They count as computer scientists too, you know... :)
There is indeed a gigantic amount of data available in bioinformatics, but some of my experience has been that people who gather this data are very very reluctant to share it with anyone. The culture is to release a dataset only after the group which gathered it has done everything with it that it could... Unless one can figure out a proper incentive system for people to share their data, actually implementing a democratization would be really hard.
yeah, i was generalizing too much there, sorry. i was mainly using it as an example of computer scientists being out of touch, to support the claim that "the two groups need to talk to each other more." i certainly do know computer scientists who are not theorists and do bioinformatics. still, the number of machine learners doing bioinformatics is overwhelmed by the number of geneticists doing it.
re. your latter point, i see what you mean, but i disagree somewhat. i don't think people who release data come anywhere close to doing everything with it that they could -- that would take decades.
besides, my argument was slightly different: there should be an easy way for tinkerers to download, understand and play with data that is already public, even if it doesn't directly lead to new research results.
There is a vibrant community of computational biologists at Stanford, spun off by the CS department: http://robotics.stanford.edu/~serafim/ and others. They have data (actually, tons of it), know python, and understand biology. If you want to get up to speed quickly, I encourage you to talk to them.
that's exciting. thanks for the pointers!
"When computer scientists do study the genome, they end up spending their time on something inane like proving the NP-hardness of some computation on genomic sequences ..."
I don't think that's true - at least over the last decade or so. In fact, when I was taking Tandy Warnow's Intro. To Computational Biology course - which was essentially a theory course - proving NP-hardness of stuff was not good enough even for a course project.
i dunno man, i see papers about this stuff all the time.
while i agree that the claim that every CS person in bioinformatics is doing hardness crap is an exaggeration, the claim that none of them is doing it is just as much of an exaggeration :-)
Of course. I think a bigger percentage don't do it than do.
Yeah, it is just too bad that Python hackers can't document things either. If anyone is going to be insane enough to tackle this multiple-disciplinary meatball you'll probably see those first modules in . . . Perl.
Can you give some info about what you used to learn from? I could be into both doing some hacking (Lisp, probably...) and writing "The Fact of Life"
oh, one or two physical books, a bunch of books through google books (http://books.google.com/books?id=g4hC63UrPbUC is the only one that i really liked -- officemate had a paper copy), everything that wikipedia has to say, and finally, about 50 research papers.
2009-06-17 11:43 am (UTC)
There are already tons of data available
There are already tons of data available -- swissprot, ncbi, biogrid, interact, mips, mint, sgd, hprd. There are also tons of libraries available -- bioperl, biopython, libsvm, etc. This doesn't even mention the tons of databases and online tools that do some serious computation from experimental data.
Even with all this stuff out there, I doubt there are that many hackers that are interested in doing these kinds of experiments. I think it's mostly because there is really not much you can do with the results.
Unlike making a website or writing a game, often the only result of writing a program for bioinformatics is a number or a string of ATCG's. Right now we are still miles away from practical applications for all this data. Typically these algorithms are just used by biologists to help them analyze experimental data or suggest other things to investigate.
But in general, biologists do not trust the stuff that bioinformaticists make. Good prediction algorithms are very hard to find. In protein-protein interaction prediction, there is evidence that recently published algorithms have a real precision of less than 5%. The details of how to prepare the data, how to perform the calculations, and how to interpret the results are very hard to explain.
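To make "precision" concrete: it is the fraction of predicted interactions that turn out to be real. Here is a minimal sketch in Python, assuming predictions and confirmed interactions are both given as unordered protein pairs. The function name and the protein labels are made up for illustration and are not taken from any published algorithm.

```python
def precision(predicted, known):
    """Fraction of predicted interaction pairs that also appear in `known`."""
    # frozenset makes ("A", "B") and ("B", "A") count as the same pair.
    predicted = {frozenset(pair) for pair in predicted}
    known = {frozenset(pair) for pair in known}
    if not predicted:
        return 0.0
    return len(predicted & known) / len(predicted)


# Made-up protein pairs, for illustration only.
predicted_pairs = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
known_pairs = [("B", "A"), ("X", "Y")]

print(precision(predicted_pairs, known_pairs))  # 0.25
```

A precision below 5% would mean that fewer than 1 in 20 predicted pairs is confirmed, which goes some way toward explaining the lack of trust.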
One example is just getting information from the databases. Converting gene names to locus names to accession numbers, with multiple aliases in various databases, is not an easy task. Even if you do combine all the information from the databases correctly, there's no guarantee that there were no mistakes in the original experiment or data entry.
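As a rough illustration of the alias problem, here is a sketch in Python of an index that maps every known name to a canonical ID and refuses to guess when a name is ambiguous. The record layout, the IDs, and the function names are all hypothetical. Real databases each have their own formats and quirks.

```python
from collections import defaultdict


def build_alias_index(records):
    """Map every known name (lowercased) to the set of canonical IDs that use it."""
    index = defaultdict(set)
    for canonical_id, aliases in records.items():
        for name in [canonical_id, *aliases]:
            index[name.strip().lower()].add(canonical_id)
    return index


def resolve(name, index):
    """Return the canonical ID, or None if the name is unknown or ambiguous."""
    hits = index.get(name.strip().lower(), set())
    return next(iter(hits)) if len(hits) == 1 else None


# Hypothetical records: canonical ID -> aliases collected from several databases.
records = {
    "LOCUS_0001": ["geneA", "ACC123"],
    "LOCUS_0002": ["geneB", "ACC456", "geneX"],
    "LOCUS_0003": ["geneC", "geneX"],  # "geneX" is shared, so it is ambiguous
}
index = build_alias_index(records)

print(resolve("GeneA", index))  # LOCUS_0001
print(resolve("geneX", index))  # None
```

Returning None for ambiguous names is a deliberate choice in this sketch: silently picking one of several candidates is exactly the kind of mistake that is hard to spot later.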
I think it would be very hard to be an amateur bioinformaticist. Of course, there is stuff that programmers would be able to provide that would be greatly beneficial to biologists. If people are interested in actual bioinformatics, I would suggest starting with the analysis of microarray datasets. But otherwise, I would suggest going into programs that help the experimental process, such as web 2.0 for collaboration between biologists, cs-people, chemists, etc or even a stackoverflow type system for biologists.
2009-06-17 05:24 pm (UTC)
Re: There are already tons of data available
hi, thanks for the detailed comments. i'll look into the datasets and libraries you mentioned.
"If people are interested in actual bioinformatics, I would suggest starting with the analysis of microarray datasets."
that's what i've been doing, and i'm not sure it requires a whole lotta background, and i think it would be great for more people to get into it.
"But otherwise, I would suggest going into programs that help the experimental process, such as web 2.0 for collaboration between biologists, cs-people, chemists, etc or even a stackoverflow type system for biologists."
that would be great, but i'm not sure what the reward is for whoever builds it. it's too small a niche to make much money off of, and unlike hacking on datasets, i doubt many people would build a collaboration tool for the love of it.
2009-06-17 08:19 pm (UTC)
Re: There are already tons of data available
"too small a niche to make much money off of"
hmmm, that is really getting me thinking... The market is small, but the money isn't. Bio labs spend lots of money on various kinds of experiments, so I'd guess the advertising revenue could be pretty significant.
There is definitely a need. Just look around ncbi, or find the website for some people doing bioinformatics, you should be able to find lots of information and people are pretty willing to share information and programs. But right now things are all over the place, and sometimes a bit overwhelming. It would benefit greatly from an online community.
I just reread my comment, and just to clarify, I didn't mean that this isn't an interesting field, but that it might be more rewarding to collaborate with a laboratory than going about it alone. It is a pretty difficult field that we are just beginning to understand.
2009-06-17 09:58 pm (UTC)
Re: There are already tons of data available
that all makes sense, thanks.
2009-06-26 07:02 am (UTC)
Welcome to the world of the biology :)
The problem is as you have articulated. We, various species of biologists, have no understanding of the computer terms. And the computer people have no idea of what we are talking about. It is like we discovered there is a powerful tool out there and decided to use it. Only we have no clue how to go about it.
BTW, what are you exactly looking at when you say math of meiosis? I do not do cell division studies per se but recently we have discovered that the protein we are interested in has a role in cell division. So we are now immersing ourselves in trying to understand one of the basic problems in cell division- how do chromosomes condense.
my interest is at a slightly different level of abstraction. i need to write code to simulate a breeding population of individuals through a few tens of generations, and track identical-by-descent regions of DNA among individuals that are distantly related (e.g., individuals who share a common ancestor 5 generations ago.)
by the 'math' underlying this process, what i mean is that i need to understand how to randomly sample the chromosome sequence of a haploid daughter chromosome from that of the diploid parent one. if i can build this, then simulating population evolution is easy. i'm learning about the number and distribution of chromosomal crossovers, how to model crossover interference, estimating mutation rates, etc.
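here's a rough sketch in Python of the sampling step, under simplifying assumptions that are mine: crossovers come from a Poisson process at rate 1 per Morgan, with no interference and no mutation, and each haplotype is a list of (start, end, founder) segments so that identical-by-descent regions can be read off later. all names are made up for illustration.

```python
import random


def crossover_points(length_morgans, rng):
    """Crossover positions from a Poisson process (rate 1 per Morgan, no interference)."""
    points = []
    pos = rng.expovariate(1.0)
    while pos < length_morgans:
        points.append(pos)
        pos += rng.expovariate(1.0)
    return points


def slice_segments(haplotype, start, end):
    """Clip a list of (start, end, founder) segments to the interval [start, end)."""
    return [(max(s, start), min(e, end), f)
            for s, e, f in haplotype if s < end and e > start]


def make_gamete(copy_a, copy_b, length_morgans, rng=random):
    """Sample one haploid chromosome from the two copies carried by a diploid parent."""
    # Start on a randomly chosen copy, then switch copies at every crossover.
    current, other = (copy_a, copy_b) if rng.random() < 0.5 else (copy_b, copy_a)
    gamete, prev = [], 0.0
    for point in crossover_points(length_morgans, rng) + [length_morgans]:
        gamete.extend(slice_segments(current, prev, point))
        current, other = other, current
        prev = point
    return gamete


rng = random.Random(42)
copy_a = [(0.0, 1.5, "founder1")]  # one parental copy, 1.5 Morgans long
copy_b = [(0.0, 1.5, "founder2")]  # the other parental copy
child_hap = make_gamete(copy_a, copy_b, 1.5, rng)
print(child_hap)
```

adjacent segments from the same founder are not merged here, and a model with interference would replace the exponential gaps in `crossover_points` with something less memoryless.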
to give you a better idea, here are two typical papers that i've found useful in building the mathematical models of recombination:
* Broman and Weber, 2000: Characterization of human crossover interference
* Myers et al., 2005: A Fine-Scale Map of Recombination Rates and Hotspots Across the Human Genome
Edited at 2009-06-26 07:55 am (UTC)