This is a rant that's probably long overdue. Well, overdue if you're dealing with biologically derived data, anyhow. For those of you who aren't bioinformaticians, or have never dealt with genes and proteins, you might want to tune out about now... or now.... or now.
Anyhow, let me ask you a question. What's the very first thing you learned in your very first high school chemistry course? Was it solubility? Maybe it was ideal gas laws? Or was it stoichiometry? (Every chemist's favourite word.) Actually, I'm willing to put my money on it being nomenclature. There are a whole bunch of elements (118 or so on enthusiastic periodic tables), and you need to know how to talk about them, what to call them when they start hanging out together, interacting, and getting into all sorts of trouble. If you stretch your mind back you might even remember hearing something about IUPAC, who set the rules for how these names apply, and for that matter, all of the terminology around the interactions.
Unfortunately, nowhere in any of my bioinformatics training has anyone ever sat down with me and explained the rules of genomics, proteomics or metabolomics nomenclature, and there's a good reason why: they don't exist.
Despite their myriad ways of interacting, most elements only react to form very simple compounds, although there are a few exceptions. The biggest exception is organic chemistry, which, in fact, later on sat down and built its own nomenclature system. Systems within systems. When biochemists started pushing the bounds of the organic chemistry IUPAC system, they realized they needed new names. Unfortunately, IUPAC never really jumped on this ball. We have fancy names like ornithine, which is really (2R)-2,5-diaminopentanoic acid, or even better, Nicotinamide D-ribonucleotide, which is really 1-[(2R,3R,4S,5S)-3,4-dihydroxy-5-[(hydroxy-oxido-phosphoryl)oxymethyl]tetrahydrofuran-2-yl]pyridine-5-carboxamide.
Ok, so you can't blame the biochemists for coming up with shorthand names. They really didn't have a choice: traditional IUPAC names are too long, and no one would ever finish their thesis writing things that way. But why didn't IUPAC come back into the fray?
Now, I have to push one step further. In genomics, we have to deal with a person's entire genome. For many organisms, a gene can also be expressed in different forms, depending on how a cell interprets the DNA, leading to isoforms, or variants. Of course, even these variants can be further processed by splicing (hacking and reassembling the sequence), leading to further ambiguity. And then, they get translated into a protein. Confusing, you say? Wait till you get to the next part.
Along the way, everyone has been annotating genes differently. Some people have annotated proteins instead. Other people annotate protein fragments, which are later discovered to really be parts of other, previously annotated proteins, leading to multiple names. And the best is yet to come: each database of these annotations stores the names in a different system, leading to wonderful complexities, where you have what amounts to 20 different names for a single gene.
Bioinformaticians spend an inordinate amount of time using tools, many of which are written to use a single nomenclature. Some tools use names like ENSG00000139618; others use the equivalent of "BRCA2". Actually, why don't I just list a few more for this same gene/protein:
"BRCA1/BRCA2-containing complex, subunit 2"
I'm sure I've missed a few. Oh, I found another: "Fanconi anemia, complementation group D1".
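To make the pain concrete, here is a minimal sketch of the kind of glue code this forces on us: a hand-built alias table that maps every name to one label. The table contains only the names mentioned above; picking "BRCA2" as the canonical label is my own arbitrary choice, and the `normalize` helper is hypothetical, not part of any real tool or database API.

```python
# Hypothetical alias table: every name listed above for the same gene/protein.
# "BRCA2" as the canonical label is an arbitrary choice for this sketch.
CANONICAL = "BRCA2"
ALIASES = [
    "ENSG00000139618",
    "BRCA2",
    "BRCA1/BRCA2-containing complex, subunit 2",
    "Fanconi anemia, complementation group D1",
]


def _key(name):
    """Collapse runs of whitespace and ignore case so near-identical spellings match."""
    return " ".join(name.split()).casefold()


# Lookup table from cleaned-up alias to the canonical label.
ALIAS_TO_CANONICAL = {_key(alias): CANONICAL for alias in ALIASES}


def normalize(name):
    """Return the canonical label for a known alias, or None if the name is unknown."""
    return ALIAS_TO_CANONICAL.get(_key(name))


if __name__ == "__main__":
    for name in ["ensg00000139618", "Fanconi anemia, complementation group D1", "TP53"]:
        print(name, "->", normalize(name))
```

Now multiply that by every gene, every database and every tool, and you can see where the time goes.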
So, let me ask. Why haven't bioinformaticians, genome and proteome experts and other interested parties all sat down to come up with a single naming scheme? I have no idea, but it's long overdue!