Spinning Straw into Gold: How to do gold standard data right

LingPipe Blog author：lingpipe

We have been struggling with how to evaluate whether we are finding ALL
the genes in MEDLINE/PubMed abstracts. And we want to do it right.
There has been a fair amount of work on how to evaluate natural
language problems (search via TREC, BioCreative, MEDTAG), but nothing out
there really covers what we consider to be a key problem in text-based
bioinformatics: coverage, or recall, in application to existing databases
such as Entrez Gene.

What is the Rumpelstiltskin tie in?

From the Wikipedia:

“In order to make himself appear more important, a miller lied to
the king that his daughter could spin straw into gold. The king called
for the girl, shut her in a tower room with straw and a spinning wheel,
and demanded that she spin the straw into gold by morning, for three
nights, or be executed.” Much drama ensues, but in the end a fellow
named Rumpelstiltskin saves the day.

The cast breaks down as follows:

The king: The National Institutes of Health (NIH) who really prefer that you deliver on what you promise on grants.
The miller: Our NIH proposal in which we say “We are
committed to making all the facts, or total recall, available to
scientists…” Even worse is that this is from the one paragraph summary
of how we were going to spend $750,000 of the NIH’s money. They will be
asking about this.
The daughter: I (Breck), who sees lots of straw and no easy path to developing adequate gold standard data to evaluate ourselves against.
The straw: 15 million MEDLINE/PubMed abstracts and 500,000
genes that need to be connected in order to produce the gold. Really
just a subset of it.
The gold: A scientifically valid sample of mappings between
genes and abstracts against which we can test our claims of total recall.
This is commonly called “Gold Standard Data.”
Rumpelstiltskin: Bob, and lucky for me I do know his name.

Creating Gold from Straw

Creating linguistic gold standard data is difficult, detail
oriented, frustrating and ultimately some of the most important work
that one can do to take on natural language problems seriously. I was
around when version 1 of the Penn Treebank was created and would chat
with Beatrice Santorini about the difficulties they encountered for
things as simple seeming as part-of-speech tagging. I annotated MUC-6
data for named entities and coreference, did the John Smith corpus of
cross-document coref with Amit Bagga and have done countless customer
projects. All of those efforts gave me insights that I would not have
had otherwise about how language is actually used rather than the
idealized version you get in standard linguistics classes.

The steps for creating a gold standard are:

Define what you are trying to annotate: We started with a
very open-ended “let's see what looks annotatable” attitude for linking
Entrez Gene to MEDLINE/PubMed. By the time we felt we had a
sufficiently robust linguistic phenomenon, we had a standard that mapped
abstracts as a whole to gene entries in Entrez Gene. The relevant
question was: “Does this abstract mention anywhere a literal instance
of the gene?” Gene families were not taken to mention the member genes,
so “the EXT family of genes” would not count, but “EXT1 and EXT2 are
not implicated in autism” would.
Validate that you can get multiple people to do the same annotation:
Bob and I sat down and annotated 20 of the same abstracts independently
and compared our results. We found that we had 36 shared mappings from
gene to abstract, with Bob finding 3 mappings that I did not and me
finding 4 that Bob did not. In terms of recall, I found 92% (36/39) of
what Bob did. Bob found 90% (36/40) of what I found. Pretty good, eh?
Not really, see below.
Annotate enough data to be statistically meaningful: Once we
are convinced we have a reliable phenomenon, then we need to be sure we
have enough examples to minimize chance occurrences.
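
The agreement figures in the validation step follow directly from the three counts reported there. A minimal sketch in Python, where the function name mutual_recall and its parameters are made up for illustration and are not part of LingPipe:

```python
def mutual_recall(shared, only_a, only_b):
    """Return (recall of A measured against B, recall of B measured against A).

    shared: mappings both annotators found
    only_a: mappings only annotator A found
    only_b: mappings only annotator B found
    """
    total_a = shared + only_a  # everything A annotated
    total_b = shared + only_b  # everything B annotated
    return shared / total_b, shared / total_a


# Counts from the 20-abstract pilot: A = Breck, B = Bob.
breck_vs_bob, bob_vs_breck = mutual_recall(shared=36, only_a=4, only_b=3)
print(f"Breck found {breck_vs_bob:.0%} of Bob's mappings")  # 36/39
print(f"Bob found {bob_vs_breck:.0%} of Breck's mappings")  # 36/40
```

Note that each number treats the other annotator as if they were the truth, which is exactly the weakness discussed next.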

The Tricky Bit

I (the daughter) need to stand in front of the king (the NIH) and
say how good our recall is. Better if the number is close to 100%
recall. But what is 100% recall?

Even a corpus annotation with an outrageously high 90% interannotator
agreement leads to problems:

A marketing problem: Even if we hit 99.99% recall on the corpus, we don’t know what’s up with the 5% error.
We can report 99.99% recall against the corpus, but not against the truth.
After being total rock stars and modeling Bob at 99.99%, we would be
showing a slide that says we can only claim recall of 85-95% on the data.
So I can throw out the 99.99% number and introduce a salad of footnotes
and diagrams. I see congressional investigations in my future.
A scientific problem: It bugs me that I don’t have a handle
on what truth looks like. We really do think recall is the key to text
bioinformatics and that text bioinformatics is the key to curing lots
of diseases.

Rumpelstiltskin Saves the Day

So, here we are in our hip Brooklyn office space, sun setting
beautifully over the Williamsburg Bridge, and Bob and I are sitting around
with a lot of straw. It is getting tense as I imagine the king’s
reaction to the “standard approach” of working from an interannotator
agreement validated data set. Phrases like “cannot be done in a
scientifically robust way”, “we should just do what everyone else does”
and “maybe we should focus on precision” were bandied about with
increasing panic. But the next morning Rumpelstiltskin walked in with
the gold. And it goes like this:

The problem is in estimating what truth is given somewhat unreliable
annotators. After adjudication (we both looked at where we differed and
decided what the real errors were), we figured that each of us would miss
5% (1/20) of the abstract-to-gene mappings. If we take the union of our
annotations, we end up with 0.25% missed mentions (1/400) by multiplying
our recall errors (1/20 * 1/20). This assumes independence of errors, a
big assumption.
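
The arithmetic is easy to check and to extend to more annotators. A small sketch in Python, with the helper name union_miss_rate invented for this post, and with the independence assumption baked in (if annotators tend to miss the same hard cases, the real miss rate will be higher):

```python
from fractions import Fraction


def union_miss_rate(miss_rates):
    """Probability that every annotator misses a given mapping,
    ASSUMING the annotators' misses are independent."""
    rate = Fraction(1)
    for r in miss_rates:
        rate *= Fraction(r)
    return rate


# Bob and Breck, each estimated to miss 1 in 20 mappings.
two = union_miss_rate([Fraction(1, 20), Fraction(1, 20)])
print(two, float(two))  # 1/400 0.0025, i.e. 0.25% missed
print(float(1 - two))   # 0.9975 recall for the union
```

Under the same assumption, a third annotator with the same miss rate would bring the union down to 1/8000 missed.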

Now we have a much better upper limit that is in the 99% range, and
more importantly, a perspective on how to accumulate a recall gold
standard. Basically we should take annotations from all remotely
qualified annotators and not worry about it. We know that is going to
push down our precision (accuracy) but we are not in that business
anyway.
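
Pooling annotators this way amounts to taking a set union and scoring the system against it. A rough sketch in Python, assuming each annotation is an (abstract id, gene id) pair. The function names and the toy identifiers below are invented for illustration and are not real data:

```python
def pooled_gold(*annotations):
    """Union of (abstract_id, gene_id) mappings from all annotators."""
    gold = set()
    for mappings in annotations:
        gold |= set(mappings)
    return gold


def recall(system, gold):
    """Fraction of gold mappings that the system also produced."""
    return len(set(system) & gold) / len(gold) if gold else 1.0


# Toy, made-up mappings purely for illustration.
bob = {("pmid1", "EXT1"), ("pmid1", "EXT2")}
breck = {("pmid1", "EXT1"), ("pmid2", "TP53")}
system = {("pmid1", "EXT1"), ("pmid2", "TP53"), ("pmid3", "BRCA1")}

gold = pooled_gold(bob, breck)
print(len(gold))             # 3 distinct mappings in the pool
print(recall(system, gold))  # 2 of the 3 pooled mappings were recovered
```

The extra system mapping that no annotator produced does not affect recall at all. It only shows up if you measure precision, which is the trade-off described above.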

An apology

At ISMB in Detroit, I stood up and criticized the BioCreative/GENETAG folks for adopting a crazy-seeming annotation standard
that went something like this: “Annotate all the gene mentions in this
data. Don’t get too worried about the phrase boundaries, but make sure
the mention is specific.” I now see that approach as a sane way to
increase recall. I see the error of my ways and feel much better since
we have demonstrated 99.99% recall against gene mentions for that
task. Note that this is a different but related task from linking Entrez
Gene IDs to text abstracts. And thanks to the BioCreative folks for all the
hard work pulling together those annotations and running the bakeoffs.
