Reaching Dubious Parity with Hamstrung Humans

Jeffrey P. Bigham



We are living in an era of AI progress and AI hype.

I’ve now seen three different industry leaders in speech recognition technology claim to have reached “human parity” with their speech recognition systems (Microsoft, Google and IBM). Such a claim was widely spread by Microsoft in 2016, resulting in at least one arXiv paper and numerous popular press articles with titles like, “Achieving Human Parity in Conversational Speech Recognition.” The other companies appear to have then put out their own press releases claiming to have matched this ill-defined achievement.

Speech recognition has gotten a lot better, but, even in domains where it works shockingly well, it’s prone to nonsensical errors. Speech might be faster than typing on my phone, but its errors are often ridiculous. In harder domains, like real-time captioning for classrooms where the vocabulary is expansive, filled with jargon, and the timing constraint is less than 5 seconds, accuracy rates are abysmal.

So, what is going on?

One thing I like about the Microsoft “human parity” paper is that it (briefly) describes the methodology used to measure human performance. You can read the details, but basically, they took a commonly used speech dataset (the NIST 2000 Switchboard corpus) and sent it off to a professional human transcription service. They then calculated Word Error Rate (WER), and showed it was just a teensy bit higher than what their automated system produced.

Sounds reasonable, except for three things:

1)  It is biased toward making it easier for the machine

This is a dataset the speech community has worked on for decades. If they’re going to get anywhere close to human parity, it’ll be on this dataset. They assume that the speech recognizer has all the words in its vocabulary (out-of-vocabulary, or OOV, words are a big problem in practice), and importantly it does not need to worry about the great number of words that never appear in the dataset. It sounds as though, at least for the Switchboard part of the dataset, some of the same speakers are included in both training and testing. Their methods, developed in part on this dataset over a very long time, are likely overfit to it.

2)  The speech is decontextualized

The dataset is of conversations between people. They mention in the paper how they split up the utterances and sent them separately, preventing humans from learning from context… only fair because that’s what the machine does. Except, that’s not fair at all! The machine doesn’t use context because it can’t. It lacks common sense and the ability to quickly recognize, build, and utilize context. They’ve made it harder for humans by making the task less natural.

“The transcribers were given the same audio segments as were provided to the speech recognition system, which results in short sentences or sentence fragments from a single channel. This makes the task easier since the speakers are more clearly separated, and more difficult since the two sides of the conversation are not interleaved. Thus, it is the same condition as reported for our automated systems.”

All the more baffling is that the claims going around about “human parity” keep saying that it refers to “conversational speech,” even though they explicitly made it non-conversational.

3)  Word Error Rate (WER)

WER is the word-level edit distance between the system’s output and the ground-truth transcript, divided by the number of words in the ground truth. This seems like a reasonable metric, and it’s used almost exclusively in speech research. Except, not all errors are equal, and this matters a lot if you’re going to make a bold claim like a system reaching parity with humans. They don’t give the errors made by the automatic speech recognition (ASR) system and the human transcribers in their paper (it would be too lengthy to include, probably), but it’s pretty safe to say that the errors the humans made were much more reasonable. In our own past experience, we’ve seen human transcribers switch “someone” for “somebody,” whereas ASR systems do things like swap “someone” for “dumb pun.” Yet WER just counts word errors, with no regard for how badly each one mangles the meaning.
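
To make that concrete, here is a minimal sketch of a WER calculation in Python. It is not the scoring tool used in the paper; it assumes plain whitespace tokenization, and the function name and the example sentences are made up for illustration. It shows a harmless substitution and a nonsensical one landing on exactly the same score.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative sketch only: tokenizes on whitespace and does no text
    normalization, unlike real scoring tools.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, invented for this example.
reference = "i saw someone at the door"
human_like = "i saw somebody at the door"  # meaning preserved
asr_like = "i saw sunburn at the door"     # meaning lost

print(f"human-like error: {word_error_rate(reference, human_like):.3f}")
print(f"ASR-like error:   {word_error_rate(reference, asr_like):.3f}")
```

Both hypotheses contain one substituted word out of six, so the metric cannot tell them apart, even though only one of them would confuse a reader.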

Remember when Watson did really well at Jeopardy, but when it made a mistake it was hilariously bad?

Overall, I get why it’s so tempting to make these bold claims. We’re in an exciting time in AI research. It’s getting better. And, it’s natural to want your technology to be at the top. It’s also likely extremely valuable to have your technology at the top. There’s a big incentive to exaggerate.

But, I don’t think we need to be overselling what AI can do.

What it can really do is quite amazing!

To make progress we need to clearly see what AI can do now, what it is likely to be able to do in the near and medium terms, and what complementary technologies and human interfaces will be needed to make it actually useful.

This page and contents are copyright Jeffrey P. Bigham except where noted.
Blog posts are not intended to be final products, but rather a reflection of current thinking and/or catalysts for discussion, like tweets but longer.