Word Error Rates

By Bryce Summers. Started writing on 9.8.2025. Published on 10.8.2025.

AI Transcription

Personal historians and other professionals who regularly work with recordings need the audio transcribed into an accurate written form. Once a recording has been transcribed, the text can be keyword-searched to find timestamped content, direct quotes can be copied and pasted into books and articles, and the text can serve as a starting point for telling a story in someone's own words. While transcripts can be very beneficial, the cost of transcription can be prohibitive.

Transcribing audio is a lengthy, laborious process, but an expert transcriber can write transcripts that mirror the original recordings with the utmost accuracy. A professional who is confident in the transcriber's ability can begin working with the transcript almost immediately, with minimal extra labor but at the highest monetary cost. Those who want a dramatic reduction in monetary cost and are willing to spend some of their own time correcting inaccuracies may find that algorithmic ("AI") transcription is the best solution to their transcription needs.

AI transcription is the process of transforming audio data into text data with a computer program. A transcription program can be run on any computer. It is most private to run transcription on a local computer that the professional trusts. It is most cost-effective to run transcription on a remote computer at an unspecified location somewhere in our world (a computer "in the cloud"), usually operated by a large company like Amazon or Google.

Choosing a Program

There are many options when it comes to choosing a transcription program. They all have the exact same goal: to accurately transcribe, in order, all spoken words from a recording into a text document. That said, they don't produce the same outputs. Each transcription program makes its own set of transcription errors. Since the professional will need to locate and correct each of these errors, the quantity of errors relative to the length of the transcript is a good measure of how much labor a transcription program demands of its end user. This quantity is represented by the Word Error Rate (WER), which is the percentage of words in a transcript that are superfluous, incorrect, or missing. The Word Accuracy (WAcc) rating is a more intuitive quantity, which gives the percentage of words that are correct.

A perfect transcript has a WER of 0% and a WAcc of 100%. When deciding between programs, you should always choose the program with the lowest WER / highest WAcc score that fits your budget.

How to measure WER and WAcc

You can measure WER yourself! First pick out an audio clip and transcribe it with the program you wish to test. Go through the transcript and correct it. Keep a tally of every time you substitute a word (such as correcting a misspelling or a word that was mistranscribed), insert a word into the transcript, or delete a word from the transcript.

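Let S be your substitution count, I your insertion count, D your deletion count, and C the number of words that were already correct. Note that I and D count your own corrections, so S + D + C is the number of words in the program's output; the standard definition[1] counts the insertions and deletions made by the program instead, which swaps the roles of I and D. Word Error Rate can be calculated[1] as follows: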

\(\displaystyle WER = \frac{S+I+D}{S+D+C} \)

WAcc can then be computed easily:

\(W_{Acc} = 1 - WER\)
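The two formulas above translate directly into a few lines of Python. This is only a minimal sketch: the function names `word_error_rate` and `word_accuracy` are my own, and the tallies in the usage lines are hypothetical numbers chosen for illustration, not measurements of any real program.

```python
def word_error_rate(s, i, d, c):
    """WER as defined in this article: (S + I + D) / (S + D + C)."""
    denominator = s + d + c
    if denominator == 0:
        raise ValueError("S + D + C must be greater than zero")
    return (s + i + d) / denominator


def word_accuracy(s, i, d, c):
    """WAcc = 1 - WER."""
    return 1.0 - word_error_rate(s, i, d, c)


# Hypothetical tallies from correcting a short clip.
print(f"WER  = {word_error_rate(s=3, i=1, d=2, c=45):.1%}")  # WER  = 12.0%
print(f"WAcc = {word_accuracy(s=3, i=1, d=2, c=45):.1%}")    # WAcc = 88.0%
```

Note that WER can exceed 100% (and WAcc can go negative) when a program produces more errors than the denominator counts words.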

Automatic Transcript Correction

While you can compute WER and WAcc by correcting a transcript yourself, you may get frustrated if you need to repeat this process to test many programs. Once you have a corrected transcript that you believe is free of errors (you may wish to have a second pair of eyes inspect it to be more confident!), you can call it the 'reference' transcript for your audio file. Given the output of any transcription program for that audio file, the reference can then be used to automatically compute a sequence of corrections and the WER. This means that you will only need to do corrections once, and thereafter you will be able to efficiently measure the relative effectiveness of any number of other programs.

You can find programs for automatic WER calculation at many transcription sites, although I could imagine them using an algorithm that favors their own service or whose method is hard to interpret. When I went about trying to understand and compare various services, I decided to write my own program based on the sequence alignment problem[2, 3]. The goal is to line up the output and the reference word sequences such that it takes the fewest substitutions, insertions, and deletions to transform one into the other. Often there are many ways to apply corrections, but these algorithms assume the professional will perform corrections in the optimal manner.
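To make the alignment idea concrete, here is a minimal sketch of the textbook dynamic-programming approach to word-level edit distance[3]. It is not the code of my program linked at the end of this article; the names `tokenize` and `count_edits` are hypothetical, and the choice to ignore case and punctuation is an assumption of this sketch. Insertions and deletions are counted from the corrector's point of view, matching the tallies described earlier.

```python
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def count_edits(output_words, reference_words):
    """Return (S, I, D, C) for one cheapest correction of output into reference.

    I counts words you insert into the output; D counts words you delete from it.
    """
    n, m = len(output_words), len(reference_words)
    # cost[i][j] = fewest edits that turn output[:i] into reference[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # delete every output word
    for j in range(1, m + 1):
        cost[0][j] = j  # insert every reference word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = output_words[i - 1] == reference_words[j - 1]
            cost[i][j] = min(
                cost[i - 1][j - 1] + (0 if same else 1),  # keep or substitute
                cost[i - 1][j] + 1,                       # delete an output word
                cost[i][j - 1] + 1,                       # insert a reference word
            )
    # Walk back from the far corner and count each kind of edit.
    s = i_count = d_count = c = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = output_words[i - 1] == reference_words[j - 1]
            if cost[i][j] == cost[i - 1][j - 1] + (0 if same else 1):
                if same:
                    c += 1
                else:
                    s += 1
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            d_count += 1
            i -= 1
        else:
            i_count += 1
            j -= 1
    return s, i_count, d_count, c


output = tokenize("How mulch wood could a a woodchuck chuck")
reference = tokenize("How much wood could a woodchuck chuck")
print(count_edits(output, reference))  # (1, 0, 1, 6)
```

When several corrections are equally cheap, the split between substitutions and insertion/deletion pairs can differ from one optimal alignment to another, but the total S + I + D is always the minimum.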

Practical Example

Let's assume that you have a list of transcription programs under consideration. Let's call them program A, program B, program C, etc. You want to know which of these programs is the most promising for your application.

You choose an audio clip that is long enough to be representative of all of the audio that you need transcribed, but short enough that you can afford to transcribe it using all of the programs.

Run the audio clip through program A. For this example, program A produces the transcript written below:

Output of program A (To be corrected)

How mulch would could a wood chuck chuck
If a a woodchuck would could chuck wood?
wood as a would chuck wood chuck,
only a could woodchuck chuck wood.

Edit Sequence

Next, you must correct the output. Strive to eliminate all mistakes in as few edits as possible, as the corrected text will be the yardstick against which all output will be judged. It's okay if this takes a while, since you won't need to manually correct any of the other outputs.

Here is an illustration of how to transform (correct) the transcript above into the reference below in as few edits as possible. A substitution is written [old -> new], an insertion [+word], and a deletion [-word].

How [mulch -> much] [would -> wood] could a [wood -> woodchuck] chuck [-chuck]
If a [-a] woodchuck [-would] could chuck wood
[+As] [+much] wood as a [-would] [chuck -> woodchuck] [wood -> could] chuck
[only -> If] a [could -> woodchuck] [woodchuck -> could] chuck wood

Reference Text[4] (Correct)

How much wood could a woodchuck chuck
If a woodchuck could chuck wood?
As much wood as a woodchuck could chuck,
If a woodchuck could chuck wood.

WER Calculation (Program A)

You can calculate WER and WAcc for program A directly from an edit sequence.

Above, the edit sequence contains 17 correct words, 8 substitutions, 2 insertions, and 4 deletions. You can plug these numbers into the WER formula:

\(\displaystyle WER = \frac{S+I+D}{S+D+C} = \frac{8+2+4}{8+4+17} = \frac{14}{29} = 48.3\% \)

The Word Accuracy rate is:

\(W_{Acc} = 1 - WER = 1 - .483 = 51.7\%\)

Based on these results, you can expect program A to make a mistake in about 48 out of every 100 words that it processes. Put another way, you can have confidence in only about 52 out of every 100 words that program A produces. This is not very good, since you will need to redo about half of the work yourself and may not find the mediocre assistance of program A worthwhile.
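The tallies for program A can be sanity-checked in a few lines of Python. In this sketch (the variable names are mine, chosen for illustration), every word of the output is either kept, substituted, or deleted, so C + S + D should equal the output's word count. Likewise, every word of the reference was either already correct, substituted in, or inserted, so C + S + I should equal the reference's word count.

```python
output_a = """How mulch would could a wood chuck chuck
If a a woodchuck would could chuck wood?
wood as a would chuck wood chuck,
only a could woodchuck chuck wood."""

reference = """How much wood could a woodchuck chuck
If a woodchuck could chuck wood?
As much wood as a woodchuck could chuck,
If a woodchuck could chuck wood."""

# Tallies for program A, taken from the edit sequence in this article.
s, i, d, c = 8, 2, 4, 17

output_length = len(output_a.split())      # 29 words
reference_length = len(reference.split())  # 27 words

assert c + s + d == output_length      # kept + substituted + deleted
assert c + s + i == reference_length   # kept + substituted + inserted

wer = (s + i + d) / (s + d + c)
print(f"WER = {wer:.1%}, WAcc = {1 - wer:.1%}")  # WER = 48.3%, WAcc = 51.7%
```

If either assertion fails for your own tallies, a word was probably miscounted while correcting.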

Let's proceed with program B. Here is its output when run on the audio clip:

Output of program B (To be corrected)

How much wood could a woodchuck chuck chuck
If a woodchuck could chuck wood?
As much wood a woodchuck could chuck,
If a woodchuck could chuck wood.

Edit Sequence

Now correct the output of program B. Since you already know the reference text from earlier, you can save time by correcting the output using an automated WER calculation program. Alternatively, you can of course correct the transcript by hand again.

Here is the edit sequence produced by my program (an inserted word is written [+word] and a deleted word [-word]):

How much wood could a woodchuck chuck [-chuck]
If a woodchuck could chuck wood
As much wood [+as] a woodchuck could chuck
If a woodchuck could chuck wood
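An edit listing like the one above can be approximated with Python's standard `difflib` module. To be clear, this is a different technique from the alignment my program uses: `difflib.SequenceMatcher` is a matching heuristic that does not guarantee the fewest possible edits, so treat its tallies as an estimate. The helper names `words` and `edit_listing` are hypothetical, and in this sketch an inserted word is written [+word], a deleted word [-word], and a substitution shows up as a deletion followed by an insertion.

```python
import difflib
import re


def words(text):
    """Lowercase and drop punctuation so "wood?" and "wood" compare equal."""
    return re.findall(r"[a-z0-9']+", text.lower())


def edit_listing(output_text, reference_text):
    """Mark the words to delete from, or insert into, the program's output."""
    out, ref = words(output_text), words(reference_text)
    matcher = difflib.SequenceMatcher(None, out, ref, autojunk=False)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pieces.extend(out[i1:i2])
        else:
            # "delete", "insert", or "replace": one or both slices are non-empty.
            pieces.extend(f"[-{w}]" for w in out[i1:i2])
            pieces.extend(f"[+{w}]" for w in ref[j1:j2])
    return " ".join(pieces)


output_b = """How much wood could a woodchuck chuck chuck
If a woodchuck could chuck wood?
As much wood a woodchuck could chuck,
If a woodchuck could chuck wood."""

reference = """How much wood could a woodchuck chuck
If a woodchuck could chuck wood?
As much wood as a woodchuck could chuck,
If a woodchuck could chuck wood."""

listing = edit_listing(output_b, reference)
print(listing)
```

For program B the listing should contain a single deletion of an extra "chuck" and a single insertion of the missing "as".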

WER Calculation (Program B)

You can now calculate WER and WAcc for program B.

The edit sequence above contains 26 correct words, 0 substitutions, 1 insertion, and 1 deletion. You can plug these numbers into the WER formula:

\(\displaystyle WER = \frac{S+I+D}{S+D+C} = \frac{0+1+1}{0+1+26} = \frac{2}{27} = 7.4\% \)

The Word Accuracy rate is:

\(W_{Acc} = 1 - WER = 1 - .074 = 92.6\%\)

Repeat this process for the remaining transcription programs.

Identify the most accurate program that meets your needs, then use it to transcribe all of your audio.
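Once every program has been tallied, the comparison itself is easy to automate. The sketch below ranks the two programs from this example by WER using the formula from this article; the dictionary layout and the names `wer`, `tallies`, `scores`, and `ranking` are my own choices for illustration.

```python
def wer(s, i, d, c):
    """WER as defined in this article: (S + I + D) / (S + D + C)."""
    return (s + i + d) / (s + d + c)


# Tallies taken from the two edit sequences in this article.
tallies = {
    "program A": dict(s=8, i=2, d=4, c=17),
    "program B": dict(s=0, i=1, d=1, c=26),
}

scores = {name: wer(**counts) for name, counts in tallies.items()}
ranking = sorted(scores, key=scores.get)  # lowest WER first

for name in ranking:
    print(f"{name}: WER {scores[name]:.1%}, WAcc {1 - scores[name]:.1%}")
# program B: WER 7.4%, WAcc 92.6%
# program A: WER 48.3%, WAcc 51.7%
```

Adding another program is a matter of adding one more entry to `tallies`.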

Conclusion

To conclude this article, here are some of the key points that I hope you have taken away from reading it:


1. https://en.wikipedia.org/wiki/Word_error_rate
2. https://en.wikipedia.org/wiki/Sequence_alignment
3. https://en.wikipedia.org/wiki/Edit_distance
4. Wood Chucks by Mother Goose. https://www.poetryfoundation.org/poems/42904/how-much-wood-could-a-woodchuck-chuck-

Bryce's WER Calculation Program: https://github.com/Bryce-Summers/Automatic-WER-Calculation