ASR How Does it Work? The New Generation of ASR Transcription (2024)

Automatic speech recognition technology (ASR) is having a great impact on the world. This technology is already transforming the way students learn, employees work and society functions. ASR is also creating opportunities to assist specific communities of individuals, such as those navigating life or their studies with disabilities.

While ASR is a valuable tool that many people are using in their day-to-day lives, not everyone understands how it works or why it’s so useful. Misconceptions about the role of ASR and its capabilities persist. Delve deeper into the ways this technology works, and how ASR is supporting people with disabilities while simultaneously improving efficiency and saving time for millions of professionals.

Table of Contents:

What is ASR?
How does ASR transcription work?
What is ASR used for?
How does Verbit’s ASR work specifically?
How is the accuracy of ASR measured?

What is ASR?

An automatic speech recognition system involves voice recognition software that processes human speech and turns it into text. While many people are only now learning the capabilities of these types of tools, engineers and researchers have spent decades working to build such systems. In fact, the first attempts to create speech recognition tools date back to 1952. At that time, three Bell Labs researchers built a system called “Audrey” for single-speaker digit recognition.

The capabilities of today’s ASR far exceed those of its predecessors. The reason for this is that innovations in the realm of artificial intelligence are allowing engineers to develop sophisticated software that responds to human voices. Modern systems can even differentiate speakers, accents and more.

How does ASR transcription work?

From the user’s perspective, setting up ASR and capturing a recording is easy. Essentially, the process works as follows:

An individual or a group speaks, and the ASR software detects this speech.
The device then creates a wave file of the words it hears.
The wave file is cleaned to delete background noise and normalize the volume.
The software then breaks down and analyzes the filtered wave file in sequences.
The automatic speech recognition software analyzes these sequences and employs statistical probability to determine the whole words. Next, it works them into complete sentences.
Some technology providers’ ASR service includes editing by professional human transcribers. Adding this layer to the process helps correct any errors to achieve greater accuracy.
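
The volume normalization and sequencing steps above can be illustrated with a minimal sketch. It assumes the audio is already decoded into a list of floating-point samples, uses simple peak normalization as a stand-in for whatever cleanup a real ASR engine performs, and leaves out noise removal entirely. The function names, frame size and hop size are hypothetical and chosen only for illustration.

```python
from typing import List


def normalize_peak(samples: List[float], target_peak: float = 1.0) -> List[float]:
    """Scale samples so the loudest one reaches target_peak (assumed approach)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Silent or empty input: nothing to scale.
        return list(samples)
    scale = target_peak / peak
    return [s * scale for s in samples]


def split_into_frames(samples: List[float], frame_size: int, hop_size: int) -> List[List[float]]:
    """Break the signal into overlapping, fixed-length sequences (frames)."""
    if frame_size <= 0 or hop_size <= 0:
        raise ValueError("frame_size and hop_size must be positive")
    frames = []
    # Only full frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_size + 1, hop_size):
        frames.append(samples[start:start + frame_size])
    return frames


if __name__ == "__main__":
    audio = [0.1, -0.5, 0.25, 0.0, 0.3, -0.2, 0.05, 0.4]
    cleaned = normalize_peak(audio)
    for frame in split_into_frames(cleaned, frame_size=4, hop_size=2):
        print(frame)
```

In a real system, each frame would then be passed to the recognition model, which estimates the most probable words for the sounds in that frame.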


What is ASR used for?

A variety of industries use ASR for many different purposes. For instance, ASR technology is becoming a standard tool for professionals in higher education, legal, finance, government, health care and media. In all these fields, conversations are continuous and it’s often necessary to capture word-for-word records. Here are some examples of ASR use cases in different industries.

Legal: In legal proceedings, it’s often crucial to capture every word that a witness or other involved party states. Also, there’s currently a shortage of court reporters, making it challenging to carry out this important step. Digital transcription and the ability to scale are key solutions that ASR technology offers those in this industry.
Higher education: ASR captions and transcriptions allow universities to support students navigating hearing loss or other disabilities in classrooms. It can also serve the needs of students who are non-native speakers, commuters, or who have varying learning needs. For instance, students with ADHD often focus better when they have access to captions.
Health care: Doctors are using ASR to transcribe notes from meetings with patients or document steps during surgeries.
Media: Media production companies use ASR to provide live captions and media transcription for the content they produce, which they must do according to the FCC (Federal Communications Commission) and other guidelines.
Corporate: Companies use ASR captioning and transcription to provide more accessible training materials and create inclusive environments for employees with differing needs.

What are the advantages of automatic speech recognition vs. traditional transcription?

Aside from the growing shortage of skilled traditional transcribers, ASR machines can help to improve efficiencies for captions and transcriptions. The technology can differentiate between voices in conversations, lectures, meetings and proceedings to provide an understanding of who said what. Speaker differentiation can be helpful since disruptions among participating parties are common in conversations with multiple stakeholders.

Users can upload hundreds of related documents, including books, articles and more into the ASR machine to train it to get smarter. The technology can absorb this plethora of information faster than a human can. It can then begin recognizing different accents, dialects and terminology more accurately.

However, the ideal format involves using human intelligence to fact-check results that the artificial intelligence produces. This editing step is particularly important when the ASR is supporting accessibility initiatives where guidelines and laws require near-perfect accuracy.

How does Verbit’s ASR work specifically?

Verbit’s ASR machine works to provide captions and transcriptions for both live and recorded audio and video. It uses adaptive algorithms and three models that inform the ASR machine’s ability to perform precisely.

An acoustic model reduces background noise and echoes to cancel out factors that reduce the audio quality. This model also identifies speakers.
A linguistic model identifies specific terminology, recognizes different accents and dialects and differentiates between speakers.
A contextual events model incorporates current events, news, and relevant updates. By doing so, the technology incorporates new terms that enter the public dialogue.

Verbit’s automatic speech recognition system works live, or users can choose to upload completed recordings. After the user uploads those files, the proprietary speech-to-text engine gets to work.

Achieving accuracy is highly important to Verbit and its clients. In fact, laws like the Americans with Disabilities Act often require higher levels of accuracy from our clients. To accommodate this need, Verbit takes the process one step further by using two skilled human transcribers per project to edit and review the ASR’s results. Once the process is complete, users can download the file immediately in the format of their choice.

How is the accuracy of ASR measured?

ASR alone isn’t always accurate, and its accuracy varies greatly based on several factors, including how much training went into developing the system. As a result, some ASR systems perform much better than others. The metric used to measure the accuracy of ASR is called the word error rate (WER).

The WER counts three categories of errors: substitutions, deletions and insertions.

Substitutions: This happens when the ASR replaces the correct word with an incorrect one. For example, a speaker says, “Don’t make a fuss,” and the ASR writes, “Don’t make a bus.” Advanced AI takes the context into consideration to reduce these types of errors.

Deletions: A deletion is when the ASR leaves out a word. Omitting a word can change the meaning and make for a confusing transcription. Just consider the difference between “She did not complete the task” and “She did complete the task.”

Insertions: Sometimes, ASR will include words that the speaker did not say. Maybe the speaker said, “We’re ahead of schedule,” but the ASR transcribes, “We’re too ahead of schedule.” In this case, maybe another speaker, background noise or another issue led to the extra word.

Calculating the WER means dividing the total number of errors (substitutions, deletions and insertions) by the number of words the speaker actually said in the sample audio. If there are 100 words in the sample and 20 errors, the WER is 0.2, or 20%. ASR can produce transcripts with impressively low WERs. However, many variables impact accuracy.
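
As a rough illustration of this calculation, the sketch below counts substitutions, deletions and insertions by aligning the ASR output against a reference transcript with a standard word-level edit distance. The names `WerResult` and `word_error_rate` are hypothetical, the text is only lowercased and split on spaces, and production tools typically normalize punctuation and numbers more carefully.

```python
from typing import NamedTuple


class WerResult(NamedTuple):
    substitutions: int
    deletions: int
    insertions: int
    reference_words: int

    @property
    def wer(self) -> float:
        # WER = (S + D + I) / number of words in the reference.
        errors = self.substitutions + self.deletions + self.insertions
        return errors / self.reference_words


def word_error_rate(reference: str, hypothesis: str) -> WerResult:
    """Align hypothesis (ASR output) to reference (what was said) word by word."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # cost[i][j] = (total errors, subs, dels, ins) for ref[:i] versus hyp[:j].
    cost = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)  # every hypothesis word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]  # words match, no new error
                continue
            t, s, d, n = cost[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, n)
            t, s, d, n = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            # Keep whichever edit gives the fewest total errors.
            cost[i][j] = min((substitution, deletion, insertion), key=lambda c: c[0])

    _, subs, dels, ins = cost[len(ref)][len(hyp)]
    return WerResult(subs, dels, ins, len(ref))


if __name__ == "__main__":
    result = word_error_rate("she did not complete the task", "she did complete the task")
    print(result, round(result.wer, 3))
```

Applied to the examples above, “fuss” transcribed as “bus” counts as one substitution in a four-word reference, for a WER of 0.25.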

When using ASR to transcribe poor-quality audio, speakers with heavy accents, recordings that include unusual niche language and other challenges, the transcript will likely have a worse WER. In real-world scenarios, background noise or speakers who stand too far from or too close to a microphone can impact the ability of ASR to produce quality results.

Training the AI to handle these issues can reduce errors, but the best way to provide high quality is to have humans edit the results. When it comes to accessibility, adding this layer is often necessary to provide an equitable experience.

Automatic speech recognition technology is now expected and evolving

Consumers and professionals now expect to reap the benefits that automatic speech recognition offers. The days of jotting down notes by hand, figuring out which button turns the lights on and rushing home after forgetting to lock the door are gone. You can complete all of these tasks with your voice. Additionally, these features will become more secure as the technology learns to differentiate between different voices.

ASR software and ASR transcription services will only continue to disrupt the way we function in our classrooms, workplaces and homes. With more efficiencies and use cases, this technology will continue to evolve to best serve those who rely on it.

Verbit’s mature ASR is supporting universities, businesses and other organizations worldwide. Reach out to us today to learn how our accessibility solutions are helping create more inclusive environments and new opportunities for people with disabilities.