Building Speech-to-Text in React with the Web Speech API
Imagine filling out a form by simply speaking your thoughts—no typing, no distractions, just pure expression. With modern browser APIs and a bit of clever React logic, this isn’t science fiction. It’s entirely possible today using the Web Speech API.
In this guide, we’ll walk through how to implement continuous, uninterrupted speech recognition in a React application. You’ll learn how to build a reusable useSpeechRecognition hook, integrate it smoothly into your UI, and handle real-world edge cases that browsers don’t warn you about—like sudden disconnections mid-sentence.
By the end, you’ll have everything you need to add natural, accessible voice input to your forms with clean, maintainable code.
Part 1: Creating the useSpeechRecognition Hook
What Is This Hook For?
At its core, the useSpeechRecognition hook brings voice-to-text capability to your React app using only the browser’s built-in Web Speech API—no SDKs, no external dependencies.
It abstracts away browser inconsistencies and reliability issues, delivering a smooth, continuous dictation experience even when the underlying API gives up too easily.
How Does It Work?
The magic starts with the SpeechRecognition interface, available in Chromium-based browsers like Chrome and Edge. Here's how it’s configured:
const recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
recognition.continuous = true;
recognition.interimResults = true;
Let's break that down:
- continuous: true means recognition doesn't stop after a few seconds of silence.
- interimResults: true allows real-time updates as the user speaks, showing partial results before they finish.
The main events we listen for:
- onresult: Fires as speech is recognized, returning both final and interim (in-progress) transcripts.
- onend: Triggers when recognition stops, sometimes unexpectedly.
- onerror: Handles issues like mic permissions or network errors.
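The onresult event deserves a closer look, since it delivers final and interim results mixed together in one list. Here is a minimal sketch of how a handler might separate them, assuming the standard SpeechRecognitionEvent shape (event.results holds SpeechRecognitionResult objects, each with an isFinal flag and a top alternative at index 0); the helper name collectTranscripts is our own, not part of the API:

```javascript
// Walk event.results starting at event.resultIndex and split the text
// into finalized phrases vs. the still-changing interim phrase.
function collectTranscripts(results, startIndex) {
  let finalTranscript = "";
  let interimTranscript = "";
  for (let i = startIndex; i < results.length; i++) {
    const text = results[i][0].transcript; // top recognition alternative
    if (results[i].isFinal) {
      finalTranscript += text;
    } else {
      interimTranscript += text;
    }
  }
  return { finalTranscript, interimTranscript };
}

// Inside the hook it would be wired up roughly like this:
// recognition.onresult = (event) => {
//   const { finalTranscript, interimTranscript } =
//     collectTranscripts(event.results, event.resultIndex);
//   if (finalTranscript) onResult(finalTranscript);
//   setInterimTranscript(interimTranscript);
// };
```

Keeping the splitting logic in a pure function like this makes it easy to unit-test without a real microphone.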
The Big Problem: Recognition Dies Mid-Sentence
Even with continuous: true, most browsers will abruptly stop listening after a few seconds. That’s not acceptable for long-form responses.
The solution? Auto-restart on `onend`.
We use a shouldRestartRef flag (a mutable ref to avoid re-renders) to decide whether the stop was intentional (user clicked "Stop") or forced (browser quit on its own).
recognition.onend = () => {
if (shouldRestartRef.current) {
try {
recognition.start();
} catch {
setIsListening(false);
shouldRestartRef.current = false;
}
} else {
setIsListening(false);
}
};
This small piece of logic ensures users can speak uninterrupted for as long as they want—whether it's 10 seconds or 2 minutes.
How to Use the Hook
Using the hook in a component is straightforward:
const {
isListening,
isSupported,
interimTranscript,
startListening,
stopListening,
} = useSpeechRecognition({
onResult: (transcript) => {
setUserResponse((prev) => prev + " " + transcript);
},
onError: (error) => {
if (error.type !== "no-speech") {
toast.error(error.message);
}
},
});
Key features returned by the hook:
- isListening: Control UI state (e.g., pulse animation)
- isSupported: Hide the mic button entirely in unsupported browsers
- interimTranscript: Show what the user is currently saying in real time
You can plug this into any form field—textareas, input boxes, comment forms—and instantly enable voice input.
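When appending recognized phrases to an existing field value, a small helper keeps the joins clean. This is a sketch of one reasonable approach (the name appendTranscript and the trimming rules are our own choices, not part of the hook's API):

```javascript
// Append a recognized phrase to the current field value, inserting a
// single separating space and ignoring whitespace-only transcripts.
function appendTranscript(prev, transcript) {
  const addition = transcript.trim();
  if (!addition) return prev;          // nothing meaningful was said
  return prev ? prev + " " + addition : addition;
}

// Used with the hook's onResult callback:
// onResult: (transcript) =>
//   setUserResponse((prev) => appendTranscript(prev, transcript)),
```

This also avoids the leading space you would get from a naive `prev + " " + transcript` when the field starts out empty.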
Trade-offs and Considerations
While powerful, the Web Speech API comes with some important trade-offs. Being honest about these helps set realistic expectations.
| Consideration | Benefit | Trade-off |
|---|---|---|
| Auto-restart | No mid-speech cutoffs | May resume after intentional pauses |
| continuous: true | Supports long dictation | Increases CPU/battery usage |
| Browser-native | Zero external dependencies | Limited browser support |
| Interim results | Instant feedback | Requires UX handling of live-updating text |
💡 Note: Most implementations send audio to cloud-based services (like Google’s speech servers), so an active internet connection is required.
Best Practices We Followed
- Graceful degradation: The form works perfectly without voice input—we treat speech as an enhancement.
- Smart error handling: Not all errors need to bother the user. For example, "no-speech" means silence, not failure—so we ignore it.
- Clean up after yourself: On unmount, stop recognition and reset refs to prevent memory leaks:
useEffect(() => {
return () => {
shouldRestartRef.current = false;
if (recognitionRef.current) {
recognitionRef.current.stop();
recognitionRef.current = null;
}
};
}, []);
- Hide vs. disable: We hide the mic button entirely in Firefox instead of showing a disabled state, reducing confusion.
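The "smart error handling" practice above can be sketched as a simple filter over the error code reported by the SpeechRecognitionErrorEvent. The set below reflects our own judgment about which codes are silent by nature ("no-speech" fires when nothing was heard, "aborted" when we stopped recognition ourselves); adjust it to taste:

```javascript
// Error codes that reflect normal operation rather than real failures.
const SILENT_ERRORS = new Set(["no-speech", "aborted"]);

// Decide whether an error code warrants surfacing to the user.
function shouldNotifyUser(errorCode) {
  return !SILENT_ERRORS.has(errorCode);
}

// Inside the hook:
// recognition.onerror = (event) => {
//   if (shouldNotifyUser(event.error)) {
//     onError({ type: event.error, message: "Speech recognition error: " + event.error });
//   }
// };
```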
Part 2: Integrating Voice Input into Your UI
Now that the engine is built, let’s focus on user experience—how voice control feels in practice.
Designing for Clarity and Feedback
Good voice UX isn’t just about functionality; it’s about making the user feel heard.
We integrated the hook into a comprehension assessment form with three key elements:
- Mic Button – Visible only in supported browsers, styled as a ghost button until active.
- Listening Indicator – A badge with a pulsing dot and live transcript preview.
- Dynamic Text Sync – Each recognized phrase appends cleanly to the form’s textarea.
Let’s look at the listening indicator in action:
{isListening && (
<div className="flex items-center gap-2">
<Badge variant="secondary" className="gap-1.5">
<span className="relative flex h-2 w-2">
<span className="absolute inline-flex h-full w-full animate-ping rounded-full bg-primary opacity-75" />
<span className="relative inline-flex h-2 w-2 rounded-full bg-primary" />
</span>
Listening...
</Badge>
{interimTranscript && (
<span className="text-sm text-muted-foreground italic truncate">
"{interimTranscript}"
</span>
)}
</div>
)}
That little animated dot gives instant visual confirmation: Yes, we're listening.
And showing the interimTranscript? That’s crucial. It reassures users they’re being understood—even before they’re done speaking.
What Does the Experience Feel Like?
Imagine a user reading a technical blog post and clicking “Test Your Understanding.” They see a mic icon next to the instructions.
Here’s what happens next:
- Click the mic → button pulses, "Listening..." appears
- Say: _"The main concept is dependency injection, which allows components to receive their dependencies from outside..."_
- Words stream into the textarea as they speak
- Click the mic again or submit the form → listening stops
No jarring cutoffs, no confusing errors—just fluid, intuitive input.
UI Decisions and Why They Matter
| Decision | Why It Works |
|---|---|
| Hide mic in unsupported browsers | Cleaner than showing a broken/disabled button |
| Append with space logic | Avoids run-on words: "hello world", not "helloworld" |
| Pulse animation | Strong visual signal that the mic is live |
| Show interim transcript | Builds trust through immediate feedback |
Every choice here was shaped by one goal: make voice input feel natural.
Handling Edge Cases Gracefully
Even with great UX, we must anticipate when things change outside the voice flow.
For example: What if the user submits the form while still talking?
We auto-stop recognition when:
- The form is submitting (isLoading)
- Results are being displayed (showResults)
useEffect(() => {
if (isLoading || showResults) {
stopListening();
}
}, [isLoading, showResults, stopListening]);
This prevents background recognition from continuing after the task is complete—a subtle but important detail.
Final Thoughts and Key Takeaways
Adding speech-to-text to your React app doesn’t require AI platforms or expensive APIs. With the Web Speech API and a well-designed hook, you can deliver fast, responsive voice input that feels native.
Here’s what we learned:
- Browser APIs need wrappers: Raw APIs are unreliable; layer in logic for continuity and UX.
- Feature detection > error handling: Check isSupported early and adapt the UI accordingly.
- Design the unhappy path: Unsupported browsers, errors, permissions—these aren't edge cases, they're part of the journey.
- Feedback builds trust: Visual cues and live previews make users confident in the system.
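The feature-detection takeaway boils down to a few lines. A minimal sketch, assuming the usual prefixed/unprefixed lookup (Chrome and Edge currently expose the webkit-prefixed constructor; checking both covers unprefixed implementations too):

```javascript
// Return the SpeechRecognition constructor if the environment provides
// one, or null when voice input is unavailable.
function getSpeechRecognition(win = typeof window !== "undefined" ? window : {}) {
  return win.SpeechRecognition || win.webkitSpeechRecognition || null;
}

const SpeechRecognitionImpl = getSpeechRecognition();
const isSupported = SpeechRecognitionImpl !== null;
// In the hook, isSupported gates everything else: when it is false we
// return early and the UI simply never renders the mic button.
```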
While browser support remains limited—full in Chrome/Edge, limited in Safari, none in Firefox—the value in supported environments is undeniable. And as voice interaction becomes more expected, having a solid, reusable pattern ready will give you a head start.
So go ahead: give your users a voice—literally.
Browser Support Reference
| Browser | Speech Recognition Support |
|---|---|
| Chrome | ✅ Full |
| Edge | ✅ Full |
| Safari | ⚠️ Limited |
| Firefox | ❌ Not supported |