Building Speech-to-Text in React with the Web Speech API
Imagine filling out a form by simply speaking your thoughts—no typing, no distractions, just pure expression. With modern browser APIs and a bit of clever React logic, this isn’t science fiction. It’s entirely possible today using the Web Speech API.
In this guide, we’ll walk through how to implement continuous, uninterrupted speech recognition in a React application. You’ll learn how to build a reusable useSpeechRecognition hook, integrate it smoothly into your UI, and handle real-world edge cases that browsers don’t warn you about—like sudden disconnections mid-sentence.
By the end, you’ll have everything you need to add natural, accessible voice input to your forms with clean, maintainable code.
Part 1: Creating the useSpeechRecognition Hook
What Is This Hook For?
At its core, the useSpeechRecognition hook brings voice-to-text capability to your React app using only the browser’s built-in Web Speech API—no SDKs, no external dependencies.
It abstracts away browser inconsistencies and reliability issues, delivering a smooth, continuous dictation experience even when the underlying API gives up too easily.
How Does It Work?
The magic starts with the SpeechRecognition interface, available in Chromium-based browsers like Chrome and Edge. Here's how it’s configured:
const recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
recognition.continuous = true;
recognition.interimResults = true;
Let's break that down:
- continuous: true means recognition doesn't stop after a few seconds of silence.
- interimResults: true allows real-time updates as the user speaks, showing partial results before they finish.
The main events we listen for:
- onresult: Fires as speech is recognized, returning both final and interim (in-progress) transcripts.
- onend: Triggers when recognition stops, sometimes unexpectedly.
- onerror: Handles issues like mic permissions or network errors.
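The onresult event deserves a closer look, since it delivers final and interim results mixed together in one list. Here is a minimal sketch of how a handler might separate them, assuming the standard SpeechRecognitionEvent shape (event.results holds SpeechRecognitionResult objects, each with an isFinal flag and a top alternative at index 0); the helper name collectTranscripts is our own, not part of the API:

```javascript
// Walk event.results starting at event.resultIndex and split the text
// into finalized phrases vs. the still-changing interim phrase.
function collectTranscripts(results, startIndex) {
  let finalTranscript = "";
  let interimTranscript = "";
  for (let i = startIndex; i < results.length; i++) {
    const text = results[i][0].transcript; // top recognition alternative
    if (results[i].isFinal) {
      finalTranscript += text;
    } else {
      interimTranscript += text;
    }
  }
  return { finalTranscript, interimTranscript };
}

// Inside the hook it would be wired up roughly like this:
// recognition.onresult = (event) => {
//   const { finalTranscript, interimTranscript } =
//     collectTranscripts(event.results, event.resultIndex);
//   if (finalTranscript) onResult(finalTranscript);
//   setInterimTranscript(interimTranscript);
// };
```

Keeping the splitting logic in a pure function like this makes it easy to unit-test without a real microphone.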
The Big Problem: Recognition Dies Mid-Sentence
Even with continuous: true, most browsers will abruptly stop listening after a few seconds. That’s not acceptable for long-form responses.
The solution? Auto-restart on `onend`.
We use a shouldRestartRef flag (a mutable ref to avoid re-renders) to decide whether the stop was intentional (user clicked "Stop") or forced (browser quit on its own).
recognition.onend = () => {
if (shouldRestartRef.current) {
try {
recognition.start();
} catch {
setIsListening(false);
shouldRestartRef.current = false;
}
} else {
setIsListening(false);
}
};
This small piece of logic ensures users can speak uninterrupted for as long as they want—whether it's 10 seconds or 2 minutes.
How to Use the Hook
Using the hook in a component is straightforward:
const {
isListening,
isSupported,
interimTranscript,
startListening,
stopListening,
} = useSpeechRecognition({
onResult: (transcript) => {
setUserResponse((prev) => prev + " " + transcript);
},
onError: (error) => {
if (error.type !== "no-speech") {
toast.error(error.message);
}
},
});
Key features returned by the hook:
- isListening: Control UI state (e.g., pulse animation)
- isSupported: Hide the mic button entirely in unsupported browsers
- interimTranscript: Show what the user is currently saying in real time
You can plug this into any form field—textareas, input boxes, comment forms—and instantly enable voice input.
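When appending recognized phrases to an existing field value, a small helper keeps the joins clean. This is a sketch of one reasonable approach (the name appendTranscript and the trimming rules are our own choices, not part of the hook's API):

```javascript
// Append a recognized phrase to the current field value, inserting a
// single separating space and ignoring whitespace-only transcripts.
function appendTranscript(prev, transcript) {
  const addition = transcript.trim();
  if (!addition) return prev;          // nothing meaningful was said
  return prev ? prev + " " + addition : addition;
}

// Used with the hook's onResult callback:
// onResult: (transcript) =>
//   setUserResponse((prev) => appendTranscript(prev, transcript)),
```

This also avoids the leading space you would get from a naive `prev + " " + transcript` when the field starts out empty.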
Trade-offs and Considerations
While powerful, the Web Speech API comes with some important trade-offs. Being honest about these helps set realistic expectations.
| Consideration | Benefit | Trade-off |
|---|---|---|
| Auto-restart | No mid-speech cutoffs | May resume after intentional pauses |
| continuous: true | Supports long dictation | Increases CPU/battery usage |
| Browser-native | Zero external dependencies | Limited browser support |
| Interim results | Instant feedback | Requires UX handling of live-updating text |
💡 Note: Most implementations send audio to cloud-based services (like Google’s speech servers), so an active internet connection is required.
Best Practices We Followed
- Graceful degradation: The form works perfectly without voice input—we treat speech as an enhancement.
- Smart error handling: Not all errors need to bother the user. For example, "no-speech" means silence, not failure—so we ignore it.
- Clean up after yourself: On unmount, stop recognition and reset refs to prevent memory leaks:
useEffect(() => {
return () => {
shouldRestartRef.current = false;
if (recognitionRef.current) {
recognitionRef.current.stop();
recognitionRef.current = null;
}
};
}, []);
- Hide vs. disable: We hide the mic button entirely in Firefox instead of showing a disabled state, reducing confusion.
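The "smart error handling" practice above can be sketched as a simple filter over the error code reported by the SpeechRecognitionErrorEvent. The set below reflects our own judgment about which codes are silent by nature ("no-speech" fires when nothing was heard, "aborted" when we stopped recognition ourselves); adjust it to taste:

```javascript
// Error codes that reflect normal operation rather than real failures.
const SILENT_ERRORS = new Set(["no-speech", "aborted"]);

// Decide whether an error code warrants surfacing to the user.
function shouldNotifyUser(errorCode) {
  return !SILENT_ERRORS.has(errorCode);
}

// Inside the hook:
// recognition.onerror = (event) => {
//   if (shouldNotifyUser(event.error)) {
//     onError({ type: event.error, message: "Speech recognition error: " + event.error });
//   }
// };
```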
Part 2: Integrating Voice Input into Your UI
Now that the engine is built, let’s focus on user experience—how voice control feels in practice.
Designing for Clarity and Feedback
Good voice UX isn’t just about functionality; it’s about making the user feel heard.
We integrated the hook into a comprehension assessment form with three key elements:
- Mic Button – Visible only in supported browsers, styled as a ghost button until active.
- Listening Indicator – A badge with a pulsing dot and live transcript preview.
- Dynamic Text Sync – Each recognized phrase appends cleanly to the form’s textarea.
Let’s look at the listening indicator in action:
{isListening && (
<div className="flex items-center gap-2">
<Badge variant="secondary" className="gap-1.5">
<span className="relative flex h-2 w-2">
<span className="absolute inline-flex h-full w-full animate-ping rounded-full bg-primary opacity-75" />
<span className="relative inline-flex h-2 w-2 rounded-full bg-primary" />
</span>
Listening...
</Badge>
{interimTranscript && (
<span className="text-sm text-muted-foreground italic truncate">
"{interimTranscript}"
</span>
)}
</div>
)}
That little animated dot gives instant visual confirmation: Yes, we're listening.
And showing the interimTranscript? That’s crucial. It reassures users they’re being understood—even before they’re done speaking.
What Does the Experience Feel Like?
Imagine a user reading a technical blog post and clicking “Test Your Understanding.” They see a mic icon next to the instructions.
Here’s what happens next:
- Click the mic → button pulses, "Listening..." appears
- Say: _"The main concept is dependency injection, which allows components to receive their dependencies from outside..."_
- Words stream into the textarea as they speak
- Click the mic again or submit the form → listening stops
No jarring cutoffs, no confusing errors—just fluid, intuitive input.
UI Decisions and Why They Matter
| Decision | Why It Works |
|---|---|
| Hide mic in unsupported browsers | Cleaner than showing a broken/disabled button |
| Append with space logic | Avoids run-on words: "hello world", not "helloworld" |
| Pulse animation | Strong visual signal that the mic is live |
| Show interim transcript | Builds trust through immediate feedback |
Every choice here was shaped by one goal: make voice input feel natural.
Handling Edge Cases Gracefully
Even with great UX, we must anticipate when things change outside the voice flow.
For example: What if the user submits the form while still talking?
We auto-stop recognition when:
- The form is submitting (isLoading)
- Results are being displayed (showResults)
useEffect(() => {
if (isLoading || showResults) {
stopListening();
}
}, [isLoading, showResults, stopListening]);
This prevents background recognition from continuing after the task is complete—a subtle but important detail.
Final Thoughts and Key Takeaways
Adding speech-to-text to your React app doesn’t require AI platforms or expensive APIs. With the Web Speech API and a well-designed hook, you can deliver fast, responsive voice input that feels native.
Here’s what we learned:
- Browser APIs need wrappers: Raw APIs are unreliable; layer in logic for continuity and UX.
- Feature detection > error handling: Check isSupported early and adapt the UI accordingly.
- Design the unhappy path: Unsupported browsers, errors, permissions—these aren't edge cases, they're part of the journey.
- Feedback builds trust: Visual cues and live previews make users confident in the system.
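The feature-detection takeaway boils down to a few lines. A minimal sketch, assuming the usual prefixed/unprefixed lookup (Chrome and Edge currently expose the webkit-prefixed constructor; checking both covers unprefixed implementations too):

```javascript
// Return the SpeechRecognition constructor if the environment provides
// one, or null when voice input is unavailable.
function getSpeechRecognition(win = typeof window !== "undefined" ? window : {}) {
  return win.SpeechRecognition || win.webkitSpeechRecognition || null;
}

const SpeechRecognitionImpl = getSpeechRecognition();
const isSupported = SpeechRecognitionImpl !== null;
// In the hook, isSupported gates everything else: when it is false we
// return early and the UI simply never renders the mic button.
```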
While browser support remains limited—full in Chrome/Edge, limited in Safari, none in Firefox—the value in supported environments is undeniable. And as voice interaction becomes more expected, having a solid, reusable pattern ready will give you a head start.
So go ahead: give your users a voice—literally.
Browser Support Reference
| Browser | Speech Recognition Support |
|---|---|
| Chrome | ✅ Full |
| Edge | ✅ Full |
| Safari | ⚠️ Limited |
| Firefox | ❌ Not supported |