
GenAI Voice / Consumer Messaging

WhatsApp Regional Voice Over AI

A Smart Audio Playback layer for WhatsApp that renders text messages in the sender’s regional dialect with an authentic accent, rather than a generic synthetic voice.

Role
Product Manager and builder: framing, model selection, full-stack prototype.
Timeframe
December 2025 to January 2026
Stack
  • Next.js 14 frontend styled to match WhatsApp Web
  • FastAPI backend for AI orchestration
  • Google Gemini for dialect detection and script transliteration
  • ElevenLabs text-to-speech with pinned Voice IDs for high-fidelity regional accents (Hindi and Punjabi in the current build)
  • Experimental CarPlay surface at /carplay for hands-free playback

Why Now

Global text-to-speech has spent a decade optimizing the average voice and flattening everything underneath it. For any user whose primary communication carries a dialect or regional accent, the existing accessibility pipeline strips out the very thing that makes a message feel personal.

The timing bet is that hosted voice models finally cleared the fidelity bar in 2025. A pipeline that combines dialect detection with a voice model tuned to a specific accent now produces audio that a listener will accept as authentic, which was not true a year earlier.

The Problem

Two separate user pains meet on one surface. The first is accessibility: users who need or prefer audio playback receive a generic, affectless voice that erases the identity of the sender. The second is context: when messages are read in a voice that does not match the sender, the emotional register of the original message is lost.

From a product lens, this is a daily-engagement and retention surface. Audio playback raises session length, extends messaging into contexts where the screen cannot be used, and opens a wedge into in-car and wearable listening where messaging apps currently have no meaningful presence.

Product Bet

The core bet is that fidelity, not feature count, decides adoption. Users will open a synthesized audio clip once out of curiosity. They will only open it again if the voice feels like it could belong to the person who sent the message.

The practical evaluation target for this prototype was a short preview clip. A user should be able to listen to a few seconds of synthesized audio and accept it as a plausible voice note from the sender, rather than as a machine reading a message out loud.

What I Built

A WhatsApp Web clone with a play icon on every text message. Clicking the icon sends the message text through Gemini, which detects the dialect and produces the correct transliteration. The transliterated text is then sent to ElevenLabs against a pinned Voice ID chosen to match that dialect, which returns an audio clip that plays inline.
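The dialect-to-voice step described above can be sketched as a small routing function. This is an illustrative sketch only: the dialect codes and Voice ID strings are placeholders, not the actual pinned ElevenLabs IDs used in the prototype.

```python
# Sketch of the pinned Voice ID routing: a detected dialect code maps to a
# specific regional voice, with a generic fallback for unsupported dialects.
# All dialect labels and IDs below are hypothetical placeholders.

PINNED_VOICES = {
    "hi-IN": "voice_hindi_regional",    # placeholder Voice ID
    "pa-IN": "voice_punjabi_regional",  # placeholder Voice ID
}
DEFAULT_VOICE = "voice_generic"

def select_voice(dialect: str) -> str:
    """Map a detected dialect code to a pinned Voice ID,
    falling back to a generic voice for the long tail."""
    return PINNED_VOICES.get(dialect, DEFAULT_VOICE)
```

Keeping this mapping as plain data is what makes adding a new dialect a one-line change once a matching voice exists.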

The prototype also ships a CarPlay surface. A notification banner triggers hands-free audio playback, which represents the real adoption vector for the feature. Users who cannot look at a screen are the ones who need synthesized voice most, and designing for that context first keeps the product honest.

The backend and frontend are cleanly separated, so the voice pipeline can be swapped without touching the messaging interface. This matters for a prototype because the model layer is the part most likely to change as better options ship.
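The swappable-pipeline boundary can be expressed as a narrow interface the messaging UI depends on. A minimal sketch, assuming a single `render` entry point; the class and method names are hypothetical, and the real network calls to Gemini and ElevenLabs are elided.

```python
from dataclasses import dataclass
from typing import Protocol

class VoiceBackend(Protocol):
    """The only surface the messaging frontend sees; any model stack
    satisfying it can be swapped in without touching the UI."""
    def render(self, text: str) -> bytes: ...

@dataclass
class HostedBackend:
    """Stand-in for the Gemini + ElevenLabs pipeline (API calls elided)."""
    voice_id: str

    def render(self, text: str) -> bytes:
        # A real implementation would detect dialect, transliterate, then
        # synthesize audio; here we just tag the payload for illustration.
        return f"[{self.voice_id}] {text}".encode()

def playback(backend: VoiceBackend, message: str) -> bytes:
    """What the play icon ultimately calls: one backend, one message in."""
    return backend.render(message)
```

Because `playback` only knows the protocol, replacing the hosted pipeline with a distilled on-device model is a drop-in substitution.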

Tradeoffs

I did not build on-device synthesis. On-device would have solved the long-term cost story but would have added significant scope without changing what the prototype was designed to answer. The point of this build was to validate fidelity, not to prove unit economics.

I chose pinned Voice IDs over voice cloning of the sender. Cloning is the obvious future state, but it carries a consent surface that is too heavy for a first prototype.

Business Read

At messaging-app scale, a pure API pipeline is not viable on unit economics. The productionization path would use a distilled on-device model for the highest-volume dialects, and reserve hosted inference for the long tail.
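The hybrid routing described here reduces to a small decision function. A sketch under the stated assumption that a fixed set of high-volume dialects gets distilled on-device models; the dialect codes are illustrative, not a real traffic breakdown.

```python
# Illustrative production routing: high-volume dialects run on-device,
# the long tail stays on hosted inference. Dialect codes are placeholders.
HIGH_VOLUME_DIALECTS = {"hi-IN", "pa-IN"}

def route_inference(dialect: str) -> str:
    """Return which tier should synthesize audio for this dialect."""
    return "on_device" if dialect in HIGH_VOLUME_DIALECTS else "hosted"
```

The set membership check is the whole cost model: every dialect promoted into the on-device set removes its per-message API spend.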

The feature is best understood as a platform wedge rather than a standalone product. Once audio playback is native to a messaging app, the same pipeline extends into in-car listening, wearables, and any context where the screen is unavailable. Those are the surfaces where messaging apps currently lose time to voice assistants and podcasts.

Outcomes

  • End-to-end prototype shipping: WhatsApp Web UI, dialect detection, regional voice synthesis, and inline playback working across a full chat.
  • A hands-free CarPlay surface built into the prototype, which keeps the design anchored to the use case that actually needs synthesized voice.
  • A clear productionization path identified: distilled on-device models for the high-volume dialects, with hosted inference kept for the long tail.