We Stopped Typing
Inside the year voice AI became impossible to ignore
Imagine this: the year is 2035, and you’re having lunch at a bright, sunny cafe in Lisbon with a friend who only speaks Portuguese. As she speaks, her words arrive in your ears in English with no perceptible delay. Your earbuds boost her voice, suppress the background noise, and interface seamlessly with your glasses. There’s no way to tell from the outside, but those lightweight glasses have an invisible display built in, laying a minimal language overlay across your field of vision. When you respond, she hears you in Portuguese. Neither of you is looking at a phone, and you’ve forgotten you’re wearing earbuds or glasses at all. You’re just in the moment.
After you leave the cafe, you walk back to a coworking space for a meeting. On the way, you draft a message without speaking or typing. A thin wearable resting behind your ears picks up the neuromuscular signals from your jaw and face as you subvocalize the words. You ‘speak’ silently, and you don’t even need to glance at your portable computer’s screen to check the message before you send it. You briefly confirm the send with your OS agent before you arrive.
Once the meeting begins, you do not take notes. You simply set your portable computer on the desk, and it captures everything. You stopped carrying a laptop a couple of years ago; your phone and computer are now the same thing. If you need a larger screen, you pull a foldable portable display out of your backpack. Your phone tracks your eye movements when you look at the screen, so you rarely need to touch it anymore. It’s faster to type with your eyes and write with your mind. The daily experience of using your phone is so seamless you’d swear it was part of your body, and some days it really feels like it.
Some of this sounds like it’s pretty out there. And, compared to how things are today, it seems like a major jump. But everything I’m describing exists in prototype or early production today. It isn’t a matter of if this tech converges, but when.
For human beings, voice is the natural UX. We learn to speak before we learn almost anything else, and as a species we’ve only been typing for roughly 150 years. Computing interfaces have always moved towards less friction: keyboards & mice → touchscreens → voice. The only things faster than our hands are our voice and, eventually, our mind.
The next step isn’t adding another interface. The future of interaction removes the need for additional UI entirely: silent speech, predictive input, and thought-to-action design.
Major advances in 2026 are accelerating the inevitable voice revolution, with voice AI finally crossing the threshold from ‘interesting’ to astonishing.
Our Vocal Past
I’m sure many of you remember Dragon NaturallySpeaking. Back in the early 2000s, dictating to MS Word through Dragon felt like magic. Of course, you needed to train the software for an hour on your voice before it was usable, and punctuation was frustrating. It was cool to watch words magically pop onto the screen, even if you had to enunciate and slow down your speaking.
Ultimately, it was a miserable experience, and a great example of software that was a fantastic idea but too slow to be truly useful for the average person. If you could type reasonably fast, typing was still more productive than dictating. If you used Dragon, you’d have to go through your entire document afterwards and fix the commas, periods, and mis-dictations. A miracle of technology for those unable to type, but no fun for everyone else.
Speech-to-text has been quite the slog. When Apple introduced Siri in 2011, our phones finally had a voice interface, but we quickly learned its limitations. Amazon’s Alexa and Google Assistant soon followed, relying heavily on recurrent neural networks (RNNs) and long short-term memory (LSTM) networks. Google Assistant slowly became a more enjoyable experience, but that took years, and the technology behind these systems was fragmented and unable to reason about intent. Simply put, pre-LLM interfaces were not intelligent enough to make voice feel natural, which was obvious to anyone who used them for more than a couple of minutes.
When OpenAI released Whisper in September 2022, the accuracy floor for speech-to-text jumped overnight. Whisper was trained on 680,000 hours of multilingual audio, and while it wasn’t perfect, it was good enough and free. Most important of all, it could run locally.
In the years since Whisper was released, the world has been sprinting towards realistic, lifelike synthetic voices. ElevenLabs has become the top voice AI API for developers, and its latest models are often indistinguishable from a real recording. Millions of people talk to ChatGPT in advanced voice mode every day, and the interface has proven to be both useful and the most intuitive way to talk to an AI. On-device translation is now a reality on Apple and Android devices, both secure and quick. The barriers to language and communication are falling by the day.
Today, anyone can install a free and open source version of Whisper on their computer. Personally, I’m using MacWhisper. It nails punctuation and honestly works like magic. I can’t imagine using my Mac without it. I talk at my computer all day and still can’t get over how fast it is.
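If you want a sense of how little code this takes, here’s a minimal sketch using the open-source openai-whisper package; the file name and model size are placeholders, and apps like MacWhisper wrap these same models in a friendlier UI.

```python
# Minimal local dictation sketch with the open-source Whisper package.
# pip install -U openai-whisper (ffmpeg must also be on your PATH).
import whisper

# "base" is small and fast; "small", "medium", and "large" are slower but more accurate.
model = whisper.load_model("base")

# Transcription runs entirely on your own machine; no audio leaves the laptop.
result = model.transcribe("voice_note.m4a")  # placeholder file name
print(result["text"])
```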
What Works Today
When we talk about voice and AI, we’re really talking about a few things: speech-to-text, text-to-speech, and speech-to-speech. That covers transcription, dictation, translation, and audio generation. Each problem has been tackled in different ways, but some are closer to being truly solved than others.
Dictation (real-time speech-to-text) is solved on the computer. Whisper is free, and paid solutions mostly just add a nicer UX. Equally easy options exist for iOS, Android, Windows, and Linux.
Voice generation (text-to-speech) is nearly solved: tools like ElevenLabs’ v3 generate voices with sighs, whispers, laughs, and controllable tonal variation. Even a couple of years ago, TTS felt off. The latest models, like Sesame’s CSM, are so lifelike I forget I’m talking to an AI. Training on real conversations rather than scripted voice actors was one of the changes that made output this natural.
Translation is real-time and making major progress: Google Translate with Gemini can now do continuous speech-to-speech translation, and Apple’s Live Translate works the same way. Google Lens is fantastic for translating text as you walk around, and generating the spoken output takes a second or less. Not perfect, but getting there.
Meeting notes are solved. Just a few years ago we took notes by hand; can you believe that? If you don’t like Granola, use Notion’s AI Notetaker or Gemini. You get not only the transcript but also a summary, and the quality has improved drastically in just a couple of years.
Phone agents are here. Companies have already deployed ElevenLabs and Rime agents into the field. Voice agents can, on average, handle 90% of inbound calls at a fraction of the cost of a human agent, with ~10x faster resolution. Healthcare, finance, and legal are coming next.
What is not solved is also notable:
Real-time translation in medical and legal contexts. Errors are not acceptable here, and we have a while to go before we’re anywhere near error-free.
Noisy, multi-speaker environments where identifying who is speaking is still unreliable. If you’re in a social situation where multiple people are talking, what does the software translate?
Heavy accents and non-standard dialects are still a major problem for speech-to-text software.
More accurate systems take longer to respond, and that added latency makes the conversation feel less lifelike. There’s a real trade-off between accuracy and latency: do you want it fast and good, or slower and perfect?
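To make that trade-off concrete, here’s the naive shape of a speech-to-speech translation pipeline, sketched with the open-source Whisper package (whose built-in translate task only targets English) and the offline pyttsx3 engine. This is a toy illustration of the stages, not what Google or Apple actually ship, and every stage adds latency you can feel in conversation.

```python
# Toy speech-to-speech translation pipeline: recognize -> translate -> speak.
# pip install -U openai-whisper pyttsx3 (ffmpeg required for Whisper).
import whisper
import pyttsx3

model = whisper.load_model("small")  # bigger models are more accurate, but slower

# Stages 1 and 2: Whisper's "translate" task transcribes foreign speech and
# renders it in English. On a laptop CPU this alone can take seconds per utterance.
result = model.transcribe("portuguese_clip.wav", task="translate")  # placeholder file name
english_text = result["text"]

# Stage 3: speak the translation with a local TTS engine. Robotic, but fully offline.
engine = pyttsx3.init()
engine.say(english_text)
engine.runAndWait()
```

Collapsing that loop into the sub-second, full-duplex experience described at the top of this piece is exactly what the streaming models from Google, Apple, and the startups below are racing toward.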
The Sesame Breakthrough
I want to home in on Sesame, because the company has caught my attention and I can’t stop talking about it.
In a nutshell, Sesame released its Conversational Speech Model about a year ago. It generates speech with natural pauses, hesitations, breathing sounds, and tonal shifts. The result does not sound like a TTS engine. It sounds like a person.
I’m not the only one who’s impressed. Sesame raised a $250 million Series B in October, led by Sequoia. The company is hiring professionals in the wearables space and is clearly prototyping glasses to pair with its model. Who knows when we’ll see the design, but I’m excited.
Sesame is obviously not the only company in this space; Hume, Cartesia, OpenAI, and ElevenLabs all produce high-quality conversational audio. But Sesame’s demo convinced me that the uncanny valley of synthetic speech is behind us. If you don’t believe me, put on a pair of headphones and talk to Maya for 10 minutes. Let me know how it goes.
Companies are also experimenting with detecting and responding to human emotion in real time. You might be sad or angry, and traditional models won’t pick up on it; Hume AI’s Empathic Voice Interface does. Google took note and recently snagged the key people behind the software. Emotional intelligence is incredibly important for advancing the space, and Google knows it.
Companies to Watch
The Voice AI space is evolving fast, and the winners are already becoming clear.
ElevenLabs: The dominant platform for text-to-speech and the closest thing to the AWS of voice: a single company covering TTS, automatic speech recognition (ASR), real-time transcription, voice agents for business, and now music generation.
OpenAI: Advanced voice mode changed the game when it came out, and OpenAI is slated to release a new model in 2026 that offers more natural back-and-forth. The company is also working with Jony Ive on an audio device that may ship as early as 2027.
Google: Gemini is a powerhouse for translation, transcription, and dictation. Millions of smart home devices already run Gemini, and it has become the default assistant on Android devices globally.
Apple: On-device translation is already happening, and expanded Live Translate is slated to be released later in 2026. With wearables on the horizon, voice will become increasingly important for Apple to nail.
Sesame: Incredibly realistic, supporting more languages, and slated to come out of beta this year. The AI glasses alone are worth keeping an eye on.
Fish Audio: The leader in TTS for Asian languages. Cheaper than ElevenLabs, and they open-sourced their distilled model. If you’re building for Chinese, Japanese, or Korean, Fish is the best option.
AlterEgo: Spun out of the MIT Media Lab, they’re building a non-invasive wearable that reads subvocalizations from your face and jaw so you can communicate silently. Up to 100 words per minute and climbing. Voice AI will expand to include silent speech, and this is a potential breakthrough for restoring speech to people with ALS and MS.
Tokyo’s Voice AI Panel
This week I attended a Voice AI panel at Google’s Tokyo office in Shibuya, hosted by AI Founder’s Mindset.
Three panelists: Abhishek Gupta from OneInbox, Donnie Chen from Agora, and Sho Akiyama from Mercari. Each brought a different perspective to the table.
Here are some of their top insights:
Using Voice AI
Argue with AI out loud to think better. Sho brainstorms scripts and screenplays by talking to ChatGPT voice while driving around Tokyo. He treats it as an argumentative co-writer. The real-time back-and-forth is qualitatively different from typing prompts.
Generate reference tracks to show creative intent. Sho uses ElevenLabs to produce draft voice acting in English and Japanese so voice actors hear the exact emotion and tone he wants without booking studio time. More iteration cycles, better end product.
Voice companions beat text companions for engagement. A text-based AI in a gaming chat room feels like a bot. A voice-based AI with natural speech patterns feels like someone who showed up to hang out. Japan is the first market proving this out.
On-device models are unlocking new product categories. Qwen 3.5 at 2 billion parameters runs on edge devices without a network connection. Conversational toys, companion devices, and character IP products that work without Wi-Fi or a monthly subscription (a quick sketch of this follows these tips).
Your existing headphones are now a translation device. Do not buy dedicated hardware. Google’s real-time translation works through any pair of headphones. Apple does it on-device through AirPods. The capability moved from product to feature.
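The on-device point is easy to try for yourself. Here’s a minimal sketch with Hugging Face transformers, using a small open-weights Qwen checkpoint as an illustrative stand-in; the model ID and prompt are my assumptions, not anything the panel specified.

```python
# Run a ~2B-parameter chat model locally: download once, then no network needed.
# pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B-Instruct"  # illustrative small checkpoint, not the panel's exact model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Format a single chat turn the way the model expects.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Tell me a 30-second bedtime story about a robot fox."}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)

# Print only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Pair that with local speech-to-text on the way in and TTS on the way out, and you have the skeleton of a toy that talks back without Wi-Fi.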
Building with Voice AI
“Voice does not have a UX. Voice is a UX.” Stop treating voice as a feature bolted onto a product. Study how call centers managed conversations for decades. The patterns for handling silence, confusion, and frustration already exist.
Design for silence, not just speed. When the AI does not know the answer, have it say “let me check on that” the way a human agent would. Users tolerate waiting. They do not tolerate dead air.
Latency is overrated. The industry obsesses over shaving milliseconds when the real problem is conversational naturalness. A 400ms response that feels human beats a 100ms response that feels robotic.
Summarize context, do not carry the full transcript. Extract key points from the first few minutes and carry the summary forward (a sketch of this pattern follows these tips). The user expects the AI to remember why the conversation started, not every word they said.
Use a second AI agent as a real-time monitor. Run an independent listener on every call that flags when the conversation goes off-track. Operators see a dashboard with red indicators and can intercept. Do not depend on the primary model to police itself.
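Here’s what the summarize-don’t-carry pattern from a couple of tips up can look like in practice. This is a minimal sketch assuming an OpenAI-style chat API; the model name and prompt are my own illustrative choices, not anything the panelists described.

```python
# Carry a short rolling summary forward instead of the full call transcript.
# pip install openai; assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

def update_summary(running_summary: str, new_turns: list[str]) -> str:
    """Fold the latest conversation turns into a compact running summary."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice; any capable chat model works
        messages=[
            {
                "role": "system",
                "content": (
                    "You maintain a running summary of a live voice call. "
                    "Keep it under 150 words: caller intent, key facts, "
                    "open questions, and promised follow-ups."
                ),
            },
            {
                "role": "user",
                "content": f"Current summary:\n{running_summary or '(none yet)'}\n\n"
                           "Latest turns:\n" + "\n".join(new_turns),
            },
        ],
    )
    return response.choices[0].message.content

# After every few turns, replace the raw transcript with the refreshed summary.
summary = ""
summary = update_summary(summary, [
    "Caller: my package never arrived",
    "Agent: let me check on that for you",
])
```

A second, independent agent can read the same rolling summary to flag calls that drift off-track, which is the monitoring pattern in the tip just above.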
One thing is very clear: we’re just getting started with the potential for voice products integrated into our daily lives. The words seamless and intuitive come to mind.
The Trajectory
Voice is the most natural UX. The last few decades forced us into interfaces that, in hindsight, may come to look unnatural. Each generation of technology has reduced friction, and voice is the next reduction. Beyond voice, subvocalization and neural interfaces point to a future where the gap between thinking and communicating approaches zero.
We’re not there yet, but we’re getting closer. The quality threshold has been crossed, the infrastructure has been built, and enterprise is buying. On-device processing improves every year, and open source is bringing these capabilities to the masses for cheap. Voice AI funding increases year over year; capital is following demonstrated demand.
This is a fun space to follow, and one I think we can all relate to. We use our voices to communicate every day, and our devices are finally catching up with our biology. The experience feels more natural all the time.
I’m excited to see translation become seamless, voice used for all kinds of interactions, and barriers drop as we embrace the most natural UX there is.
If you’ve been using Wispr Flow, MacWhisper, Granola, or other tools, I’d love to hear about it. How has it changed your work? Do you love it? Did you try it and find it cumbersome? Let us know in the comments.
Dictated using MacWhisper by Chris.