Tether’s AI Research Group just released QVAC SDK 0.12.0 with TurboQuant, a production implementation of Google’s memory compression algorithm, letting everyday devices run AI with ‘data center sized’ context.

How TurboQuant works: Turning memory from a wall into a window
Here’s the problem no one talks about: when you chat with an artificial intelligence (AI), it needs memory for more than just loading the model. It also needs working memory, called the Key-Value cache (KV cache), to remember everything you’ve already said.
A short prompt is fine. But a 100-page legal document? A few hours of conversation? That KV cache for a 4-billion-parameter model can reach 8 Gigabytes (GB) on its own. Four simultaneous sessions? 32 GB before you even load the model. That’s why most “AI assistants” forget everything after a few messages or force you into the cloud.
TurboQuant changes the math. It compresses that cache by up to 5x while preserving output quality. At the full 5x, a session that needed 8 GB fits in 1.6 GB. That laptop you already own just grew a data center’s worth of memory.
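To make the arithmetic above concrete, here is a minimal Python sketch. The 8 GB session size, the four sessions, and the 5x ratio come from the figures quoted above. The cache-size formula is the standard one for an uncompressed transformer KV cache. The model dimensions in the example (36 layers, 8 KV heads, head size 128, 16-bit values) are hypothetical, chosen only for illustration, and are not taken from the QVAC SDK or any specific model. The function names are likewise illustrative and not part of any SDK.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Uncompressed KV cache size in bytes.

    Each token stores one key vector and one value vector (hence the
    factor of 2) per layer and per KV head.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


def compressed_size(size, ratio):
    """Size after compressing by the given ratio (e.g. 5.0 for 5x)."""
    return size / ratio


# Figures quoted in the article.
SESSION_GB = 8.0   # KV cache for one long session
SESSIONS = 4       # simultaneous sessions
RATIO = 5.0        # best-case compression ratio

total_gb = SESSION_GB * SESSIONS
print(f"Uncompressed, {SESSIONS} sessions: {total_gb} GB")
print(f"Compressed, 1 session: {compressed_size(SESSION_GB, RATIO)} GB")
print(f"Compressed, {SESSIONS} sessions: {compressed_size(total_gb, RATIO)} GB")

# Hypothetical model dimensions (an assumption, not from the article):
# per-token cache cost for 36 layers, 8 KV heads, head size 128, 16-bit values.
per_token = kv_cache_bytes(n_layers=36, n_kv_heads=8, head_dim=128, n_tokens=1)
print(f"Per-token KV cost for the hypothetical model: {per_token} bytes")
```

Because the cache grows linearly with the number of tokens, any fixed compression ratio stretches the context that fits in a given memory budget by the same factor.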
QVAC SDK 0.12.0 upgrade in a nutshell
The QVAC SDK 0.12.0 introduces TurboQuant, compressing working memory up to 5x so a laptop can handle 262k-token sessions locally. It adds text-to-video, Apple Metal performance for Flux2-klein, and a Vision-Language-Action addon for robot control. Developers get coding assistant support, cross-platform voice with diarization, and millisecond-level classification. Under the hood, Fabric syncs to llama.cpp v8828 for broader Graphics Processing Unit (GPU) acceleration. This upgrade hands developers a local AI toolkit previously only possible in data centers.
How this changes the rules in the AI ecosystem
Until now, the implicit bargain was that long context meant cloud dependency. Tether just broke that equation. Some example use cases:
- A journalist can now analyze leaked documents on a laptop without uploading them.
- A doctor can run a local assistant on patient records that never leave the clinic.
- A developer can query an entire codebase without sending proprietary code to OpenAI.
For startups, this means building AI products without assuming access to expensive GPU clusters. For decentralized networks, it means inference can happen on edge nodes without massive memory requirements.
“Google’s research showed that AI memory could be compressed far more efficiently than most people assumed. Our work brings that breakthrough into production software that developers, startups, and users can actually build with. If long context AI only works inside the largest data centers, then AI will be shaped by whoever owns the most hardware. TurboQuant changes what local AI can do by making memory less of a wall.” – Paolo Ardoino, CEO of Tether.
More than just chat
The same memory breakthrough applies to other modalities. The Software Development Kit (SDK) already includes text-to-video and a Vision-Language-Action addon for robot control.
Imagine a local AI that can watch your workshop, remember a 3-hour repair session, and guide a robot arm through the same steps, all on a single edge device. Or a security camera that maintains context across days of footage without uploading anything to the cloud. TurboQuant makes the KV cache smaller, but the potential use cases just got much, much larger.
