
Osmosis extends trainable RL context length by 33 percent with fused logprob kernel on Qwen models

Osmosis: Cutting Memory in Long-Context RL with Fused Logprobs


Written by

Alan Rada

Reviewed by

Raghav Chopra

Updated 17:54 EDT

Jun. 12, 2026

Artificial intelligence (AI) infrastructure startup Osmosis announced a fused logprob kernel, a single optimized graphics processing unit (GPU) program used in large language model (LLM) training, that reduces reinforcement learning (RL) training time by 20 percent and increases the maximum trainable context length by 33 percent on Qwen3.5-122B-A10B.

The optimization targets an often-overlooked bottleneck in long-context RL: the log-probability (logprob) computation, where the logits tensor can reach nearly 8 gigabytes (GB) per tensor-parallel rank at a 16K context length.

Osmosis is growing fast and tackling some of the hardest infrastructure problems in RL: https://t.co/h9Kuwqy7p2
— Brad Flora (@bradflora) June 12, 2026

The logprob bottleneck you didn’t know about

When you are training AI models with long contexts, there are usually three big things that eat up all your memory: attention, activations, and that tricky logprob calculation at the very end. While we already have tools like FlashAttention to handle the first two, that logprob part has mostly been ignored until now.

Take the massive Qwen3.5 model, for example. At a 16K context length, the logits tensor produced in that final step can hit nearly 8 GB per GPU. That’s a huge memory wall. But the team at Osmosis realized something pretty clever: RL algorithms like Group Relative Policy Optimization (GRPO) don’t actually need the full logits tensor. They really only need one specific value for each token: the logprob of the selected token.
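To get a feel for the gap between what is stored and what is needed, here is a rough back-of-the-envelope sketch. The article does not give the per-rank vocabulary size or the logits data type, so the 120,000-entry vocabulary shard and float32 storage below are assumptions picked only to land in the same ballpark as the reported figure; the function names are made up for illustration.

```python
def logits_bytes(seq_len: int, vocab_per_rank: int, bytes_per_elem: int) -> int:
    """Bytes needed to hold a full [seq_len, vocab_per_rank] logits tensor."""
    return seq_len * vocab_per_rank * bytes_per_elem


def logprob_bytes(seq_len: int, bytes_per_elem: int) -> int:
    """Bytes needed to hold a single logprob per token."""
    return seq_len * bytes_per_elem


SEQ_LEN = 16 * 1024        # 16K context, as in the article
VOCAB_PER_RANK = 120_000   # ASSUMPTION: illustrative vocabulary shard per rank
BYTES_PER_ELEM = 4         # ASSUMPTION: float32 logits

full = logits_bytes(SEQ_LEN, VOCAB_PER_RANK, BYTES_PER_ELEM)
needed = logprob_bytes(SEQ_LEN, BYTES_PER_ELEM)

print(f"full logits tensor:    {full / 1e9:.2f} GB")
print(f"one logprob per token: {needed / 1e3:.1f} KB")
```

Under those assumptions the full tensor comes out just under 8 GB, while the values GRPO actually consumes fit in tens of kilobytes.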

By processing the sequence dimension in chunks and collapsing the vocabulary dimension on the fly, the fused kernel never materializes the full 7.8 GB logits tensor. Instead, it streams chunks and keeps intermediate values in on-chip Static Random-Access Memory (SRAM).
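The "collapse on the fly" idea can be illustrated with the standard online log-sum-exp recurrence. This is a plain-Python sketch for one token, not Osmosis's Triton kernel: the tiling over the vocabulary, the function names, and the tile size are all assumptions, and a real kernel would produce each tile of logits from a matmul rather than read it from a list.

```python
import math
from typing import Sequence


def logprob_reference(logits: Sequence[float], target: int) -> float:
    """Log-softmax over the full row, then pick the target entry."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[target] - lse


def logprob_streaming(logits: Sequence[float], target: int, tile: int) -> float:
    """Same value, visiting the vocabulary one tile at a time.

    Only three scalars survive between tiles: the running max, the running
    sum of exponentials (rescaled to that max), and the target logit.
    """
    running_max = -math.inf
    running_sum = 0.0
    target_logit = 0.0
    for start in range(0, len(logits), tile):
        block = logits[start:start + tile]
        new_max = max(running_max, max(block))
        # Rescale the old sum to the new max before adding this tile.
        running_sum = running_sum * math.exp(running_max - new_max) + sum(
            math.exp(x - new_max) for x in block
        )
        running_max = new_max
        if start <= target < start + tile:
            target_logit = block[target - start]
    return target_logit - (running_max + math.log(running_sum))


row = [0.5, -1.25, 3.0, 2.0, -0.75, 1.5, 0.0]  # toy logits for one token
streamed = logprob_streaming(row, target=2, tile=3)
assert math.isclose(streamed, logprob_reference(row, 2))
```

Because each tile is folded into three running scalars and then dropped, the full row of logits never has to exist in memory at once.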

How the fused kernel works

Before this, the standard way of doing things was to break the sequence into smaller chunks to save memory. While that helped a bit, it created a lot of extra “busy work” in the form of per-chunk overhead. For every single chunk, the GPU had to run several separate kernels (matmul, log-softmax, and gather) plus a tensor-parallel all-reduce.

For a long context (e.g., 12K tokens), that could mean looping through this process nearly 100 times. Osmosis fixed this by writing a single fused Triton kernel that combines the matmul, a streaming log-softmax, and the selected-token gather into one pass. Instead of writing massive intermediate tensors out to GPU memory, it processes the data right in on-chip SRAM and immediately clears it out. This reduces the data traffic from gigabytes down to just kilobytes. The kernel also respects gradient needs: if the output layer is frozen, it skips computing weight and bias gradients entirely.
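The "nearly 100" figure is easy to sanity-check. The sketch below assumes a chunk size of 128 tokens (the article only quotes that size for the fused kernel) and reads 12K as 12,288 tokens; the helper name is hypothetical.

```python
import math


def per_chunk_work(seq_len: int, chunk_size: int,
                   kernels_per_chunk: int, allreduces_per_chunk: int):
    """Count loop iterations, kernel launches, and all-reduces for chunking."""
    chunks = math.ceil(seq_len / chunk_size)
    return chunks, chunks * kernels_per_chunk, chunks * allreduces_per_chunk


# ASSUMPTIONS: 12K = 12 * 1024 tokens, chunk size 128.
# Three kernels per chunk (matmul, log-softmax, gather) and one all-reduce,
# as described in the article.
chunks, kernel_launches, allreduces = per_chunk_work(12 * 1024, 128, 3, 1)
print(chunks, kernel_launches, allreduces)
```

Under those assumptions the chunked loop runs 96 times, which means 288 kernel launches and 96 all-reduces for a single logprob pass.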

The results are pretty impressive: on the Qwen3.5-122B-A10B model, the fused kernel cut the total processing time by nearly 9 percent and made the logprob calculation itself over 40 percent faster than the older chunked approach.

Results that scale

The numbers tell the story. On Qwen3.5-35B-A3B at 12K context, chunking alone cut step time by 22 percent and peak memory by 16 percent. On the much larger Qwen3.5-122B-A10B at 12K context, chunking cut logprob time by 59 percent and training time by 20 percent, reducing peak memory by roughly 10 GB.

The real breakthrough came at 16K context. Baseline training on a single 8xH200 node (141 GB per GPU) would hit out-of-memory errors, exceeding 150 GB. The fused kernel with chunk size 128 dropped peak memory to 136 GB, comfortably under the limit. That’s a 33 percent increase in maximum trainable context length on the same hardware.
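For readers wondering where the 33 percent comes from, a quick check follows. It assumes the 12K runs represent the longest context the baseline could train on this node, which the article implies but does not state outright; the memory figures are the ones quoted above.

```python
BASELINE_CTX = 12 * 1024   # ASSUMPTION: longest context the baseline fits
FUSED_CTX = 16 * 1024      # context reached with the fused kernel

GPU_MEMORY_GB = 141        # H200 memory per GPU, from the article
FUSED_PEAK_GB = 136        # fused-kernel peak memory at 16K, from the article

context_gain = (FUSED_CTX - BASELINE_CTX) / BASELINE_CTX
headroom_gb = GPU_MEMORY_GB - FUSED_PEAK_GB

print(f"context gain: {context_gain:.0%}")
print(f"headroom at 16K: {headroom_gb} GB")
```

So the gain refers to context length (12K to 16K), and the 16K run fits with about 5 GB to spare per GPU.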
