
Orthrus is a dual-architecture framework for memory-efficient parallel token generation in LLMs. It combines the generation fidelity of autoregressive Large Language Models with the high-speed parallel decoding of diffusion models, producing lossless output at significantly higher throughput.
Key features include:

- **Lossless output**: generated text is identical to standard autoregressive decoding.
- **Parallel token generation** for significant speedups over sequential, token-by-token decoding.
- **Dual architecture** pairing the fidelity of an autoregressive model with the speed of a diffusion-style parallel generator.
- **Memory-efficient design** for practical deployment.
In evaluations, Orthrus outperforms existing speculative decoding methods while preserving exact output fidelity, making it a practical tool for researchers and developers working on efficient LLM inference.
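To make the "lossless" guarantee concrete: methods in this family (including the speculative decoding baselines mentioned above) typically generate several candidate tokens in parallel, then verify them against the base autoregressive model, keeping only the prefix the base model itself would have produced. Orthrus's exact verification scheme is not described here, so the sketch below is a generic greedy draft-and-verify loop with a toy model; `target_next` and the modulo-based "model" are illustrative stand-ins, not Orthrus APIs.

```python
from typing import Callable, List


def verify_draft(target_next: Callable[[List[int]], int],
                 context: List[int],
                 draft: List[int]) -> List[int]:
    """Accept the longest prefix of `draft` that the target model would
    itself have generated, then emit the target's own token at the first
    mismatch. The result matches pure autoregressive decoding exactly,
    which is what makes this style of parallel generation lossless."""
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok == expected:
            accepted.append(tok)        # draft agrees with target: keep it
        else:
            accepted.append(expected)   # mismatch: fall back to target's token
            break
    else:
        # Entire draft accepted: the target grants one extra "bonus" token.
        accepted.append(target_next(context + accepted))
    return accepted


# Toy deterministic "model": next token = (sum of context) % 7.
target = lambda ctx: sum(ctx) % 7

print(verify_draft(target, [1, 2, 3], [6, 5, 0]))  # → [6, 5, 3]
```

Here the draft's first two guesses match the target (6, then 5), the third does not, so verification stops after emitting the target's correct token 3. In the best case all drafted tokens are accepted in a single verification pass, which is the source of the speedup; in the worst case the output degrades gracefully to ordinary one-token-at-a-time decoding, never to a different output.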