Source: github.com

Lucebox: Optimized LLM Inference for RTX 3090

Lucebox is a hub for optimized LLM inference tailored for specific consumer hardware, enhancing AI performance and efficiency.

Tech Stack

  • C++
  • Python
  • C
  • Objective-C
Summary

Lucebox is an optimization hub for LLM inference, focused on hand-tuned performance for specific consumer hardware. The repository collects tailored kernels, speculative decoding, and quantization techniques that improve the efficiency of large language models (LLMs) on particular chips.
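The quantization techniques mentioned above trade a small amount of precision for smaller weights and faster memory-bound inference. As a rough illustration only (this is not code from the Lucebox repository), here is a minimal symmetric int8 weight quantizer:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: scale floats into [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [x * scale for x in q]

w = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
# Reconstruction error is bounded by half a quantization step (scale / 2).
print(max(abs(a - b) for a, b in zip(w, w_hat)))
```

Real inference stacks (e.g. the GGUF formats used by llama.cpp) quantize per block with stored scales rather than per tensor, but the precision/size trade-off is the same.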

Key features:

  • Megakernel Qwen3.5 - A highly efficient kernel for hybrid DeltaNet/Attention LLMs, achieving 1.87 tokens per joule on an RTX 3090.
  • DFlash DDtree Qwen3.5 - The first GGUF port of DFlash speculative decoding, demonstrating significant speed improvements over traditional methods.
  • Custom CUDA kernels - Optimized for tree-aware state rollback, enhancing performance on supported GPUs.
  • Benchmarking tools - Includes scripts for evaluating model performance across various tasks.
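Speculative decoding, used by the DFlash port above, speeds up generation by letting a cheap draft model propose several tokens that the expensive target model then verifies in a single pass. The toy sketch below shows the accept-longest-prefix idea in its greedy form; it is not the actual DFlash/DDtree code, and the function names and signatures are illustrative:

```python
def speculative_step(target_next, draft_next, ctx, k=4):
    """One greedy speculative-decoding step (toy sketch).

    The draft model proposes k tokens; the target model checks them and
    keeps the longest agreeing prefix, then appends one token of its own,
    so the output always matches pure target-model decoding.
    """
    # Draft phase: cheaply propose k tokens autoregressively.
    proposal, c = [], list(ctx)
    for _ in range(k):
        t = draft_next(c)
        proposal.append(t)
        c.append(t)

    # Verify phase: accept the longest prefix the target agrees with.
    accepted, c = [], list(ctx)
    for t in proposal:
        if target_next(c) == t:
            accepted.append(t)
            c.append(t)
        else:
            break

    # The target always contributes one correction/extension token.
    accepted.append(target_next(c))
    return accepted
```

With a perfect draft model every proposal is accepted, so one verification pass yields k + 1 tokens instead of 1; a poor draft degrades gracefully to 1 token per pass. Tree-based variants such as DDtree verify several candidate branches at once, which is why the repository's kernels need tree-aware state rollback.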

The project aims to democratize access to powerful AI capabilities by enabling efficient local AI deployment without vendor lock-in.

Tags

  • apple-silicon
  • cpp
  • cuda
  • kernel
  • llama-cpp
  • llm
  • local-ai
  • m5
  • m5-max
  • megakernel
  • nvidia-cuda
  • open-source-coding-agent
  • python
  • qwen
  • rtx3090