tloen/llama-int8
Status: stale, significant_divergence
Choose this fork if your priority is int8/quantized local inference and lower RAM use. Avoid it if you want an actively maintained base or compatibility with the current upstream Llama stack, because it is old, narrow, and materially behind upstream.
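For orientation, here is a minimal sketch of the int8 technique this fork applies. It is not the fork's own code (the fork patches Meta's reference implementation directly with bitsandbytes); this stand-in uses the Hugging Face transformers API, and the model id is a placeholder:

```python
# Sketch: loading a Llama-family model with int8 weights via bitsandbytes.
# Stand-in for the fork's patched loader, not its actual code path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any Llama-family checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # int8 weights
    device_map="auto",  # place/offload layers across available devices
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```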
Prefer this fork only if CPU-only execution is the main requirement and you accept a stale legacy codebase. For most adopters, upstream or the newer Llama repos are the safer default.
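If CPU-only execution is the deciding factor, the core setup is small. A hedged sketch using plain PyTorch/transformers as a stand-in for the fork's own loader (model id is a placeholder):

```python
# Sketch: forcing CPU execution for a causal LM. Illustrates the technique,
# not this fork's actual entry point.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_num_threads(8)  # tune to your physical core count

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
model.to("cpu").eval()

with torch.no_grad():
    ids = tokenizer("Hello", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=16)
print(tokenizer.decode(out[0]))
```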
Choose this fork if your goal is to experiment with or extend Mistral/Mixtral inside the original Llama inference code. Avoid it if you want current upstream maintenance, polished docs, or a stable baseline for Llama 2 itself.
soulteary/llama-docker-playground
Choose this fork if you value a packaged, interactive LLaMA 2 playground and are willing to accept stale lineage and more moving parts. Stick with upstream if you want the smallest reference implementation or alignment with Meta's newer Llama Stack direction.
OpenLMLab/OpenChineseLLaMA
Status: stale, significant_divergence
Choose this fork only if you specifically need its Chinese-model workflow, conversion tools, or ColossalAI/offload support. For new work, the fork is stale and materially diverged from an already-deprecated upstream, so it is better as a legacy reference than as a foundation for a fresh deployment.
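For comparison, the same offload idea this fork gets from ColossalAI is available today via Hugging Face accelerate's device_map. The sketch below is a stand-in under that assumption, not the fork's ColossalAI code:

```python
# Sketch: CPU/disk offload with transformers + accelerate (requires the
# `accelerate` package). Fills the GPU first, spills the rest to CPU and disk.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # placeholder checkpoint
    torch_dtype=torch.float16,
    device_map="auto",            # auto-place layers across GPU/CPU
    offload_folder="offload",     # spill remaining weights to disk
)
```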
Prefer this fork if you want a lightweight, more controllable legacy inference example. Prefer upstream or the newer Llama Stack repos if you need current guidance, maintained workflows, or broader ecosystem support.
galatolofederico/vanilla-llama
Choose this fork only if you specifically want its extra server and decoding-workflow features and are comfortable owning a stale legacy codebase. For most new adopters, upstream’s newer ecosystem is the better default.
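To give a sense of what "decoding-workflow features" means in practice, here is a self-contained top-p (nucleus) sampling step in plain PyTorch, the kind of helper such forks layer on the reference greedy loop; it is an illustration, not this fork's code:

```python
# Sketch: one top-p (nucleus) sampling step over a logits vector.
import torch

def sample_top_p(logits: torch.Tensor, p: float = 0.9, temperature: float = 0.8) -> int:
    """Sample one token id from logits (shape [vocab]) with nucleus filtering."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest prefix of tokens whose mass reaches p; zero out the rest.
    mask = cumulative - sorted_probs > p
    sorted_probs[mask] = 0.0
    sorted_probs /= sorted_probs.sum()
    next_sorted = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[next_sorted].item()
```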
Prefer this fork if your goal is legacy LLaMA inference on CPU and you want ready-made CPU/bfloat16 examples plus weight-merging helpers. Prefer upstream or newer Meta Llama repos if you need current support, newer workflows, or an actively maintained entry point.
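As a hedged sketch of what a weight-merging helper does: it concatenates Meta's model-parallel shards back into a single state dict. The dim map below is an assumption covering common parameter names in the original Llama layout (column-parallel layers split on dim 0, row-parallel on dim 1), not the fork's actual helper:

```python
# Sketch: merging two Llama model-parallel shards into one checkpoint.
import torch

CAT_DIM = {  # assumed split axes for representative parameter-name suffixes
    "wq.weight": 0, "wk.weight": 0, "wv.weight": 0, "w1.weight": 0, "w3.weight": 0,
    "wo.weight": 1, "w2.weight": 1,
    "tok_embeddings.weight": 1, "output.weight": 0,
}

def merge_shards(paths):
    shards = [torch.load(p, map_location="cpu") for p in paths]
    merged = {}
    for key in shards[0]:
        dim = next((d for s, d in CAT_DIM.items() if key.endswith(s)), None)
        if dim is None:
            merged[key] = shards[0][key]  # replicated (norms etc.): take one copy
        else:
            merged[key] = torch.cat([s[key] for s in shards], dim=dim)
    return merged

# merged = merge_shards(["consolidated.00.pth", "consolidated.01.pth"])
```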
modular-ml/wrapyfi-examples_llama
Choose this fork if Wrapyfi integration and deployment convenience matter more than staying current. Avoid it if you want the maintained upstream path or the newer Llama Stack ecosystem.
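Wrapyfi's draw here is streaming activations between processes or machines so one model spans several hosts. A minimal single-process PyTorch sketch of the underlying idea (partitioning layers across two devices) follows; Wrapyfi replaces the in-process handoff with middleware transport, and this is not its API:

```python
# Sketch: splitting a layer stack across two devices and passing activations.
import torch
import torch.nn as nn

class SplitStack(nn.Module):
    def __init__(self, layers: nn.ModuleList, split: int, dev_a: str, dev_b: str):
        super().__init__()
        self.first = nn.ModuleList(layers[:split]).to(dev_a)
        self.second = nn.ModuleList(layers[split:]).to(dev_b)
        self.dev_a, self.dev_b = dev_a, dev_b

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.to(self.dev_a)
        for layer in self.first:
            x = layer(x)
        x = x.to(self.dev_b)  # the hop Wrapyfi would make over the network
        for layer in self.second:
            x = layer(x)
        return x

# layers = nn.ModuleList(nn.Linear(64, 64) for _ in range(8))
# model = SplitStack(layers, split=4, dev_a="cpu", dev_b="cpu")
```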
Choose this fork only if you need a legacy, minimally changed Llama inference baseline. If you want the current recommended Meta path or the latest fixes, upstream is the better default.