Language Pipes
pip install language-pipes
The idea
P2P inference for open-source LLMs
Language models run their input through a long stack of transformer layers. Language Pipes cuts that stack into segments and hands each segment to a different machine, so the memory cost is shared across a network you control. No single node needs to hold the whole model and no central server sits between your nodes. It's peer-to-peer, decentralized, and Python-native.
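The splitting described above can be pictured as handing out contiguous ranges of layers to nodes. The following is a minimal sketch under stated assumptions, not Language Pipes' own placement logic: `split_layers`, the per-node capacities (counted in layers), and the 28-layer model are all hypothetical names and numbers chosen for illustration.

```python
def split_layers(num_layers, capacities):
    """Greedily assign contiguous layer ranges to nodes, in node order.

    capacities[i] is how many layers node i can hold (a hypothetical unit;
    a real deployment would budget by memory, not layer count).
    Returns half-open (start, end) ranges, one per node that received layers.
    """
    if sum(capacities) < num_layers:
        raise ValueError("the network cannot hold the whole model")
    segments = []
    start = 0
    for cap in capacities:
        if start >= num_layers:
            break  # every layer is already placed
        end = min(start + cap, num_layers)
        if end > start:
            segments.append((start, end))
        start = end
    return segments


# Hypothetical 28-layer model on two nodes that can hold 16 and 12 layers.
print(split_layers(28, [16, 12]))
```

No node in the result holds all 28 layers, which is the point: the largest segment, not the whole model, sets the memory a single machine needs.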
Why Language Pipes
Distributed · Decentralized · OpenAI-compatible
Distributed Inference
Transformer layers are split across multiple machines over a peer-to-peer control plane, so a model too large for any one box runs across the network.
Architecture →
Decentralized Config
Only the node hosting the End model ever sees raw text. Each node can host its own End model, so there is no central authority or central configuration.
Configuration →
OpenAI-compatible API
A drop-in base_url swap for existing OpenAI client code. Point your
tools at a node and keep the SDK you already use.
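For tools that do not use the OpenAI SDK, the same swap can be done at the HTTP level. The sketch below builds a request in the standard OpenAI chat-completions wire format using only the Python standard library. It assumes the node serves the usual `/chat/completions` route under its base URL, which is inferred from "OpenAI-compatible" rather than confirmed here; `build_chat_request` is a hypothetical helper, and the address is a placeholder for a local node.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, api_key="not-needed"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",  # assumed route
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Placeholder address: replace with your node's IP and job port.
req = build_chat_request("http://127.0.0.1:8000/v1", "Qwen/Qwen3-1.7B", "Hello")
print(req.get_method(), req.full_url)
```

Sending it is a matter of passing `req` to `urllib.request.urlopen` once a node is running. Only the base URL differs from a request aimed at any other OpenAI-style endpoint.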
How it works
Layer models and the End model
Inference flows through a pipeline. The End model keeps the text-handling stages (tokenization, embedding, final norm and the output head) on one trusted node. The transformer layers in between are distributed across Layer models on other machines, which only ever see continuous-valued hidden-state tensors.
Each layer performs matrix multiplications between learned weights and a hidden-state
tensor, then passes the result down the pipe. Splitting where the layers are hosted
shares the memory cost and keeps text off every node but the one making the request.
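The division of labor described above can be sketched with a toy pipeline. This is an illustration under loud assumptions, not the project's actual classes: `EndModel`, `LayerSegment`, `run_pipeline`, the four-word vocabulary, and the 2x2 weight matrices are all made up. What it does show is that text enters and leaves only through the End model, while each segment receives and returns nothing but lists of floats.

```python
import math

VOCAB = ["<unk>", "hello", "pipes", "world"]  # toy vocabulary (assumption)


class EndModel:
    """Toy stand-in for the End model: the only object that touches text."""

    def tokenize(self, text):
        return [VOCAB.index(w) if w in VOCAB else 0 for w in text.lower().split()]

    def embed(self, token_ids):
        # One small vector per token; real embeddings are learned.
        return [[float(t), 1.0] for t in token_ids]

    def head(self, hidden):
        # Normalize the last position, score each vocabulary entry, pick the best.
        last = hidden[-1]
        norm = math.sqrt(sum(x * x for x in last)) or 1.0
        normed = [x / norm for x in last]
        scores = [normed[0] * i + normed[1] for i in range(len(VOCAB))]
        return VOCAB[scores.index(max(scores))]


class LayerSegment:
    """Toy stand-in for a Layer model: hidden states in, hidden states out."""

    def __init__(self, weights):
        self.weights = weights  # one 2x2 matrix per hosted layer

    def forward(self, hidden):
        for w in self.weights:
            # Matrix-vector product applied to every position.
            hidden = [[sum(w[r][c] * v[c] for c in range(2)) for r in range(2)]
                      for v in hidden]
        return hidden


def run_pipeline(end, segments, text):
    hidden = end.embed(end.tokenize(text))  # text stays on the End node
    for seg in segments:                    # each hop carries floats only
        hidden = seg.forward(hidden)
    return end.head(hidden)                 # back to text on the End node


identity = [[1.0, 0.0], [0.0, 1.0]]
double = [[2.0, 0.0], [0.0, 2.0]]
segments = [LayerSegment([identity, double]), LayerSegment([double])]
print(run_pipeline(EndModel(), segments, "hello world"))
```

In a real deployment each `forward` call would be a network hop to another machine. The sketch keeps everything in one process so the data flow is easy to follow.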
See the job processor state machine →
Understand the threat model →
Quick start
Example
Distribute Qwen/Qwen3-1.7B across two machines. Node 1 hosts the End
model, so prompts and responses stay on Node 1. It also hosts as many layers
as fit in its memory, and Node 2 hosts the rest. Launch the TUI with
language-pipes and configure each node, then call it like any OpenAI endpoint:
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",  # node-1 IP + job port
    api_key="not-needed",
)
resp = client.chat.completions.create(
    model="Qwen/Qwen3-1.7B",
    messages=[{"role": "user",
               "content": "Write a haiku about distributed systems."}],
)
print(resp.choices[0].message.content)
Support
Supported models
Model families today
- Qwen3
- Phi 4
- Meta Llama 2 and 3
- Gemma 3 and 4
Fine-tunes of a supported base model should work too.
See all tested models →