Language Pipes — Peer-to-peer distributed inference for open-source language models

 _                                                 ____   _
| |                                               |  __`\(_)
| |     __ _  ___   ___  _   _  __ _  __ _  ___   | |__) | |_ __   ___  ___
| |    / _` |/ _ \ / _ `| | | |/ _` |/ _` |/ _ \  |  ___/| | '_ \ / _ \/ __|
| |___| (_| | | | | (_| | |_| | (_| | (_| |  __/  | |    | | |_) |  __/\__ \
|______\__,_|_| |_|\__, |\__,_|\__,_|\__, |\___|  |_|    |_| .__/ \___||___/
                    __/ |            __/ |                 | |
                   |___/            |___/                  |_|

Made with <3 by Erin

$ pip install language-pipes

Read the docs → GitHub PyPI

The idea

P2P inference for open-source LLMs

Language models run their input through a long stack of transformer layers. Language Pipes cuts that stack into segments and hands each segment to a different machine, so the memory cost is shared across a network you control. No single node needs to hold the whole model and no central server sits between your nodes. It's peer-to-peer, decentralized, and Python-native.
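As a rough illustration of cutting the stack into segments, the sketch below assigns contiguous layer ranges to nodes in order, up to each node's capacity. The function name, the node names, the 28-layer count, and the per-node capacities are all hypothetical, and this is not how Language Pipes itself decides placement (that is configured per node).

```python
from dataclasses import dataclass


@dataclass
class Segment:
    node: str   # hypothetical node name
    start: int  # first layer index (inclusive)
    end: int    # last layer index (exclusive)


def partition_layers(num_layers: int, capacities: dict[str, int]) -> list[Segment]:
    """Assign contiguous layer ranges to nodes, in dict order.

    `capacities` maps a node name to the number of layers it can hold.
    """
    if sum(capacities.values()) < num_layers:
        raise ValueError("the nodes cannot hold the whole model between them")
    segments = []
    start = 0
    for node, cap in capacities.items():
        if start >= num_layers:
            break  # every layer already has a host
        end = min(start + cap, num_layers)
        if end > start:
            segments.append(Segment(node, start, end))
        start = end
    return segments


# Illustrative numbers only: a 28-layer stack over two nodes.
plan = partition_layers(28, {"node-1": 12, "node-2": 20})
for seg in plan:
    print(f"{seg.node}: layers {seg.start}..{seg.end - 1}")
```

Neither node in this example could hold all 28 layers on its own, but together they cover the stack.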

Why Language Pipes

Distributed · Decentralized · OpenAI-compatible

⌗

Distributed Inference

Transformer layers are split across multiple machines over a peer-to-peer control plane, so a model too large for any one box runs across the network.

Architecture →

◈

Decentralized Config

Only the node hosting the End model ever sees raw text. Each node can host its own End model, so there is no central authority and no central configuration.

Configuration →

↯

OpenAI-compatible API

A drop-in base_url swap for existing OpenAI client code. Point your tools at a node and keep the SDK you already use.

API reference →

How it works

Layer models and the End model

Inference flows through a pipeline. The End model keeps the text-handling stages (tokenization, embedding, final norm and the output head) on one trusted node. The transformer layers in between are distributed across Layer models on other machines, which only ever see continuous-valued hidden-state tensors.

TOKENIZE (End model) → EMBED (End model) ⇢ LAYERS ×N (Layer models) ⇢ NORM (End model) → HEAD (End model)

End model: text stays here (one trusted node)
Layer models: opaque tensors only (distributed)
⇢ hidden-state tensor sent over the network

Each layer performs matrix multiplications between learned weights and a hidden-state tensor, then passes the result down the pipe. Hosting the layers on separate machines shares the memory cost and keeps text off every node except the one hosting the End model.
See the job processor state machine →
Understand the threat model →
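The flow above can be sketched as a toy pipeline in plain Python. Everything here is a stand-in: the four-word vocabulary, the sine-based embedding, and the tanh "layers" are assumptions chosen to keep the example self-contained, and none of these function names come from Language Pipes. The point is only which stage touches text and which stages see nothing but floats.

```python
import math

VOCAB = ["<unk>", "hello", "pipes", "world"]  # toy vocabulary
DIM = len(VOCAB)                              # toy hidden size


def tokenize(text):
    # End model: text -> token ids (unknown words map to 0)
    return [VOCAB.index(w) if w in VOCAB else 0 for w in text.lower().split()]


def embed(token_ids):
    # End model: token ids -> hidden-state vectors (deterministic toy embedding)
    return [[math.sin(t + 1 + d) for d in range(DIM)] for t in token_ids]


def make_layer(scale):
    # Layer model: takes and returns hidden states only, never text.
    # A real transformer layer does attention and MLP matrix multiplications.
    def layer(hidden):
        return [[x + scale * math.tanh(x) for x in row] for row in hidden]
    return layer


def norm(hidden):
    # End model: RMS-normalize each position
    out = []
    for row in hidden:
        rms = math.sqrt(sum(x * x for x in row) / len(row)) or 1.0
        out.append([x / rms for x in row])
    return out


def head(hidden):
    # End model: pick the highest-scoring vocab id at the last position
    last = hidden[-1]
    return max(range(len(VOCAB)), key=lambda i: last[i])


def run_pipeline(text, layer_nodes):
    hidden = embed(tokenize(text))       # stays on the End model node
    for layers in layer_nodes:           # each entry stands in for one machine
        for layer in layers:
            hidden = layer(hidden)       # only float tensors cross this boundary
    return VOCAB[head(norm(hidden))]     # back on the End model node


node_a = [make_layer(0.5), make_layer(0.25)]  # hypothetical machine A
node_b = [make_layer(0.1)]                    # hypothetical machine B
print(run_pipeline("hello pipes", [node_a, node_b]))
```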

Quick start

Example

Distribute Qwen/Qwen3-1.7B across two machines. Node 1 hosts the End model, which keeps prompts and responses on Node 1, plus as many layers as fit in its memory. Node 2 hosts the remaining layers. Launch the TUI with language-pipes and configure each node, then call it like any OpenAI endpoint:

node-1 — client.py

from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",  # node-1 IP + job port
    api_key="not-needed",
)

resp = client.chat.completions.create(
    model="Qwen/Qwen3-1.7B",
    messages=[{"role": "user",
               "content": "Write a haiku about distributed systems."}],
)
print(resp.choices[0].message.content)
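
For tools that do not use the OpenAI SDK, the same call can be made as a plain HTTP request. The sketch below builds the POST that the client above would send to the chat completions path under base_url. The helper name build_chat_request is hypothetical, and the send step is left commented out because it needs a running node.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000/v1"  # same node-1 address as client.py


def build_chat_request(base_url, model, prompt):
    # Hypothetical helper: builds, but does not send, a chat completions request.
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    BASE_URL,
    "Qwen/Qwen3-1.7B",
    "Write a haiku about distributed systems.",
)
print(req.get_method(), req.full_url)

# Sending requires a running node, so it is commented out here:
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```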

CLI reference → Configuration Full two-node walkthrough

Support

Supported models

Model families today

Qwen3
Phi 4
Meta Llama 2 and 3
Gemma 3 and 4

Fine-tunes of a supported base model should work too.
See all tested models →