MiniTen is a local-first inference serving platform for deploying Hugging Face LLMs as named vLLM workers on Kubernetes. It includes a Flask API, a server-rendered web dashboard, a CLI, a Postgres metadata store, a Kubernetes deployment worker, and OpenAI-compatible inference routes. It also supports Truss-style commands.
The current project is an implemented MVP. It is intended to run locally with Docker, Docker Compose, and kind, with a later production path toward OCI/OKE.
MiniTen lets you:
- Create users and log in.
- Create projects and manage project members.
- Create project-scoped API keys for inference.
- Deploy named model services backed by vLLM.
- Start, stop, scale, retry, sync, and delete model deployments.
- View model status, jobs, logs, analytics, and lifecycle events.
- Send OpenAI-compatible /v1/chat/completions requests.
- Use the same workflows from the web dashboard or miniten CLI.
- Run Truss-style commands.
The Flask app is the control plane and request router. vLLM pods do the actual model inference.
Install these before running the project locally:
- Python 3.12
- Docker Desktop or Docker Engine
- Docker Compose
- kind
- kubectl
- make
Package manager examples:
macOS with Homebrew:
brew install python@3.12 docker docker-compose kind kubectl make
Windows with Chocolatey:
choco install python docker-desktop docker-compose kubernetes-cli kind make -y
Debian/Ubuntu:
sudo apt-get update
sudo apt-get install -y python3 python3-pip make ca-certificates curl
# Docker Engine: https://docs.docker.com/engine/install/ubuntu/
# kubectl: https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/
# kind:
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.23.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind
This is the recommended local workflow.
make setup-web
make setup-web does the full local setup:
- Installs Python dependencies with Poetry.
- Starts Postgres with Docker Compose.
- Runs database migrations.
- Creates or reuses a local kind cluster named miniten.
- Installs/patches metrics-server for local HPA autoscaling.
- Starts the deployment worker with real Kubernetes access.
- Starts the Flask API/dashboard in the background.
- Opens the dashboard in your browser.
The dashboard runs at:
http://127.0.0.1:8000
To reopen or restart the dashboard after setup:
make open-dashboard
To stop the background API/dashboard, remove local Compose state, and stop the local kind node while preserving its PVC/cache data:
make clean-env
make clean-env stops the kind node container instead of deleting it. That frees Docker Desktop memory while preserving cached pod images and PVC data such as Hugging Face model downloads. Use make clean-kind or make clean-all when you want to delete the cluster and its cache.
Use this when you want the API server in the foreground:
make setup-env
make start-worker-real-k8s
make run-api
Then open:
http://127.0.0.1:8000
Important: make setup-env refuses to run while port 8000 is already
listening. Stop the API first, or run make clean-env.
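The port check described above can be reproduced by hand. This is a minimal sketch that assumes a plain TCP connect is a good enough test; the check the Makefile actually performs is not shown here, and port_is_listening is a hypothetical helper.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


# Example: decide whether the local API port is already taken.
if port_is_listening("127.0.0.1", 8000):
    print("Port 8000 is in use; stop the API or run make clean-env first.")
else:
    print("Port 8000 is free.")
```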
Common commands:
make install
make setup-env
make setup-web
make open-dashboard
make clean-env
make start-kind
make stop-kind
make clean-kind
make clean-all
make install-metrics-server
make migrate
make run-api
make run-worker
make run-worker-dry-run
make start-worker-real-k8s
make test
make test-local-apis
make test-local-k8s
make test-local-vllm
make test-local-truss-vllm
make test-local-vllm-gpu
make tests
make coverage
make lint
make compile
make clean
What the important targets do:
| Target | Purpose |
|---|---|
| make install | Installs Poetry and project dependencies. |
| make setup-env | Starts Postgres, runs migrations, creates kind, starts dry-run Compose workers. |
| make setup-web | Runs setup, switches Compose workers to real Kubernetes, starts dashboard/API, opens browser. |
| make open-dashboard | Starts or reopens the local dashboard/API. |
| make start-worker-real-k8s | Restarts Compose workers with WORKER_DRY_RUN=false. |
| make run-api | Runs Flask API/dashboard in the foreground. |
| make run-worker | Runs the deployment worker in the foreground using real Kubernetes. |
| make run-worker-dry-run | Runs the worker without mutating Kubernetes. |
| make clean-env | Stops local API, removes Compose services/volumes, stops kind, clears setup marker. |
| make start-kind | Starts an existing local kind node without deleting PVC/cache data. |
| make stop-kind | Stops the local kind node without deleting PVC/cache data. |
| make clean-kind | Deletes the local kind cluster. |
| make clean-all | Runs clean-env and clean-kind. |
| make install-metrics-server | Installs/patches metrics-server in the local kind cluster for HPA autoscaling. |
| make test | Runs the Python unit test suite. |
| make test-local-apis | Runs API smoke tests against a running local API. |
| make test-local-k8s | Tests real Kubernetes resource creation/deletion with a lightweight smoke worker. |
| make test-local-vllm | Deploys a real CPU vLLM pod and calls chat completions. |
| make test-local-truss-vllm | Runs the real CPU vLLM smoke path through truss login, truss init, and truss push. |
| make test-local-vllm-gpu | Runs the GPU vLLM smoke path when Kubernetes exposes nvidia.com/gpu. |
| make tests | Runs make test, make test-local-apis, make test-local-k8s, make test-local-vllm, and make test-local-truss-vllm. |
| make lint | Runs Ruff. |
After make setup-web, use the browser to:
- Register an account.
- Log in.
- Create a project.
- Create a project API key.
- Deploy a model.
- Watch the model page for status, jobs, logs, and analytics.
- Send an inference request from the Inference page.
For a low-memory local CPU deployment, use:
| Field | Value |
|---|---|
| Model name | small-llm |
| Hugging Face model ID | HuggingFaceTB/SmolLM2-135M-Instruct |
| Replicas | 1 |
| CPU request | 1 |
| CPU limit | 4 |
| Memory request | 1Gi |
| Memory limit | 6Gi |
| GPU count | 0 |
| Dtype | auto |
| Max model length | 256 |
| Autoscaling | false |
The first vLLM startup can take a while because the model image and model files may need to download.
The dashboard is organized around the same workflow exposed by the CLI: create an account, create a project, deploy a model, manage runtime state, and call the model through an OpenAI-compatible API.
The home page is the entry point for local dashboard users. From here you can create an account or log in to an existing MiniTen account.
The account creation page registers a dashboard user. User accounts own and manage projects, API keys, project members, and model deployments.
The login page creates the browser session used for project and model management. It preserves your email if login fails so retrying does not clear the form.
The Projects page lists the projects you can access, their Kubernetes namespaces, and your role in each project. It also lets you create new projects. A project is the main isolation boundary for API keys, members, model deployments, analytics, and Kubernetes resources.
The project dashboard is the main workspace for a project. It shows project totals, model deployments, project API keys, and project members. From this page you can deploy a model, open model analytics, create or revoke API keys, invite members, change member roles, remove members, and delete the project.
API keys are project-scoped credentials used for inference requests. MiniTen shows the raw key only once when it is created. Later pages show safe metadata such as status and stored key prefix, but never the full credential.
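The show-once behavior can be illustrated with a keyed hash. This is a minimal sketch, not MiniTen's implementation: it assumes HMAC-SHA256 keyed by a secret such as API_KEY_HASH_SECRET, and the record fields and prefix length are hypothetical.

```python
import hashlib
import hmac
import secrets

KEY_PREFIX_LENGTH = 12  # hypothetical: leading characters kept for display


def create_api_key(hash_secret: bytes) -> tuple[str, dict]:
    """Return (raw_key, stored_record). Only stored_record is persisted."""
    raw_key = "mt_live_" + secrets.token_urlsafe(32)
    record = {
        "prefix": raw_key[:KEY_PREFIX_LENGTH],
        "key_hash": hmac.new(hash_secret, raw_key.encode(), hashlib.sha256).hexdigest(),
        "status": "active",
    }
    return raw_key, record


def verify_api_key(presented: str, record: dict, hash_secret: bytes) -> bool:
    """Recompute the HMAC and compare it in constant time."""
    candidate = hmac.new(hash_secret, presented.encode(), hashlib.sha256).hexdigest()
    return record["status"] == "active" and hmac.compare_digest(candidate, record["key_hash"])
```

Because only the prefix and the hash are stored, a lost key cannot be recovered; it has to be revoked and replaced.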
The model deployment page manages one named model service. It shows the current status, Kubernetes readiness details, deployment jobs, and model settings. You can start, stop, retry, sync, scale, update, delete, view logs, and open analytics for the model. The live status box is scrollable so Kubernetes diagnostics stay readable without expanding the whole page.
The analytics page shows request counts, success/error totals, average latency, p95 latency, recent requests, and lifecycle events. Recent request rows include the stored API key prefix, status code, latency, route, error type, and request time. MiniTen stores request metadata only; prompts and model responses are not persisted.
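For readers unfamiliar with p95, the latency figures can be computed as follows. This is a sketch under an assumption: it uses the nearest-rank percentile definition, and MiniTen may compute p95 differently. summarize_latencies is a hypothetical helper.

```python
from statistics import mean


def summarize_latencies(latencies_ms: list[float]) -> dict:
    """Compute the request count, average latency, and nearest-rank p95."""
    if not latencies_ms:
        return {"count": 0, "avg_ms": None, "p95_ms": None}
    ordered = sorted(latencies_ms)
    # Nearest rank: ceil(0.95 * n), done in integer arithmetic to avoid
    # floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "count": len(ordered),
        "avg_ms": mean(ordered),
        "p95_ms": ordered[rank - 1],
    }
```

With 20 samples, the p95 value is the 19th smallest latency, so a single slow outlier does not move it.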
The Inference page lets you test a deployed model from the browser with a project API key. It can stream OpenAI-compatible chat completion deltas as they arrive while also preserving the full HTTP response for inspection.
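The streamed deltas can be read with a few lines of parsing. This is a sketch that assumes the stream follows the usual OpenAI convention: server-sent event lines of the form data: {json}, ending with a data: [DONE] sentinel. iter_content_deltas is a hypothetical helper, not part of MiniTen.

```python
import json
from typing import Iterable, Iterator


def iter_content_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style chat.completion.chunk events."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # assumed end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

Joining the yielded fragments reconstructs the full assistant message, which is how a client can show incremental output and still keep the complete response.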
MiniTen can also be called from client code using an OpenAI-compatible base URL.
The notebook-style workflow uses the project API key, points the SDK at
http://127.0.0.1:8000/v1, and sends requests to the MiniTen model deployment
name.
Example notebooks are available in examples/minitendemo.ipynb and
examples/minitendemo_streaming.ipynb.
The Account page shows account metadata and contains the account deletion control. It also manages account API keys for user-level automation such as Truss-style commands. Account deletion is separated from project management so destructive account-level actions are not mixed into the project list. Deleting an account deletes its account API keys and also deletes projects where that account is the only owner; assign another owner first if a project should remain after the account is removed.
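The sole-owner rule is easy to check before deleting an account. This is a minimal sketch assuming you already have a mapping from project ID to the set of owner user IDs; the function name and data shape are hypothetical.

```python
def projects_deleted_with_account(
    user_id: str, owners_by_project: dict[str, set[str]]
) -> list[str]:
    """Return the projects that would be removed because user_id is their only owner."""
    return sorted(
        project_id
        for project_id, owners in owners_by_project.items()
        if owners == {user_id}
    )
```

Any project returned here needs a second owner assigned first if it should survive the account deletion.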
The Logs page shows recent Kubernetes pod logs for a model deployment. It is used when a deployment is loading, restarting, or failing and you need the raw vLLM startup/runtime output. The page supports selecting how many recent log lines to fetch and links back to the model deployment page so debugging can move between live status, jobs, and logs.
The miniten CLI uses the same HTTP API as the dashboard. Control-plane
commands use your user login token. Inference commands use a project API key.
Configure the local API URL:
python -m poetry run miniten config set-url http://127.0.0.1:8000
python -m poetry run miniten config show
Top-level help now includes command inputs:
python -m poetry run miniten -h
Current top-level help:
usage: miniten [-h]
{config,auth,projects,members,api-keys,models,inference,analytics}
...
MiniTen command-line client for the dashboard/API.
positional arguments:
{config,auth,projects,members,api-keys,models,inference,analytics}
options:
-h, --help show this help message and exit
command reference:
config set-url <url>
config show
auth register --email <email> [--password <password>]
auth login --email <email> [--password <password>]
auth logout
auth me
auth delete-user [--yes]
projects create <name>
projects list
projects get <project-id>
projects delete <project-id> [--yes]
members list <project-id>
members add <project-id> --email <email> --role {owner,member,viewer}
members update <project-id> <user-id> --role {owner,member,viewer}
members remove <project-id> <user-id>
api-keys create <project-id> <name> [--use]
api-keys list <project-id>
api-keys use <project-api-key>
api-keys revoke <project-id> <api-key-id>
account-api-keys create <name>
account-api-keys list
account-api-keys revoke <account-api-key-id>
models deploy <project-id> --name <name> --model-id <hf-model-id>
[--replicas <n>] [--cpu-request <value>] [--cpu-limit <value>]
[--memory-request <value>] [--memory-limit <value>] [--gpu-count <n>]
[--dtype <dtype>] [--max-model-len <tokens>]
[--autoscaling-enabled {true,false}] [--min-replicas <n>]
[--max-replicas <n>] [--target-cpu-utilization <percent>]
[--json <json-object>]
models list <project-id>
models get <project-id> <model-deployment-id>
models update <project-id> <model-deployment-id> [model settings options]
[--json <json-object>]
models start <project-id> <model-deployment-id>
models stop <project-id> <model-deployment-id>
models retry <project-id> <model-deployment-id>
models sync <project-id> <model-deployment-id>
models scale <project-id> <model-deployment-id> <replicas>
models hard-restart <project-id> <model-deployment-id> [--yes]
models delete <project-id> <model-deployment-id> [--yes]
models jobs <project-id> <model-deployment-id>
models status <project-id> <model-deployment-id>
models logs <project-id> <model-name> [--tail <lines>]
inference chat [--api-key <project-api-key>] [--model <name>]
[--prompt <text>] [--max-tokens <n>] [--temperature <float>]
[--stream] [--json <json-object>]
inference models [--api-key <project-api-key>]
analytics overview <project-id>
analytics metrics <project-id> <model-name> [--since <iso8601>]
analytics requests <project-id> <model-name> [--since <iso8601>]
[--limit <n>] [--status-code <code>]
analytics events <project-id> <model-name>
Run `miniten <group> <command> -h` for detailed help on one command.
Register and log in:
python -m poetry run miniten auth register --email user@example.com
python -m poetry run miniten auth login --email user@example.com
Create and inspect a project:
python -m poetry run miniten projects create "Personal Models"
python -m poetry run miniten projects list
Deploy a small CPU model:
python -m poetry run miniten models deploy <project-id> \
  --name small-llm \
  --model-id HuggingFaceTB/SmolLM2-135M-Instruct \
  --replicas 1 \
  --cpu-request 1 \
  --cpu-limit 4 \
  --memory-request 1Gi \
  --memory-limit 6Gi \
  --gpu-count 0 \
  --dtype auto \
  --max-model-len 256 \
  --autoscaling-enabled false
Check model state:
python -m poetry run miniten models list <project-id>
python -m poetry run miniten models jobs <project-id> <model-deployment-id>
python -m poetry run miniten models status <project-id> <model-deployment-id>
python -m poetry run miniten models logs <project-id> small-llm --tail 100
Create and save a project API key:
python -m poetry run miniten api-keys create <project-id> local-dev --use
python -m poetry run miniten api-keys list <project-id>
Create an account API key for user-level automation:
python -m poetry run miniten account-api-keys create truss-local
python -m poetry run miniten account-api-keys list
python -m poetry run miniten account-api-keys revoke <account-api-key-id>
Send inference:
python -m poetry run miniten inference chat \
  --model small-llm \
  --prompt "Say hello in one short sentence." \
  --max-tokens 32 \
  --temperature 0
Stream inference:
python -m poetry run miniten inference chat \
  --model small-llm \
  --prompt "Say hello in one short sentence." \
  --max-tokens 32 \
  --temperature 0 \
  --stream
MiniTen also ships a small truss command for a MiniTen-compatible YAML workflow. It uses account API keys because truss init can create projects.
Command summary:
| Command | Purpose |
|---|---|
| truss login [--api-key <key>] [--base-url <url>] | Stores a MiniTen account API key for later Truss-style commands. Without --api-key, it always prompts for the key. |
| truss init <project-name> [--model-name <name>] | Creates or reuses a MiniTen project, prompts for a deployment model_name when needed, and writes <project-name>/config.yaml. |
| truss push [--config <path>] [--no-watch] [--poll-interval <seconds>] | Deploys the current directory's config.yaml and watches for config changes by default. |
| truss watch [--config <path>] [--poll-interval <seconds>] | Reattaches to an existing Truss-style directory and queues update jobs when config.yaml changes. |
Log in with a MiniTen account API key:
python -m poetry run truss login
💻 Let's add a MiniTen remote!
🤫 Quietly paste your API_KEY:
truss login prompts unless --api-key is provided. You can pass the account API key and API URL non-interactively:
python -m poetry run truss login \
  --api-key "<miniten-account-api-key>" \
  --base-url http://127.0.0.1:8000
Create or initialize a project directory. If the project already exists and you are a member, MiniTen reuses it. Either way, the command writes qwen-2.5-3b/config.yaml.
python -m poetry run truss init qwen-2.5-3b
📦 Name this model: qwen-2-5-3b
Truss qwen-2-5-3b was created in ~/qwen-2.5-3b
cd qwen-2.5-3b
Edit config.yaml. The project is inferred from the current directory name, so the file does not need a project ID.
model_name: qwen-2-5-3b
model_id: HuggingFaceTB/SmolLM2-135M-Instruct
replicas: 1
resources:
  cpu_request: "2"
  cpu_limit: "4"
  memory_request: "4Gi"
  memory_limit: "12Gi"
  gpu_count: 0
vllm:
  dtype: auto
  max_model_len: 512
autoscaling:
  enabled: false
Deploy from the YAML:
python -m poetry run truss push
✨ Model qwen-2-5-3b was successfully pushed ✨
🪵 View logs for your deployment at http://127.0.0.1:8000/projects/<project-id>/models/qwen-2-5-3b/logs
👀 Watching for changes to truss...
truss push reads ./config.yaml, validates model_name, and uses the
current directory name as the project name. If that project does not exist, or
your account API key's user is not allowed to deploy into it, the command fails
with the API error message. After a successful push, the command keeps watching
config.yaml. When the file changes, MiniTen queues an update_model job
through the same deployment update pipeline used by the dashboard and
miniten models update.
You can override the config path or deploy once without watching:
python -m poetry run truss push --config config.yaml
python -m poetry run truss push --no-watch
If you stop the push session, reattach the watcher later:
python -m poetry run truss watch
For noninteractive init in CI or smoke tests, pass --model-name:
python -m poetry run truss init qwen-2.5-3b --model-name qwen-2-5-3b
model_name must use MiniTen deployment-name formatting: lowercase letters,
numbers, and hyphens only, starting and ending with a letter or number. If the
name is invalid, truss push prints the validation issue before sending the
request.
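The naming rule above can be checked locally before pushing. This is a minimal sketch of the stated rule only; any length limit MiniTen enforces is not covered, and model_name_issues is a hypothetical helper rather than the validator the CLI uses.

```python
import re


def model_name_issues(name: str) -> list[str]:
    """Return human-readable problems with a deployment name (empty list if valid)."""
    if not name:
        return ["model_name must not be empty"]
    issues = []
    if re.search(r"[^a-z0-9-]", name):
        issues.append("use lowercase letters, numbers, and hyphens only")
    if name[0] == "-" or name[-1] == "-":
        issues.append("start and end with a letter or number")
    return issues


print(model_name_issues("qwen-2-5-3b"))  # valid name
print(model_name_issues("qwen-2.5-3b"))  # the dot is not allowed
```

This is why the example project directory is named qwen-2.5-3b while its deployment is named qwen-2-5-3b.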
MiniTen's inference routes are compatible with the OpenAI Python SDK for both
standard and streaming chat completions. The model value is the MiniTen
deployment name, not the Hugging Face model ID.
Install the SDK if it is not already available:
python -m poetry add openai
Then call MiniTen like an OpenAI-compatible endpoint:
from openai import OpenAI
import os
client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key=os.environ["MINITEN_PROJECT_API_KEY"],
)
response = client.chat.completions.create(
    model="small-llm",
    messages=[
        {"role": "system", "content": "You are a concise technical writer."},
        {"role": "user", "content": "What is gradient descent?"},
        {
            "role": "assistant",
            "content": (
                "An optimization algorithm that iteratively adjusts model "
                "parameters by moving in the direction of steepest decrease "
                "in the loss function."
            ),
        },
        {"role": "user", "content": "How does the learning rate affect it?"},
    ],
    max_tokens=64,
    temperature=0,
)
print(response.choices[0].message.content)
For token-by-token streaming, pass stream=True and iterate over the chunks:
stream = client.chat.completions.create(
    model="small-llm",
    messages=[
        {"role": "user", "content": "Say hello in one short sentence."},
    ],
    max_tokens=64,
    temperature=0,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
You can also run the included compatibility smoke script:
export MINITEN_PROJECT_API_KEY="mt_live_..."
export MINITEN_MODEL="small-llm"
python -m poetry run python scripts/test_openai_sdk_chat.py
Or pass values directly:
python -m poetry run python scripts/test_openai_sdk_chat.py \
  --api-key "mt_live_..." \
  --model "small-llm"
Lifecycle commands:
python -m poetry run miniten models stop <project-id> <model-deployment-id>
python -m poetry run miniten models start <project-id> <model-deployment-id>
python -m poetry run miniten models hard-restart <project-id> <model-deployment-id>
python -m poetry run miniten models sync <project-id> <model-deployment-id>
python -m poetry run miniten models scale <project-id> <model-deployment-id> 1
python -m poetry run miniten models delete <project-id> <model-deployment-id>
Analytics:
python -m poetry run miniten analytics overview <project-id>
python -m poetry run miniten analytics metrics <project-id> small-llm
python -m poetry run miniten analytics requests <project-id> small-llm --limit 20
python -m poetry run miniten analytics events <project-id> small-llm
The CLI stores local state in ~/.miniten/config.json by default. Override settings with:
MINITEN_API_URL
MINITEN_ACCESS_TOKEN
MINITEN_PROJECT_API_KEY
MINITEN_CLI_CONFIG
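The override order can be sketched as "file first, environment on top". This is illustrative only: the JSON key names (api_url, access_token, project_api_key) are assumptions, since the layout of config.json is not documented here, and load_settings is a hypothetical helper.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_CONFIG_PATH = Path.home() / ".miniten" / "config.json"

# Hypothetical mapping from environment variable to config-file key.
ENV_TO_KEY = {
    "MINITEN_API_URL": "api_url",
    "MINITEN_ACCESS_TOKEN": "access_token",
    "MINITEN_PROJECT_API_KEY": "project_api_key",
}


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the config file, then let environment variables override its values."""
    env = os.environ if env is None else env
    # MINITEN_CLI_CONFIG relocates the config file itself.
    path = Path(env.get("MINITEN_CLI_CONFIG") or DEFAULT_CONFIG_PATH)
    settings = json.loads(path.read_text()) if path.exists() else {}
    for variable, key in ENV_TO_KEY.items():
        if env.get(variable):
            settings[key] = env[variable]
    return settings
```

Passing env explicitly keeps the function easy to test without touching the real environment.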
All JSON API routes are under /v1.
Main route groups:
| API Group | Purpose |
|---|---|
| Users | Account creation, lookup, deletion. |
| Auth | Login/logout and user token creation. |
| Account API Keys | User-scoped automation key creation/listing/revocation. |
| Projects | Project creation, listing, lookup, deletion. |
| Members | Project membership management. |
| API Keys | Project-scoped inference key creation/listing/revocation. |
| Models | Model deploy/update/start/stop/scale/delete/status/logs. |
| Analytics | Project/model metrics, requests, and lifecycle events. |
| Inference | OpenAI-compatible model calls. |
Example inference request:
POST /v1/chat/completions
Authorization: Bearer <project-api-key>
Content-Type: application/json

{
  "model": "small-llm",
  "messages": [
    {
      "role": "user",
      "content": "Say hello in one short sentence."
    }
  ],
  "max_tokens": 32,
  "temperature": 0
}
MiniTen also proxies streaming chat completions using server-sent events. Add
"stream": true to receive incremental chat.completion.chunk responses from
the deployed model:
{
  "model": "small-llm",
  "messages": [
    {
      "role": "user",
      "content": "Say hello in one short sentence."
    }
  ],
  "max_tokens": 32,
  "temperature": 0,
  "stream": true
}
MiniTen uses one local Kubernetes cluster as shared infrastructure. The cluster is not model-specific.
For each project, MiniTen creates a namespace:
Namespace/miniten-<project-slug>
For each model deployment, MiniTen creates:
Deployment/<model-name>-v1
Service/<model-name>
PVC/<model-name>-hf-cache
HPA/<model-name>-v1, when autoscaling is enabled
Secret/<model-name>-secrets, when credentials are configured
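The naming scheme above is mechanical, so it can be derived from the project slug and model name when you need to find resources with kubectl. This is a small sketch; resource_names is a hypothetical helper and only encodes the patterns listed above.

```python
def resource_names(
    project_slug: str,
    model_name: str,
    autoscaling: bool = False,
    has_credentials: bool = False,
) -> dict[str, str]:
    """Map each Kubernetes resource kind to the name used for a model deployment."""
    names = {
        "Namespace": f"miniten-{project_slug}",
        "Deployment": f"{model_name}-v1",
        "Service": model_name,
        "PVC": f"{model_name}-hf-cache",
    }
    if autoscaling:
        names["HPA"] = f"{model_name}-v1"  # only when autoscaling is enabled
    if has_credentials:
        names["Secret"] = f"{model_name}-secrets"  # only when credentials are configured
    return names


print(resource_names("personal-models", "small-llm", autoscaling=True))
```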
Deleting a model deployment removes the model runtime resources such as HPA, Service, Deployment, and Secret. The model cache PVC is retained by default so model files do not need to be downloaded again after retries or redeploys.
Deleting a project removes the project namespace and the Kubernetes resources inside it.
Deleting the whole local cluster is a development-environment operation:
make clean-kind
Run the unit tests only:
make test
Run the full non-GPU local test suite:
make setup-env
make run-api
Then, in another terminal:
make tests
make tests runs the unit tests plus API, real Kubernetes, CPU vLLM, and Truss
CPU vLLM smoke tests. It does not run the GPU smoke target.
API smoke test:
make setup-env
make run-api
make test-local-apis
Real Kubernetes smoke test:
make setup-env
make run-api
make test-local-k8s
CPU vLLM smoke test:
make setup-env
make run-api
make test-local-vllm
Truss CPU vLLM smoke test:
make setup-env
make run-api
make test-local-truss-vllm
GPU vLLM smoke test:
make setup-env
make run-api
make test-local-vllm-gpu
The GPU smoke path requires Kubernetes to advertise allocatable
nvidia.com/gpu. Docker Desktop GPU support alone is not enough for kind pods,
because kind runs pod containers through containerd inside the kind node.
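To confirm whether the cluster advertises GPUs, you can inspect the node list. This sketch assumes the input is the parsed JSON from kubectl get nodes -o json, where each node reports extended resources as strings under status.allocatable; allocatable_gpus is a hypothetical helper.

```python
def allocatable_gpus(node_list: dict) -> int:
    """Sum the allocatable nvidia.com/gpu count across all nodes in a node list."""
    total = 0
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Resource quantities arrive as strings such as "1"; a missing key means no GPUs.
        total += int(allocatable.get("nvidia.com/gpu", 0))
    return total


# Sample shaped like kubectl output for a CPU-only kind node.
sample = {"items": [{"status": {"allocatable": {"cpu": "8", "memory": "16Gi"}}}]}
print(allocatable_gpus(sample))
```

A result of zero means the GPU smoke target has nothing to schedule onto, regardless of what Docker itself can see.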
List namespaces:
kubectl get ns
List project resources:
kubectl get all,pvc,hpa,secret -n <project-namespace>
Watch model pods:
kubectl get pods -n <project-namespace> -w
View recent events:
kubectl get events -A --sort-by=.lastTimestamp
View model logs:
kubectl logs -n <project-namespace> deploy/<model-name>-v1 --tail=200
Port-forward a model service manually:
kubectl port-forward -n <project-namespace> svc/<model-name> 18080:8000
If make setup-env refuses to run because port 8000 is already listening, the API/dashboard is already running. Stop it:
make clean-env
Or manually stop the process using port 8000.
Check the model page in the dashboard or run:
python -m poetry run miniten models status <project-id> <model-deployment-id>
python -m poetry run miniten models jobs <project-id> <model-deployment-id>
python -m poetry run miniten models logs <project-id> <model-name> --tail 100
If the model is still starting, wait for vLLM to download/load the model.
Use a smaller max model length. For local CPU testing, start with:
256
Use an instruct/chat model for /v1/chat/completions, such as:
HuggingFaceTB/SmolLM2-135M-Instruct
Base/non-chat models may not support chat completions without a tokenizer chat template.
Use lower local settings:
memory limit: 6Gi
max model length: 256
autoscaling: false
The project also defaults VLLM_CPU_MEMORY_UTILIZATION low for local CPU use.
If kind pods cannot see the GPU, that is expected on Docker Desktop. Docker can expose the GPU to a direct container, but kind pods run inside the kind node container through containerd. The NVIDIA runtime injection does not automatically propagate into that nested runtime.
Use CPU vLLM locally, or use a Linux/WSL Kubernetes setup with NVIDIA container runtime and the NVIDIA device plugin configured.
Common environment variables:
| Variable | Purpose |
|---|---|
| DATABASE_URL | Postgres connection URL. |
| SECRET_KEY | Flask/session/JWT signing secret. |
| API_KEY_HASH_SECRET | HMAC secret for project API keys. |
| KUBECONFIG_DIR | Directory used by local worker kubeconfig. |
| WORKER_DRY_RUN | When true, worker marks jobs without changing Kubernetes. |
| WORKER_REPLICAS | Number of Compose deployment workers, defaults to 2. |
| HUGGING_FACE_TOKEN | Optional token for private/gated Hugging Face models. |
| VLLM_CPU_MEMORY_UTILIZATION | CPU KV-cache reservation tuning for vLLM. |
| LOG_LEVEL | Logging level, defaults to INFO. |
See .env.example for the local defaults used by the Makefile and Docker
Compose.
app/
  routes/      Flask API and dashboard routes
  services/    Business logic and deployment worker
  db/          SQL loader, pool, migrations
  k8s/         Kubernetes manifests and client helpers
  security/    Passwords, tokens, API key hashing
  templates/   Dashboard templates
  static/      Dashboard CSS/JS
  utils/       Errors, logging, time, validation helpers
migrations/    Raw SQL migrations
scripts/       Local setup, dashboard, smoke tests
tests/         Unit and smoke-style tests
docs/          Diagrams and design notes
examples/      Demo .ipynb files
Postgres stores metadata, jobs, API key hashes, lifecycle events, and inference request metadata. It does not store raw API keys, passwords, prompts, model responses, or model weights.
Model weights live on Hugging Face and are cached in Kubernetes PVCs.