Set up llama.cpp as a local AI provider for WordPress.
Install on macOS using Homebrew:
brew install llama.cpp
Verify it works:
llama-server --help
You can also build from source:
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release
GGUF is the binary format llama.cpp uses. Models are available on Hugging Face in different quantization levels:
| Quantization | Size | Quality | Speed |
|---|---|---|---|
Q2_K | Smallest | Lower | Fastest |
Q4_K_M | Small | Good | Fast |
Q5_K_M | Medium | Better | Moderate |
Q8_0 | Largest | Best | Slowest |
Install the Hugging Face CLI (also known as hf) and download a model:
pip install -U huggingface_hub
huggingface-cli download \
TheBloke/TinyLlama-1.1B-Chat-GGUF \
tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
--local-dir ~/models \
--local-dir-use-symlinks False
Verify the download:
ls -lh ~/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
| Model | Size | Good for |
|---|---|---|
| TinyLlama 1.1B (Q4_K_M) | ~636 MB | Testing, low-resource machines |
| Phi-3 Mini 3.8B (Q4_K_M) | ~2.2 GB | Balance of speed and quality |
| Mistral 7B (Q4_K_M) | ~4.1 GB | High quality, needs more RAM |
| Llama 3 8B (Q4_K_M) | ~4.7 GB | Best quality, needs GPU or 16GB+ RAM |
Start the llama.cpp server with your models directory:
llama-server --models-dir ~/models
| Flag | What it does |
|---|---|
--models-dir | Directory containing GGUF files (auto-loads all) |
-c | Context size in tokens (512, 2048, 4096…) |
-t | CPU threads for inference |
-b | Batch size for prompt processing |
--host | 127.0.0.1 = localhost only, 0.0.0.0 = all interfaces |
--port | HTTP server port (default 8080) |
--api-key | Comma-separated API keys for authentication |
--n-gpu-layers | Offload layers to GPU (faster on supported hardware) |
Test the server:
curl http://127.0.0.1:8080/v1/models
You should see a JSON response listing the loaded model(s). You can also open http://127.0.0.1:8080/ in your browser to see the built-in chat UI.
Set the Server URL in the plugin settings based on where llama.cpp is running relative to your WordPress site.
| Setup | Server URL | Best for |
|---|---|---|
| Same Machine | http://127.0.0.1:8080 | Local development, simplest setup |
| Same Network | http://192.168.x.x:8080 | Dedicated GPU machine on your LAN |
| Remote Server | https://your-tunnel.trycloudflare.com | Cloud server or sharing with others |
WordPress and llama.cpp run on the same computer. This is the simplest setup.
Start the server:
llama-server --models-dir ~/models
Plugin setting: Set the Server URL to http://127.0.0.1:8080 (or leave empty for the default).
llama.cpp runs on a different machine on your local network (e.g. a desktop with a GPU).
Step 1 — Start the server on the machine with the model. Use --host 0.0.0.0 to accept network connections:
llama-server \
--models-dir ~/models \
--host 0.0.0.0
Step 2 — Find the server's local IP:
# macOS
ipconfig getifaddr en0
# Linux
hostname -I
Step 3 — Test from the WordPress machine:
curl http://192.168.x.x:8080/v1/models
Plugin setting: Set the Server URL to http://<server-ip>:8080 (e.g. http://192.168.1.50:8080).
llama.cpp runs on a remote machine (cloud VPS, office server, etc.) and is exposed to the internet via a secure tunnel.
Step 1 — Start the server with authentication:
llama-server \
--models-dir ~/models \
--host 0.0.0.0 \
--api-key your-secret-key
--api-key when exposing the server to the internet. Never run a public server without authentication.
Step 2 — Create a tunnel (if no public IP). Pick one of the options below:
Cloudflare Tunnel provides a free, stable URL with built-in DDoS protection.
Install cloudflared:
# macOS (Homebrew)
brew install cloudflared
# Linux (Debian/Ubuntu)
curl -L https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb -o cloudflared.deb
sudo dpkg -i cloudflared.deb
Quick tunnel (no account needed):
cloudflared tunnel --url http://localhost:8080
This gives you a public HTTPS URL like https://something-random.trycloudflare.com.
Named tunnel (stable URL, requires a free Cloudflare account and a domain):
# One-time setup
cloudflared tunnel login
cloudflared tunnel create llama
cloudflared tunnel route dns llama llama.yourdomain.com
# Run the tunnel
cloudflared tunnel run --url http://localhost:8080 llama
ngrok is a popular alternative that requires a free account.
Install ngrok:
# macOS (Homebrew)
brew install ngrok
# Linux (snap)
snap install ngrok
# Or download from https://ngrok.com/download
Set up and run:
Sign up at ngrok.com and add your auth token:
ngrok config add-authtoken YOUR_AUTH_TOKEN
ngrok http 8080
This gives you a public HTTPS URL like https://abc123.ngrok-free.app.
Step 3 — Plugin setting: Set the Server URL to your tunnel URL (e.g. https://something-random.trycloudflare.com or https://abc123.ngrok-free.app).
--api-key when sharing the server--n-gpu-layers for GPU offloading on supported hardware-c only as needed — larger context uses more memory