llama.cpp Provider
lgrammel/ai-sdk-llama-cpp is a community provider that enables local LLM inference using llama.cpp directly within Node.js via native C++ bindings.
This provider loads llama.cpp directly into Node.js memory, eliminating the need for an external server while providing native performance and GPU acceleration.
Features
- Native Performance: Direct C++ bindings using node-addon-api (N-API)
- GPU Acceleration: Automatic Metal support on macOS
- Streaming & Non-streaming: Full support for both `generateText` and `streamText`
- Structured Output: Generate JSON objects with schema validation using `generateObject`
- Embeddings: Generate embeddings with `embed` and `embedMany`
- Chat Templates: Automatic or configurable chat template formatting (llama3, chatml, gemma, etc.)
- GGUF Support: Load any GGUF-format model
This provider currently only supports macOS (Apple Silicon or Intel). Windows and Linux are not supported.
Prerequisites
Before installing, ensure you have the following:
- macOS (Apple Silicon or Intel)
- Node.js >= 18.0.0
- CMake >= 3.15
- Xcode Command Line Tools
```bash
# Install Xcode Command Line Tools (includes Clang)
xcode-select --install

# Install CMake via Homebrew
brew install cmake
```
Setup
The llama.cpp provider is available in the ai-sdk-llama-cpp module. You can install it with:
```bash
pnpm add ai-sdk-llama-cpp
```
The installation will automatically compile llama.cpp as a static library with Metal support and build the native Node.js addon.
Provider Instance
You can import llamaCpp from ai-sdk-llama-cpp and create a model instance:
```ts
import { llamaCpp } from 'ai-sdk-llama-cpp';

const model = llamaCpp({
  modelPath: './models/llama-3.2-1b-instruct.Q4_K_M.gguf',
});
```
Configuration Options
You can customize the model instance with the following options:
- `modelPath` string (required)
  Path to the GGUF model file.
- `contextSize` number
  Maximum context size. Default: `2048`.
- `gpuLayers` number
  Number of layers to offload to GPU. Default: `99` (all layers). Set to `0` to disable GPU.
- `threads` number
  Number of CPU threads. Default: `4`.
- `debug` boolean
  Enable verbose debug output from llama.cpp. Default: `false`.
- `chatTemplate` string
  Chat template to use for formatting messages. Default: `"auto"` (uses the template embedded in the GGUF model file). Available templates include: `llama3`, `chatml`, `gemma`, `mistral-v1`, `mistral-v3`, `phi3`, `phi4`, `deepseek`, and more.
```ts
const model = llamaCpp({
  modelPath: './models/your-model.gguf',
  contextSize: 4096,
  gpuLayers: 99,
  threads: 8,
  chatTemplate: 'llama3',
});
```
Language Models
Text Generation
You can use llama.cpp models to generate text with the generateText function:
```ts
import { generateText } from 'ai';
import { llamaCpp } from 'ai-sdk-llama-cpp';

const model = llamaCpp({
  modelPath: './models/llama-3.2-1b-instruct.Q4_K_M.gguf',
});

try {
  const { text } = await generateText({
    model,
    prompt: 'Explain quantum computing in simple terms.',
  });

  console.log(text);
} finally {
  await model.dispose();
}
```
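If you prefer chat-style input, the standard AI SDK `messages` format also works, and the provider formats it using the configured chat template. A minimal sketch (the model path and prompt content are placeholders):

```ts
import { generateText } from 'ai';
import { llamaCpp } from 'ai-sdk-llama-cpp';

// Placeholder model path for illustration.
const model = llamaCpp({
  modelPath: './models/llama-3.2-1b-instruct.Q4_K_M.gguf',
});

try {
  const { text } = await generateText({
    model,
    // Standard AI SDK message format; the chat template handles the role markers.
    messages: [
      { role: 'system', content: 'You are a concise assistant.' },
      { role: 'user', content: 'Summarize what a GGUF file is in one sentence.' },
    ],
  });

  console.log(text);
} finally {
  await model.dispose();
}
```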
Streaming
The provider fully supports streaming with `streamText`:
```ts
import { streamText } from 'ai';
import { llamaCpp } from 'ai-sdk-llama-cpp';

const model = llamaCpp({
  modelPath: './models/llama-3.2-1b-instruct.Q4_K_M.gguf',
});

try {
  const result = streamText({
    model,
    prompt: 'Write a haiku about programming.',
  });

  for await (const chunk of result.textStream) {
    process.stdout.write(chunk);
  }
} finally {
  await model.dispose();
}
```
Structured Output
Generate type-safe JSON objects that conform to a schema using generateObject:
```ts
import { generateObject } from 'ai';
import { z } from 'zod';
import { llamaCpp } from 'ai-sdk-llama-cpp';

const model = llamaCpp({
  modelPath: './models/your-model.gguf',
});

try {
  const { object: recipe } = await generateObject({
    model,
    schema: z.object({
      name: z.string(),
      ingredients: z.array(
        z.object({
          name: z.string(),
          amount: z.string(),
        }),
      ),
      steps: z.array(z.string()),
    }),
    prompt: 'Generate a recipe for chocolate chip cookies.',
  });

  console.log(recipe);
} finally {
  await model.dispose();
}
```
The structured output feature uses GBNF grammar constraints to ensure the model generates valid JSON that conforms to your schema.
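Because the grammar is derived from your schema, the same approach works for tightly constrained outputs such as classification. A sketch, assuming a placeholder model path (how strictly enum values are enforced depends on the provider's schema-to-grammar conversion):

```ts
import { generateObject } from 'ai';
import { z } from 'zod';
import { llamaCpp } from 'ai-sdk-llama-cpp';

// Placeholder model path.
const model = llamaCpp({
  modelPath: './models/your-model.gguf',
});

try {
  const { object } = await generateObject({
    model,
    // The schema restricts `sentiment` to three allowed values.
    schema: z.object({
      sentiment: z.enum(['positive', 'neutral', 'negative']),
      confidence: z.number(),
    }),
    prompt: 'Classify the sentiment of: "The build failed again."',
  });

  console.log(object.sentiment);
} finally {
  await model.dispose();
}
```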
Generation Parameters
Standard AI SDK generation parameters are supported:
```ts
const { text } = await generateText({
  model,
  prompt: 'Hello!',
  maxTokens: 256,
  temperature: 0.7,
  topP: 0.9,
  topK: 40,
  stopSequences: ['\n'],
});
```
Embedding Models
You can create embedding models using the llamaCpp.embedding() factory method:
```ts
import { embed, embedMany } from 'ai';
import { llamaCpp } from 'ai-sdk-llama-cpp';

const model = llamaCpp.embedding({
  modelPath: './models/nomic-embed-text-v1.5.Q4_K_M.gguf',
});

try {
  const { embedding } = await embed({
    model,
    value: 'Hello, world!',
  });

  const { embeddings } = await embedMany({
    model,
    values: ['Hello, world!', 'Goodbye, world!'],
  });
} finally {
  model.dispose();
}
```
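The resulting vectors work with the AI SDK's embedding utilities. For example, you could compare two values with `cosineSimilarity`. A sketch, assuming the same placeholder embedding model path:

```ts
import { cosineSimilarity, embedMany } from 'ai';
import { llamaCpp } from 'ai-sdk-llama-cpp';

// Placeholder embedding model path.
const model = llamaCpp.embedding({
  modelPath: './models/nomic-embed-text-v1.5.Q4_K_M.gguf',
});

try {
  const { embeddings } = await embedMany({
    model,
    values: ['cats are great pets', 'dogs are loyal companions'],
  });

  // cosineSimilarity is a standard AI SDK helper for comparing vectors.
  console.log(cosineSimilarity(embeddings[0], embeddings[1]));
} finally {
  model.dispose();
}
```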
Model Downloads
You'll need to download GGUF-format models separately. Popular sources:
- Hugging Face - Search for GGUF models
- TheBloke's Models - Popular quantized models
Example download:
```bash
# Create models directory
mkdir -p models

# Download a model (example: Llama 3.2 1B)
wget -P models/ https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf
```
Resource Management
Always call model.dispose() when done to unload the model and free GPU/CPU
resources. This is especially important when loading multiple models to
prevent memory leaks.
```ts
const model = llamaCpp({
  modelPath: './models/your-model.gguf',
});

try {
  // Use the model...
} finally {
  await model.dispose();
}
```
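When more than one model is needed in the same process, disposing of each model as soon as you are done with it keeps memory usage bounded. A sketch, assuming two placeholder model files:

```ts
import { generateText } from 'ai';
import { llamaCpp } from 'ai-sdk-llama-cpp';

// Placeholder model paths for illustration.
const chatModel = llamaCpp({ modelPath: './models/chat-model.gguf' });

try {
  const { text } = await generateText({ model: chatModel, prompt: 'Hello!' });
  console.log(text);
} finally {
  // Free the first model before loading the next one.
  await chatModel.dispose();
}

const summaryModel = llamaCpp({ modelPath: './models/summary-model.gguf' });

try {
  const { text } = await generateText({
    model: summaryModel,
    prompt: 'Summarize the benefits of local inference.',
  });
  console.log(text);
} finally {
  await summaryModel.dispose();
}
```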
Limitations
- macOS only: Windows and Linux are not supported
- No tool/function calling: Tool calls are not supported
- No image inputs: Only text prompts are supported
Additional Resources
- GitHub Repository
- npm Package
- llama.cpp - The underlying inference engine