llama.cpp Provider

lgrammel/ai-sdk-llama-cpp is a community provider that enables local LLM inference using llama.cpp directly within Node.js via native C++ bindings.

This provider loads llama.cpp directly into the Node.js process, eliminating the need for an external server while providing native performance and GPU acceleration.

Features

  • Native Performance: Direct C++ bindings using node-addon-api (N-API)
  • GPU Acceleration: Automatic Metal support on macOS
  • Streaming & Non-streaming: Full support for both generateText and streamText
  • Structured Output: Generate JSON objects with schema validation using generateObject
  • Embeddings: Generate embeddings with embed and embedMany
  • Chat Templates: Automatic or configurable chat template formatting (llama3, chatml, gemma, etc.)
  • GGUF Support: Load any GGUF-format model

This provider currently only supports macOS (Apple Silicon or Intel). Windows and Linux are not supported.

Prerequisites

Before installing, ensure you have the following:

  • macOS (Apple Silicon or Intel)
  • Node.js >= 18.0.0
  • CMake >= 3.15
  • Xcode Command Line Tools

# Install Xcode Command Line Tools (includes Clang)
xcode-select --install
# Install CMake via Homebrew
brew install cmake

Setup

The llama.cpp provider is available in the ai-sdk-llama-cpp module. You can install it with:

pnpm add ai-sdk-llama-cpp

The installation will automatically compile llama.cpp as a static library with Metal support and build the native Node.js addon.

Provider Instance

You can import llamaCpp from ai-sdk-llama-cpp and create a model instance:

import { llamaCpp } from 'ai-sdk-llama-cpp';

const model = llamaCpp({
  modelPath: './models/llama-3.2-1b-instruct.Q4_K_M.gguf',
});

Configuration Options

You can customize the model instance with the following options:

  • modelPath string (required)

    Path to the GGUF model file.

  • contextSize number

    Maximum context size in tokens. Default: 2048.

  • gpuLayers number

    Number of layers to offload to GPU. Default: 99 (all layers). Set to 0 to disable GPU (see the CPU-only sketch after the example below).

  • threads number

    Number of CPU threads. Default: 4.

  • debug boolean

    Enable verbose debug output from llama.cpp. Default: false.

  • chatTemplate string

    Chat template to use for formatting messages. Default: "auto" (uses the template embedded in the GGUF model file). Available templates include: llama3, chatml, gemma, mistral-v1, mistral-v3, phi3, phi4, deepseek, and more.

const model = llamaCpp({
  modelPath: './models/your-model.gguf',
  contextSize: 4096,
  gpuLayers: 99,
  threads: 8,
  chatTemplate: 'llama3',
});
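
For example, to run inference entirely on the CPU (as noted for the gpuLayers option above), a configuration sketch might look like this:

// CPU-only configuration sketch: setting gpuLayers to 0 keeps all layers on the CPU
const cpuModel = llamaCpp({
  modelPath: './models/your-model.gguf',
  gpuLayers: 0, // disable Metal offloading
  threads: 8,
});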

Language Models

Text Generation

You can use llama.cpp models to generate text with the generateText function:

import { generateText } from 'ai';
import { llamaCpp } from 'ai-sdk-llama-cpp';

const model = llamaCpp({
  modelPath: './models/llama-3.2-1b-instruct.Q4_K_M.gguf',
});

try {
  const { text } = await generateText({
    model,
    prompt: 'Explain quantum computing in simple terms.',
  });
  console.log(text);
} finally {
  await model.dispose();
}

Streaming

The provider fully supports streaming with streamText:

import { streamText } from 'ai';
import { llamaCpp } from 'ai-sdk-llama-cpp';

const model = llamaCpp({
  modelPath: './models/llama-3.2-1b-instruct.Q4_K_M.gguf',
});

try {
  const result = streamText({
    model,
    prompt: 'Write a haiku about programming.',
  });

  for await (const chunk of result.textStream) {
    process.stdout.write(chunk);
  }
} finally {
  await model.dispose();
}

Structured Output

Generate type-safe JSON objects that conform to a schema using generateObject:

import { generateObject } from 'ai';
import { z } from 'zod';
import { llamaCpp } from 'ai-sdk-llama-cpp';

const model = llamaCpp({
  modelPath: './models/your-model.gguf',
});

try {
  const { object: recipe } = await generateObject({
    model,
    schema: z.object({
      name: z.string(),
      ingredients: z.array(
        z.object({
          name: z.string(),
          amount: z.string(),
        }),
      ),
      steps: z.array(z.string()),
    }),
    prompt: 'Generate a recipe for chocolate chip cookies.',
  });
  console.log(recipe);
} finally {
  await model.dispose();
}

The structured output feature uses GBNF grammar constraints to ensure the model generates valid JSON that conforms to your schema.
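
As an illustration, the same grammar-constrained generation works for smaller, classification-style schemas as well. The following sketch reuses the model instance from the example above; the schema and prompt are made up for this example:

import { generateObject } from 'ai';
import { z } from 'zod';

// Hypothetical classification schema: the grammar constrains the output
// to one of the listed sentiment values plus a numeric confidence.
const { object } = await generateObject({
  model,
  schema: z.object({
    sentiment: z.enum(['positive', 'neutral', 'negative']),
    confidence: z.number().min(0).max(1),
  }),
  prompt: 'Classify the sentiment of: "The build finished without errors."',
});

console.log(object.sentiment, object.confidence);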

Generation Parameters

Standard AI SDK generation parameters are supported:

const { text } = await generateText({
  model,
  prompt: 'Hello!',
  maxTokens: 256,
  temperature: 0.7,
  topP: 0.9,
  topK: 40,
  stopSequences: ['\n'],
});

Embedding Models

You can create embedding models using the llamaCpp.embedding() factory method:

import { embed, embedMany } from 'ai';
import { llamaCpp } from 'ai-sdk-llama-cpp';

const model = llamaCpp.embedding({
  modelPath: './models/nomic-embed-text-v1.5.Q4_K_M.gguf',
});

try {
  const { embedding } = await embed({
    model,
    value: 'Hello, world!',
  });

  const { embeddings } = await embedMany({
    model,
    values: ['Hello, world!', 'Goodbye, world!'],
  });
} finally {
  model.dispose();
}
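
You can then compare the resulting vectors, for example with the cosineSimilarity helper exported by the ai package. A sketch that reuses the embeddings from the example above:

import { cosineSimilarity } from 'ai';

// Compare the two embeddings returned by embedMany
const similarity = cosineSimilarity(embeddings[0], embeddings[1]);
console.log('similarity:', similarity);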

Model Downloads

You'll need to download GGUF-format models separately, for example from Hugging Face.

Example download:

# Create models directory
mkdir -p models
# Download a model (example: Llama 3.2 1B)
wget -P models/ https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf

Resource Management

Always call model.dispose() when done to unload the model and free GPU/CPU resources. This is especially important when loading multiple models to prevent memory leaks.

const model = llamaCpp({
  modelPath: './models/your-model.gguf',
});

try {
  // Use the model...
} finally {
  await model.dispose();
}
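
When you work with several models in one process, dispose of each instance as soon as you are done with it so its memory is released before the next model loads. A sketch with placeholder model file names:

// Load, use, and dispose of each model in turn
const chatModel = llamaCpp({ modelPath: './models/chat-model.gguf' });
try {
  // ... run generateText or streamText with chatModel ...
} finally {
  await chatModel.dispose();
}

const summaryModel = llamaCpp({ modelPath: './models/summary-model.gguf' });
try {
  // ... run generateText with summaryModel ...
} finally {
  await summaryModel.dispose();
}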

Limitations

  • macOS only: Windows and Linux are not supported
  • No tool/function calling: Tool calls are not supported
  • No image inputs: Only text prompts are supported

Additional Resources