⚡ Now with 20+ Production GGUF Models

Build with 20+ Cloud GGUF Models

OpenAI-compatible API for Llama, Mixtral, Qwen, Phi, Gemma, and more. Sub-100ms latency, streaming support, function calling.

20+ GGUF Models

<100ms Avg Latency

99.9% Uptime

50K+ Developers

Everything you need to build with AI

Production-grade infrastructure for deploying and scaling GGUF models

🤖

20+ Cloud GGUF Models

Access Llama, Mixtral, Qwen, Phi, Gemma, Command, DeepSeek and more. All optimized for performance with quantization levels from Q4 to Q8.

⚡

Sub-100ms Latency

Edge-deployed models on Cloudflare Workers with global CDN. Average response time under 100ms for all GGUF models.

🔌

OpenAI-Compatible API

Drop-in replacement for OpenAI API. Same endpoints, same format, same SDKs. Works with all major AI frameworks.

💬

Streaming Support

Server-Sent Events (SSE) and WebSocket streaming for real-time responses. Perfect for chat applications.

🔒

Privacy First

No data retention. All requests processed in-memory. Optional end-to-end encryption for enterprise plans.

🎯

Function Calling

Native function calling support across all models. JSON mode, tool use, and structured outputs.
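As a rough illustration of the tool-use flow described above, the sketch below assumes the service accepts OpenAI-style `tools` definitions and returns OpenAI-style `tool_calls`. The page does not confirm the exact schema. The `get_weather` tool, the `handlers` map, the `runToolCalls` helper, and the sample assistant message are all hypothetical.

```javascript
// Hypothetical tool definition in the OpenAI "tools" format (assumed to be accepted here).
const tools = [{
  type: 'function',
  function: {
    name: 'get_weather',
    description: 'Look up the current weather for a city',
    parameters: {
      type: 'object',
      properties: { city: { type: 'string' } },
      required: ['city']
    }
  }
}];

// Local implementations keyed by tool name (stubbed for illustration).
const handlers = {
  get_weather: ({ city }) => ({ city, forecast: 'sunny' })
};

// Runs each tool call found in an assistant message and returns the
// "tool" role messages that would be sent back on the next request.
function runToolCalls(assistantMessage) {
  return (assistantMessage.tool_calls ?? []).map((call) => {
    const args = JSON.parse(call.function.arguments); // arguments arrive as a JSON string
    const result = handlers[call.function.name](args);
    return { role: 'tool', tool_call_id: call.id, content: JSON.stringify(result) };
  });
}

// Made-up assistant message shaped like an OpenAI tool-call response.
const exampleMessage = {
  role: 'assistant',
  tool_calls: [{
    id: 'call_1',
    type: 'function',
    function: { name: 'get_weather', arguments: '{"city":"Paris"}' }
  }]
};

console.log(runToolCalls(exampleMessage));
```

In a real request, `tools` would presumably be passed alongside `messages` to the chat completions call, and the returned tool messages appended to the conversation before asking the model to continue.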

Available Models

All models available via OpenAI-compatible API with consistent pricing

Llama 3.3 70B

OpenAI-Compatible API

Drop-in replacement for OpenAI. Same SDKs, same format, better performance.

API Endpoints

POST /api/v1/chat/completions: Main chat completion endpoint

POST /api/v1/generate: Text generation endpoint

GET /api/v1/models: List all available models

POST /api/v1/embeddings: Create text embeddings

POST /api/v1/completions: Legacy completions API

GET /api/v1/health: Health check endpoint

POST /api/v1/tokenize: Tokenize text input

GET /api/v1/usage: Get usage statistics

WS /api/v1/stream: WebSocket streaming endpoint

POST /api/v1/images/generate: Image generation

POST /api/v1/audio/transcribe: Audio transcription

POST /api/v1/code/generate: Code generation specialist
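The endpoints above can also be called without an SDK. The sketch below shows one way to hit the models endpoint with plain `fetch`. It assumes the host `https://api.caffeine.ai` (the page lists only paths), Bearer-token authentication, and an OpenAI-style list response of the form `{ data: [{ id }] }`. None of these are confirmed here, and `listModels` is a hypothetical helper name.

```javascript
// Assumed host; the endpoint list gives only paths, so substitute the real base URL.
const BASE_URL = 'https://api.caffeine.ai';

// Lists model ids via GET /api/v1/models.
// fetchImpl is injectable so the function can be exercised without a network.
async function listModels(apiKey, fetchImpl = fetch) {
  const res = await fetchImpl(`${BASE_URL}/api/v1/models`, {
    headers: { Authorization: `Bearer ${apiKey}` } // assumed auth scheme
  });
  if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
  const body = await res.json();
  // Assumes the OpenAI list shape: { object: 'list', data: [{ id: '...' }] }
  return body.data.map((model) => model.id);
}
```

Usage would look like `const ids = await listModels(process.env.API_KEY);`, where `API_KEY` is a placeholder environment variable name.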

example.js
import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'https://api.caffeine.ai/v1',
  apiKey: 'your-api-key'
});

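// With stream: true the call resolves to an async iterable of chunks.
const stream = await openai.chat.completions.create({
  model: 'llama-3.3-70b',
  messages: [{
    role: 'user',
    content: 'Hello!'
  }],
  stream: true
});

// Print each text delta as it arrives.
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}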
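
For clients that read the Server-Sent Events stream directly instead of going through the SDK, the parsing step might look like the sketch below. It assumes OpenAI-style framing, where each event is a `data: {json}` line and the stream ends with `data: [DONE]`. This page does not spell out the wire format, so treat that as an assumption. `extractDeltas` and the sample payload are hypothetical.

```javascript
// Hypothetical helper: extracts text deltas from a block of SSE text.
// Assumes OpenAI-style framing: "data: {json}" lines, ended by "data: [DONE]".
function extractDeltas(sseText) {
  const deltas = [];
  for (const line of sseText.split(/\r?\n/)) {
    if (!line.startsWith('data:')) continue;   // skip blank lines, comments, other fields
    const payload = line.slice('data:'.length).trim();
    if (payload === '[DONE]') break;           // end-of-stream sentinel
    const chunk = JSON.parse(payload);
    const text = chunk.choices?.[0]?.delta?.content;
    if (text) deltas.push(text);
  }
  return deltas;
}

// Sample payload shaped like OpenAI chat completion chunks (made up for illustration).
const sample = [
  'data: {"choices":[{"delta":{"content":"Hel"}}]}',
  '',
  'data: {"choices":[{"delta":{"content":"lo!"}}]}',
  '',
  'data: [DONE]',
  ''
].join('\n');

console.log(extractDeltas(sample).join('')); // prints "Hello!"
```

A production reader would also need to buffer partial lines, since a network chunk can end in the middle of an event. This sketch only handles complete text.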