Now with 20+ Production GGUF Models

Build with 20+ Cloud GGUF Models

OpenAI-compatible API for Llama, Mixtral, Qwen, Phi, Gemma, and more. Sub-100ms latency, streaming support, function calling.

20+ GGUF Models
<100ms Avg Latency
99.9% Uptime
50K+ Developers

Everything you need to build with AI

Production-grade infrastructure for deploying and scaling GGUF models

🤖

20+ Cloud GGUF Models

Access Llama, Mixtral, Qwen, Phi, Gemma, Command, DeepSeek and more. All optimized for performance with quantization levels from Q4 to Q8.
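As a rough guide to what the Q4 to Q8 range means for model size, the sketch below estimates weight storage from parameter count and bits per weight. It assumes the classic GGUF block formats Q4_0, Q5_0, and Q8_0; the page does not say which variants are actually served, and real files come out somewhat different because some tensors are stored at other precisions. The function name `estimateGb` is illustrative.

```javascript
// Bits per weight for three classic GGUF block formats
// (32 weights per block plus one fp16 scale; Q5_0 adds 32 high bits).
// Assumption: these variants stand in for the "Q4 to Q8" range mentioned above.
const BITS_PER_WEIGHT = { Q4_0: 4.5, Q5_0: 5.5, Q8_0: 8.5 };

// Estimate weight storage in gigabytes (10^9 bytes) for a given parameter count.
function estimateGb(paramCount, quant) {
  const bits = BITS_PER_WEIGHT[quant];
  if (bits === undefined) throw new Error(`Unknown quantization: ${quant}`);
  return (paramCount * bits) / 8 / 1e9;
}

// Example: a 70B-parameter model at each level.
for (const quant of Object.keys(BITS_PER_WEIGHT)) {
  console.log(`${quant}: ~${estimateGb(70e9, quant).toFixed(1)} GB`);
}
// Q4_0: ~39.4 GB
// Q5_0: ~48.1 GB
// Q8_0: ~74.4 GB
```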

Sub-100ms Latency

Edge-deployed models on Cloudflare Workers with global CDN. Average response time under 100ms for all GGUF models.

🔌

OpenAI-Compatible API

Drop-in replacement for the OpenAI API. Same endpoints, same request format, same SDKs. Works with all major AI frameworks.

💬

Streaming Support

Server-Sent Events (SSE) and WebSocket streaming for real-time responses. Perfect for chat applications.
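For clients that read the SSE stream without an SDK, a minimal parsing sketch might look like the following. It assumes the stream uses the OpenAI-style framing (`data: {json}` lines, ending with `data: [DONE]`), which the page implies but does not spell out, and that each chunk of text ends on a line boundary. The helper names `parseSseChunk` and `collectText` are illustrative.

```javascript
// Parse a block of SSE text into JSON payloads.
// Assumption: OpenAI-style framing, i.e. "data: {json}" lines and a final "data: [DONE]".
function parseSseChunk(text) {
  const events = [];
  let done = false;
  for (const line of text.split(/\r?\n/)) {
    if (!line.startsWith('data:')) continue; // skip blank lines, comments, other fields
    const payload = line.slice(5).trim();
    if (payload === '[DONE]') {
      done = true;
      break;
    }
    events.push(JSON.parse(payload));
  }
  return { events, done };
}

// Join the text deltas carried by the parsed events.
function collectText(events) {
  return events.map((e) => e.choices?.[0]?.delta?.content ?? '').join('');
}

// Sample stream, written out by hand for illustration.
const sample = [
  'data: {"choices":[{"delta":{"content":"Hel"}}]}',
  '',
  'data: {"choices":[{"delta":{"content":"lo!"}}]}',
  '',
  'data: [DONE]',
  '',
].join('\n');

const { events, done } = parseSseChunk(sample);
console.log(collectText(events), done); // prints "Hello! true"
```

A production client would also buffer partial lines across network chunks, which this sketch leaves out.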

🔒

Privacy First

No data retention. All requests processed in-memory. Optional end-to-end encryption for enterprise plans.

🎯

Function Calling

Native function calling support across all models. JSON mode, tool use, and structured outputs.
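A function-calling round trip could be sketched as below, assuming the API accepts the OpenAI `tools` request format and returns tool calls in the OpenAI response shape. The `get_weather` tool, its stubbed handler, and the helper names are hypothetical and only illustrate the flow.

```javascript
// Tool definition in the OpenAI "tools" format.
// get_weather is a hypothetical tool used only for illustration.
const tools = [{
  type: 'function',
  function: {
    name: 'get_weather',
    description: 'Look up the current weather for a city',
    parameters: {
      type: 'object',
      properties: { city: { type: 'string' } },
      required: ['city'],
    },
  },
}];

// Local implementations, keyed by tool name (stubbed result here).
const handlers = {
  get_weather: ({ city }) => ({ city, forecast: 'sunny' }),
};

// Build the chat completion request body.
function buildRequest(userText) {
  return {
    model: 'llama-3.3-70b',
    messages: [{ role: 'user', content: userText }],
    tools,
    tool_choice: 'auto',
  };
}

// Run one tool call from an assistant message and return the
// "tool" message to append to the conversation.
function runToolCall(toolCall) {
  const handler = handlers[toolCall.function.name];
  if (!handler) throw new Error(`Unknown tool: ${toolCall.function.name}`);
  const args = JSON.parse(toolCall.function.arguments); // arguments arrive as a JSON string
  return {
    role: 'tool',
    tool_call_id: toolCall.id,
    content: JSON.stringify(handler(args)),
  };
}

// Simulated tool call, shaped like an OpenAI-style assistant response.
const reply = runToolCall({
  id: 'call_1',
  type: 'function',
  function: { name: 'get_weather', arguments: '{"city":"Oslo"}' },
});
console.log(reply.content); // {"city":"Oslo","forecast":"sunny"}
```

The body from `buildRequest` would be passed to `chat.completions.create`, and the message from `runToolCall` sent back in a follow-up request so the model can produce its final answer.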

Available Models

All models are available through the OpenAI-compatible API with consistent pricing.

Llama 3.3 70B (Meta): State-of-the-art open model
Llama 3.2 90B (Meta): Vision + text capabilities
Llama 3.1 405B (Meta): Most capable open model
Mixtral 8x22B (Mistral): Sparse mixture of experts
Mixtral 8x7B (Mistral): Efficient MoE architecture
Qwen 2.5 72B (Alibaba): Strong multilingual model
Qwen 2 110B (Alibaba): Advanced reasoning model
Phi-4 (Microsoft): Compact yet powerful
Gemma 2 27B (Google): Efficient open model
Gemma 2 9B (Google): Fast and efficient
Command R+ (Cohere): Enterprise-grade model
Command R (Cohere): Balanced performance
DeepSeek 67B (DeepSeek): Coding specialist
Yi-34B (01.AI): Strong Chinese-English model
SOLAR 10.7B (Upstage): Korean-English optimized
Neural Chat 7B (Intel): Optimized for chat
OpenChat 3.5 (OpenChat): General purpose chat
Starling LM 7B (Nexusflow): RLHF trained
Zephyr 7B β (HuggingFace): Small but mighty
Orca 2 13B (Microsoft): Reasoning specialist

OpenAI-Compatible API

Drop-in replacement for OpenAI. Same SDKs, same format, better performance.

API Endpoints

POST /api/v1/chat/completions - Main chat completion endpoint
POST /api/v1/generate - Text generation endpoint
GET /api/v1/models - List all available models
POST /api/v1/embeddings - Create text embeddings
POST /api/v1/completions - Legacy completions API
GET /api/v1/health - Health check endpoint
POST /api/v1/tokenize - Tokenize text input
GET /api/v1/usage - Get usage statistics
WS /api/v1/stream - WebSocket streaming endpoint
POST /api/v1/images/generate - Image generation
POST /api/v1/audio/transcribe - Audio transcription
POST /api/v1/code/generate - Code generation specialist
example.js
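import OpenAI from 'openai';

// Point the official OpenAI SDK at this API
const openai = new OpenAI({
  baseURL: 'https://api.caffeine.ai/v1',
  apiKey: 'your-api-key',
});

// Request a streamed chat completion
const stream = await openai.chat.completions.create({
  model: 'llama-3.3-70b',
  messages: [{ role: 'user', content: 'Hello!' }],
  stream: true,
});

// Print each piece of the reply as it arrives
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}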
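
The other endpoints in the list can be called without the SDK. The sketch below lists model IDs with plain `fetch`. It assumes the base URL from example.js (the endpoint list shows paths under /api/v1, so the real prefix may differ), bearer-token authentication, and an OpenAI-style `{ data: [{ id }] }` response body. None of these is confirmed by the page, and the helper names are illustrative.

```javascript
// Assumption: base URL copied from example.js; adjust if your deployment differs.
const BASE_URL = 'https://api.caffeine.ai/v1';

// Join the base URL and an endpoint path, tolerating a leading slash.
function endpoint(path) {
  return `${BASE_URL}/${path.replace(/^\/+/, '')}`;
}

// Fetch the model list and return the model IDs.
// Assumption: the response follows the OpenAI list shape { data: [{ id, ... }] }.
async function listModels(apiKey, fetchImpl = fetch) {
  const res = await fetchImpl(endpoint('models'), {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
  const body = await res.json();
  return body.data.map((model) => model.id);
}

console.log(endpoint('/models')); // https://api.caffeine.ai/v1/models
```

Calling `listModels('your-api-key')` returns a promise for the ID array. The optional `fetchImpl` parameter lets a stub be passed in tests so no network access is needed.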