Chat with llama.cpp

Start the llama.cpp server

./build/bin/llama-server -hf Qwen/Qwen2-0.5B-Instruct-GGUF:Q2_K --port 8000
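
llama-server listens on port 8080 by default, so the command above passes --port 8000 to match the requests below. The server also exposes a GET /health endpoint that returns 200 once the model has finished loading. Here is a minimal Python sketch (standard library only) that waits for readiness before sending requests; the URL and timeout are assumptions:

import time
import urllib.request

# Poll /health until llama-server reports ready (HTTP 200).
# Assumes the server started above is running on localhost:8000.
def wait_for_server(url="http://localhost:8000/health", timeout=120):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # not up yet: connection refused, or 503 while loading
        time.sleep(1)
    return False

if __name__ == "__main__":
    print("ready" if wait_for_server() else "timed out")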

Send a POST request with curl

curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": "Hello! How are you today?"}
        ],
        "max_tokens": 150,
        "temperature": 0.7
    }'
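
Because llama-server implements the OpenAI chat-completions schema, the same request is easy to send from code. The following is a standard-library-only Python sketch of the curl call above; it assumes the server is reachable at localhost:8000:

import json
import urllib.request

# Build the same chat request as the curl command above.
payload = {
    "messages": [
        {"role": "user", "content": "Hello! How are you today?"}
    ],
    "max_tokens": 150,
    "temperature": 0.7,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Send the request and decode the JSON body shown in the next section.
with urllib.request.urlopen(req) as resp:
    response = json.loads(resp.read())

print(json.dumps(response, indent=2))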

Parse the response

{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "I'm good, thanks! How can I assist you today?"
            }
        }
    ],
    "created": 1751341871,
    "model": "gpt-3.5-turbo",
    "system_fingerprint": "b5787-0a5a3b5c",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 14,
        "prompt_tokens": 15,
        "total_tokens": 29
    },
    "id": "chatcmpl-bGxdAhvPfBtQ8O5Uc9hrPGWFdv1w1m9e",
    "timings": {
        "prompt_n": 12,
        "prompt_ms": 96.768,
        "prompt_per_token_ms": 8.064,
        "prompt_per_second": 124.0079365079365,
        "predicted_n": 14,
        "predicted_ms": 216.101,
        "predicted_per_token_ms": 15.435785714285714,
        "predicted_per_second": 64.78452205218855
    }
}
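
Two details worth noting: the model field is a placeholder reported by the server (we loaded a Qwen model, not gpt-3.5-turbo), and the timings object is a llama.cpp-specific extension to the OpenAI schema. A small sketch of pulling out the fields you usually care about, assuming response is the parsed dict from the Python snippet above:

def summarize(response: dict) -> None:
    # The first (and here only) choice carries the assistant's reply.
    choice = response["choices"][0]
    print("assistant:", choice["message"]["content"])
    print("finish_reason:", choice["finish_reason"])

    usage = response["usage"]
    print(f"tokens: {usage['prompt_tokens']} prompt + "
          f"{usage['completion_tokens']} completion = {usage['total_tokens']}")

    # timings is llama.cpp-specific; guard in case it is absent.
    timings = response.get("timings")
    if timings:
        print(f"generation speed: {timings['predicted_per_second']:.1f} tokens/s")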
