Setting Up a llama.cpp Development Environment and Running the simple Example

Step 1: Build

Get the code:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

Build with CUDA enabled:

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
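
Once the build finishes, the compiled tools are placed under build/bin. As an optional sanity check (assuming llama-cli is among the default build targets, which it normally is), you can list them and print the version and build info:

# Optional: verify the build
ls ./build/bin
./build/bin/llama-cli --version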

Step 2: Download a model (Hugging Face)

# Download (if not already cached) and run the model:
./build/bin/llama-server -hf Qwen/Qwen2-0.5B-Instruct-GGUF:Q2_K

The downloaded Qwen/Qwen2-0.5B-Instruct-GGUF:Q2_K model is cached at ~/.cache/llama.cpp/Qwen_Qwen2-0.5B-Instruct-GGUF_qwen2-0_5b-instruct-q2_k.gguf
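
While llama-server is running, it exposes an OpenAI-compatible HTTP API (by default on 127.0.0.1:8080; adjustable with --host and --port). A minimal curl query against the chat endpoint looks like this:

curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "hello, my name is"}]}'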

Step 3: Run the simple example

./build/bin/llama-simple -m ~/.cache/llama.cpp/Qwen_Qwen2-0.5B-Instruct-GGUF_qwen2-0_5b-instruct-q2_k.gguf "hello, my name is"
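
The command above uses the example's defaults (32 predicted tokens). In current versions of examples/simple/simple.cpp the program also accepts -n (number of tokens to predict) and -ngl (number of layers to offload to the GPU), with flags placed before the prompt; these options are version-dependent, so check the usage message if they are rejected:

./build/bin/llama-simple -m ~/.cache/llama.cpp/Qwen_Qwen2-0.5B-Instruct-GGUF_qwen2-0_5b-instruct-q2_k.gguf -n 64 -ngl 99 "hello, my name is"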

Check the output

llama_context: graph nodes  = 918
llama_context: graph splits = 2
hello, my name is [name], I am a student of [name] and I am interested in [name]. I am a member of [name] and I am a member
main: decoded 32 tokens in 1.24 s, speed: 25.80 t/s

llama_perf_sampler_print:    sampling time =       8.01 ms /    32 runs   (    0.25 ms per token,  3997.50 tokens per second)
llama_perf_context_print:        load time =    1008.55 ms
llama_perf_context_print: prompt eval time =     113.13 ms /     5 tokens (   22.63 ms per token,    44.20 tokens per second)
llama_perf_context_print:        eval time =    1074.05 ms /    31 runs   (   34.65 ms per token,    28.86 tokens per second)
llama_perf_context_print:       total time =    2135.77 ms /    36 tokens
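
Reading the numbers: the 5-token prompt was processed in 113.13 ms (about 44 tokens/s), and the 31 subsequent generation steps took 1074.05 ms, roughly 34.65 ms per token or about 29 tokens/s; with sampling and other overhead included, that adds up to the reported 32 decoded tokens in 1.24 s, about 25.8 tokens/s end to end.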

THE END