Setting Up a llama.cpp Development Environment and Running the simple Example
Step 1: Build
Get the code:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
Build with CUDA support:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
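If you do not have an NVIDIA GPU, the CUDA flag can simply be dropped for a CPU-only build. A minimal sketch, assuming a reasonably recent CMake (-j just parallelizes the build); the final ls is only a sanity check that the binaries used in the later steps were produced:
# CPU-only alternative: omit -DGGML_CUDA=ON
cmake -B build
cmake --build build --config Release -j
# Sanity check: the binaries used in the next steps should now exist
ls build/bin/llama-simple build/bin/llama-server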
Step 2: Download a model (from Hugging Face)
# Download (on first run) and serve the model:
./build/bin/llama-server -hf Qwen/Qwen2-0.5B-Instruct-GGUF:Q2_K
The log shows the downloaded file at ~/.cache/llama.cpp/Qwen_Qwen2-0.5B-Instruct-GGUF_qwen2-0_5b-instruct-q2_k.gguf
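If you would rather manage the file yourself, you can fetch the GGUF explicitly and point llama.cpp at it with -m. A rough sketch; the in-repo filename qwen2-0_5b-instruct-q2_k.gguf is an assumption inferred from the cached file name above:
# Assumed filename, inferred from the cache entry above
huggingface-cli download Qwen/Qwen2-0.5B-Instruct-GGUF qwen2-0_5b-instruct-q2_k.gguf --local-dir models
./build/bin/llama-server -m models/qwen2-0_5b-instruct-q2_k.gguf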
Step 3: Run the simple example
./build/bin/llama-simple -m ~/.cache/llama.cpp/Qwen_Qwen2-0.5B-Instruct-GGUF_qwen2-0_5b-instruct-q2_k.gguf "hello, my name is"
Check the output:
llama_context: graph nodes = 918
llama_context: graph splits = 2
hello, my name is [name], I am a student of [name] and I am interested in [name]. I am a member of [name] and I am a member
main: decoded 32 tokens in 1.24 s, speed: 25.80 t/s
llama_perf_sampler_print: sampling time = 8.01 ms / 32 runs ( 0.25 ms per token, 3997.50 tokens per second)
llama_perf_context_print: load time = 1008.55 ms
llama_perf_context_print: prompt eval time = 113.13 ms / 5 tokens ( 22.63 ms per token, 44.20 tokens per second)
llama_perf_context_print: eval time = 1074.05 ms / 31 runs ( 34.65 ms per token, 28.86 tokens per second)
llama_perf_context_print: total time = 2135.77 ms / 36 tokens
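To generate more tokens or to offload layers to the GPU, llama-simple also accepts -n and -ngl per its usage message; treat the exact flag names as an assumption if your checkout differs. A minimal sketch:
# Sketch: generate 64 new tokens with all layers offloaded to the GPU
./build/bin/llama-simple -m ~/.cache/llama.cpp/Qwen_Qwen2-0.5B-Instruct-GGUF_qwen2-0_5b-instruct-q2_k.gguf -n 64 -ngl 99 "hello, my name is"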
THE END