Accelerating Local LLaMA.cpp Inference with AVX512 on Your Own Machine

Want to play with large language models locally, but your server only has a CPU and no high-end GPU?

No no no. Why not try the latest and trendiest (well, not quite) LLaMA.cpp? With enough RAM and an Intel server CPU that supports the AVX512 instruction set, you can still try out large models. Inference is a bit slower, but 64 GB or more of ordinary memory is all it takes.

Then you open a tutorial: compile it yourself, set up the environment, quantize the model…

Sounds complicated? It's not as hard as it looks!

Installing the Intel oneAPI Compiler Toolkit

First, head to Intel's official site, Download the Intel® oneAPI Base Toolkit, and grab the compiler toolkit that supports the AVX512 instruction set.

After the download finishes, install it with:

sh ./l_BaseKit_p_2024.1.0.596_offline.sh -a --silent --cli --eula accept

The example command on the official site uses sudo, but a user-mode install is also supported. The -a option looks a bit puzzling; what it actually does is pass the arguments that follow through to the installer. If you prefer an interactive install, drop the --silent option.
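
For example, an interactive user-mode install is just the same command without --silent (the package filename is the one from the download above; yours may differ):

sh ./l_BaseKit_p_2024.1.0.596_offline.sh -a --cli --eula accept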

Sourcing cmake and the oneAPI Suite (Optional)

If you don't have root privileges, cmake and oneAPI can only be installed under your user directory, so you need to bring those environments into your shell first:

source /path/to/intel/oneapi/setvars.sh
source /path/to/cmake/activate_cmake.sh

The activate_cmake.sh file here is something I created myself; it simply adds cmake's bin directory to PATH for later convenience. Its contents are:

#!/bin/bash
export PATH="$PATH:/path/to/cmake/bin"
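
After sourcing both scripts, a quick sanity check (my own habit, not required by the build) confirms that the Intel compilers and cmake are actually on PATH:

which icx icpx cmake
icx --version
cmake --version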

Building and Installing LLaMA.cpp

Clone the LLaMA.cpp repository (don't download the packaged source archive from Releases; the build checks whether a git repository is present, so clone it directly):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Following the instructions in the repository, find the build commands in the Intel oneMKL section and run them directly:

cmake -B build -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_NATIVE=ON
cmake --build build --config Release

Once the build finishes, the executables are under build/bin. If you want to install them into a specific directory, continue with:

cmake --install build/ --config Release --prefix /path/you/like/
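
If you do install to a custom prefix, the binaries land in its bin subdirectory; following the same pattern as activate_cmake.sh above, you can add it to PATH (the prefix here is of course a placeholder):

export PATH="$PATH:/path/you/like/bin"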

Quantizing and Loading Model Files

With a working LLaMA.cpp in hand, you can quantize a trained model and then use it. Of course, if you don't want to train or quantize anything yourself, you can also download a model that has already been quantized.
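
For reference, quantizing your own model roughly follows this flow (a sketch based on the llama.cpp tooling of this era; the conversion script name and all paths are assumptions and may differ between versions):

# convert a Huggingface checkpoint to an FP16 GGUF file
python convert-hf-to-gguf.py /path/to/Qwen2-7B-Instruct --outfile qwen2-7b-instruct-f16.gguf
# quantize it down to Q5_K_M (see the table in the Pitfalls section below)
./quantize qwen2-7b-instruct-f16.gguf qwen2-7b-instruct-q5_k_m.gguf Q5_K_M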

Taking the Qwen model as an example, follow the llama.cpp - Qwen guide.

First, download the model's GGUF file from Huggingface, for example Qwen/Qwen2-7B-Instruct-GGUF.

Unlike checkpoints in other formats on HF, here you only need to download one suitable GGUF file.

Once the download finishes, load the model file together with an initial prompt:

./main -m ../models/qwen2-7b-instruct-q5_k_m.gguf -n 512 --color -i -cml -f ../prompts/chat-with-qwen.txt

Then you can start chatting:

> Hello, tell me more about yourself.
I am an artificial intelligence designed to assist with a wide range of tasks, including answering questions, providing information, and performing various tasks. I am constantly learning and improving, and I am constantly updated with the latest technology and information. I am designed to assist with a wide range of tasks, and I am always ready to help you with whatever you need.

> What is 1+2?
1 + 2 equals 3.

Pitfalls

There are many quantized variants; which one should you pick?

See Difference in different quantization methods.

quantize --help outputs a helpful table:

Allowed quantization types:
2 or Q4_0 : 3.50G, +0.2499 ppl @ 7B - small, very high quality loss - legacy, prefer using Q3_K_M
3 or Q4_1 : 3.90G, +0.1846 ppl @ 7B - small, substantial quality loss - legacy, prefer using Q3_K_L
8 or Q5_0 : 4.30G, +0.0796 ppl @ 7B - medium, balanced quality - legacy, prefer using Q4_K_M
9 or Q5_1 : 4.70G, +0.0415 ppl @ 7B - medium, low quality loss - legacy, prefer using Q5_K_M
10 or Q2_K : 2.67G, +0.8698 ppl @ 7B - smallest, extreme quality loss - not recommended
12 or Q3_K : alias for Q3_K_M
11 or Q3_K_S : 2.75G, +0.5505 ppl @ 7B - very small, very high quality loss
12 or Q3_K_M : 3.06G, +0.2437 ppl @ 7B - very small, very high quality loss
13 or Q3_K_L : 3.35G, +0.1803 ppl @ 7B - small, substantial quality loss
15 or Q4_K : alias for Q4_K_M
14 or Q4_K_S : 3.56G, +0.1149 ppl @ 7B - small, significant quality loss
15 or Q4_K_M : 3.80G, +0.0535 ppl @ 7B - medium, balanced quality - *recommended*
17 or Q5_K : alias for Q5_K_M
16 or Q5_K_S : 4.33G, +0.0353 ppl @ 7B - large, low quality loss - *recommended*
17 or Q5_K_M : 4.45G, +0.0142 ppl @ 7B - large, very low quality loss - *recommended*
18 or Q6_K : 5.15G, +0.0044 ppl @ 7B - very large, extremely low quality loss
7 or Q8_0 : 6.70G, +0.0004 ppl @ 7B - very large, extremely low quality loss - not recommended
1 or F16 : 13.00G @ 7B - extremely large, virtually no quality loss - not recommended
0 or F32 : 26.00G @ 7B - absolutely huge, lossless - not recommended

The ppl column is perplexity increase relative to unquantized. Q4_K_M, Q5_K_S and Q5_K_M are considered "recommended".
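
If you want to measure the quality difference yourself, llama.cpp also builds a perplexity tool; a minimal run looks like this (the test text file is an assumption, any plain-text corpus such as wikitext-2 works):

./perplexity -m ../models/qwen2-7b-instruct-q5_k_m.gguf -f /path/to/wiki.test.raw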

Getting unknown argument: -cml

Starting with Releases/b3087, LLaMA.cpp refactored the main program and removed the -cml argument, but the Qwen documentation still relies on it.

Fixing this properly is a bit of a hassle; the no-brainer workaround is to roll back to the previous release, Releases/b3086.

In the top-level LLaMA.cpp directory, run:

git revert --no-commit b3086..HEAD

Then rebuild once and you're good to go.
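
Alternatively, if you don't need to keep the uncommitted reverts in your working tree and don't mind a detached HEAD, checking out the release tag directly achieves the same thing (assuming the b3086 tag is available after fetching tags):

git fetch --tags
git checkout b3086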

For more details, see Discussions/7837.

libmkl_intel_ilp64.so.2: No such file or directory

If you see the following error:

./main: error while loading shared libraries: libmkl_intel_ilp64.so.2: cannot open shared object file: No such file or directory

This happens because Intel's MKL library file libmkl_intel_ilp64.so.2 cannot be found via LD_LIBRARY_PATH; it actually lives in /path/to/intel/oneapi/mkl/<year>.<ver>/lib.

The fix is quite simple: just run source /path/to/intel/oneapi/setvars.sh again, and the script will add the library path to the environment for you.
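
If you'd rather not source the whole setvars.sh again, exporting the library directory mentioned above by hand also works (the version segment of the path is a placeholder):

export LD_LIBRARY_PATH="/path/to/intel/oneapi/mkl/<year>.<ver>/lib:$LD_LIBRARY_PATH"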

oneAPI Compiler vs. GCC Performance

However, in actual testing, a reasonably recent version of GCC can also use the AVX512 instruction set to speed up the program. So congratulations: like me, you've stepped right into a pit!
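
If you want to verify this on your own machine, you can check that the CPU actually advertises AVX512 and then do a plain GCC build for comparison (a rough sketch; the build directory name is arbitrary, and -DLLAMA_NATIVE=ON is the same flag used in the oneAPI build above):

# list the AVX512 feature flags the CPU exposes
grep -o 'avx512[a-z_]*' /proc/cpuinfo | sort -u
# build with the default (GCC) toolchain, letting it target the host CPU
cmake -B build-gcc -DLLAMA_NATIVE=ON
cmake --build build-gcc --config Release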