Environment Dependencies
Run pip install poetry==1.8.0 to install the Poetry package manager. The specific dependencies managed by Poetry are as follows:
openai = "^1.30.3"
fastapi = "^0.111.0"
transformers = "^4.41.1"
tiktoken = "^0.6.0"
torch = "^2.3.0"
sse-starlette = "^2.1.0"
sentence-transformers = "^2.7.0"
sentencepiece = "^0.2.0"
accelerate = "^0.30.1"
pydantic = "^2.7.1"
timm = "^1.0.3"
pandas = "^2.2.2"
vllm = "^0.4.2"
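These entries follow Poetry's pyproject.toml dependency syntax. After poetry install completes, an optional sanity check is to confirm that the core packages import and report versions compatible with the constraints above:

# Optional sanity check: the core packages should import cleanly
# and report versions matching the constraints above.
import torch
import transformers
import vllm

print(torch.__version__)         # expect 2.3.x
print(transformers.__version__)  # expect 4.41.x
print(vllm.__version__)          # expect 0.4.x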
vLLM Does Not Yet Support Tool Calls
vLLM currently does not support passing tools the way the OpenAI package does (see the vLLM PR for details). That is, the standard OpenAI tool-calling flow below does not work against a vLLM server:
from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "What's the weather like in San Francisco, Tokyo, and Paris?"}]
tools = [...]  # tool definitions in the OpenAI function-calling schema

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,            # vLLM does not support this parameter
    tool_choice="auto",     # auto is default, but we'll be explicit
)
response_message = response.choices[0].message
tool_calls = response_message.tool_calls  # not populated when served by vLLM
Instead, we have to build the prompt in each model's own format, pass it in through messages, and then call the server with the OpenAI SDK, as the Client section below shows.
Zhipu GLM
Starting the OpenAI-API Server
# ChatGLM3-6B
python -m vllm.entrypoints.openai.api_server --served-model-name ChatGLM3-6B --model /THUDM/chatglm3-6b \
--max-model-len 8192 --chat-template chatglm-template.jinja \
--tokenizer /THUDM/chatglm3-6b --tensor-parallel-size 4 --trust-remote-code
# GLM-4-9B-Chat
python -m vllm.entrypoints.openai.api_server --served-model-name GLM-4-9B --model /THUDM/glm-4-9b-chat \
--max-model-len 8192 --chat-template chatglm-template.jinja \
--tokenizer /THUDM/glm-4-9b-chat --tensor-parallel-size 4 --trust-remote-code
- --host: the IP address on which to serve the model
- --port: the port on which to serve the model
- --tensor-parallel-size: how many GPUs to use for tensor-parallel inference
- --max-model-len: the model's context length, set to 8K in the examples above
Tip
It is best to set --max-model-len explicitly here; otherwise the default of 32768 is used, which can exceed the GPU memory available on the machine and cause an OOM error.
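Once the server is up, a quick way to verify it is reachable (assuming the default 127.0.0.1:8000 address also used by the client below) is to list the served models through the OpenAI SDK:

# Quick reachability check against the vLLM server started above
# (assumes it is listening on the default 127.0.0.1:8000).
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://127.0.0.1:8000/v1")
print([m.id for m in client.models.list().data])  # e.g. ['ChatGLM3-6B']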
The chatglm-template.jinja file referenced above contains the following:
{% for message in messages %}
{% if loop.first %}
[gMASK]sop<|{{ message['role'] }}|>\n{{ message['content'] }}
{% else %}
<|{{ message['role'] }}|>\n {{ message['content'] }}
{% endif %}
{% endfor %}
{% if add_generation_prompt %}
<|assistant|>
{% endif %}
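To inspect the prompt this template produces, you can render it offline with jinja2. This is only a sketch: it assumes, as vLLM does when loading a template file, that escape sequences such as \n should be decoded into real characters.

# Sketch: render chatglm-template.jinja locally to see the prompt string
# the server will build. Escape sequences like \n in the file are decoded
# with unicode_escape, mirroring how vLLM loads chat templates.
import codecs
from jinja2 import Template

with open("chatglm-template.jinja") as f:
    source = codecs.decode(f.read(), "unicode_escape")

prompt = Template(source).render(
    messages=[{"role": "user", "content": "Hi"}],
    add_generation_prompt=True,
)
print(prompt)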
Client
On the client side, the OpenAI SDK can be used directly to call the endpoint.
from openai import OpenAI

SYSTEM_PROMPT = "Answer the following questions as best as you can. You have access to the following tools:\n{tools}"

base_url = "http://127.0.0.1:8000/v1"
client = OpenAI(api_key="EMPTY", base_url=base_url)  # vLLM does not validate the key

# Use the model name the server was started with (--served-model-name)
models = client.models.list()
model_name = models.data[0].id

query = "What's the Celsius temperature in San Francisco?"
tools = [
    {
        "name": "get_current_weather",
        "description": "Get the current weather",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "format": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "The temperature unit to use. Infer this from the users location.",
                },
            },
            "required": ["location", "format"],
        },
    },
]

# Embed the tool definitions in the system prompt instead of the
# unsupported `tools` parameter.
messages = [
    {"role": "system", "content": SYSTEM_PROMPT.format(tools=tools)},
    {"role": "user", "content": query},
]

response = client.chat.completions.create(
    model=model_name,
    messages=messages,
    max_tokens=1024,
    temperature=0.8,
    presence_penalty=1.2,
    top_p=0.7,
    extra_body={
        "top_k": 40,  # vLLM-specific sampling parameter passed via extra_body
    },
)
resp_content = response.choices[0].message.content
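Because vLLM never fills in response.choices[0].message.tool_calls, any tool invocation comes back as plain text inside resp_content and has to be parsed by hand. The exact output format is model-specific (ChatGLM3 and GLM-4 emit the tool name and its arguments as text), so the handling below is only a hypothetical sketch:

# The tool call, if any, arrives as plain text rather than a structured
# tool_calls field, so inspect resp_content directly.
print(resp_content)

# Hypothetical dispatch: adapt the check to the format your model emits.
if "get_current_weather" in resp_content:
    # Call the real function here, append its result as a new message,
    # and invoke client.chat.completions.create again for the final answer.
    ...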