In the previous article, I explored why vLLM is gaining popularity and the process of setting up an OpenAI-compatible server when using vllm serve. While the first article focused on the architectural foundations and server initialization process, in this article, I want to dive deeper into the runtime behavior and request processing pipeline.
The /v1/chat/completions endpoint has become the de facto standard for conversational AI applications, powering everything from customer service chatbots to sophisticated AI assistants. Unlike the legacy /v1/completions endpoint, which operates on simple text completion, the chat completions endpoint provides structured message handling, role-based conversations, and built-in context management.
Through this deep dive, I’ll walk you through:
Endpoint Comparison: Detailed comparison between /v1/completions and /v1/chat/completions
Request Processing: Step-by-step breakdown of how chat messages are preprocessed and transformed
Chat Template System: How vLLM applies model-specific chat templates to structure conversations
Internal Pipeline: Deep dive into the inference process, from message parsing to response generation
Performance Considerations: Understanding token efficiency and memory management in chat contexts
By examining vLLM’s implementation of the OpenAI-compatible chat completions endpoint, I’ll uncover the sophisticated engineering that enables high-performance conversational AI serving while maintaining full API compatibility.
Theoretical Background
/v1/completions vs. /v1/chat/completions
As seen in the previous article, the OpenAI-compatible server exposes the two endpoints shown below.
$ vllm serve Qwen/Qwen3-0.6B --max-model-len 8192
...
INFO 06-09 23:16:17 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 06-09 23:16:17 [launcher.py:36] Route: /v1/completions, Methods: POST
...
Let me walk you through the differences between these two endpoints.
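The request body itself is not reproduced here; an illustrative request (the prompt and max_tokens values are my assumptions, chosen to match the shape of the response shown next) looks like this:

import requests

# Illustrative request; the exact prompt behind the response below is not
# included in the original post.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "Qwen/Qwen3-0.6B",
        "prompt": "Hello, my name is",  # assumed prompt
        "max_tokens": 16,
    },
)
print(resp.json())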
{ "id":"cmpl-bc9fa340e282468eb41d47ea9db57bfd", "object":"text_completion", "created":1750076839, "model":"Qwen/Qwen3-0.6B", "choices":[ { "index":0, "text":" My name is Alex. I am a software engineer with a passion for coding and", "logprobs":null, "finish_reason":"length", "stop_reason":null, "prompt_logprobs":null } ], "usage":{ "prompt_tokens":4, "total_tokens":20, "completion_tokens":16, "prompt_tokens_details":null }, "kv_transfer_params":null }
As a result, it responds with a plain continuation of the input "prompt" rather than a chat-style response.
In contrast, /v1/chat/completions applies a chat template to the structured messages in the request and feeds the rendered prompt to the LLM.
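Again as an illustration (the single "Hello, World!" user message is inferred from the response below; the original request body is not shown):

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Hello, World!"}],
    },
)
print(resp.json())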
{ "id":"chatcmpl-dab79c6ebcb24ff58b4e032f6f83b888", "object":"chat.completion", "created":1750076956, "model":"Qwen/Qwen3-0.6B", "choices":[ { "index":0, "message":{ "role":"assistant", "reasoning_content":null, "content":"<think>\nOkay, the user said \"Hello, World!\" and I need to respond. First, I should acknowledge their message. Since it's a simple greeting, a straightforward response is best. I can say \"Hello, World!\" as well, but maybe add a friendly note to keep it engaging. Let me check if there's any context I'm missing, but the message is pretty basic. Just a greeting. Alright, I'll respond with a friendly message to reinforce the exchange.\n</think>\n\nHello, World! 😊 What's interesting about you?", "tool_calls":[] }, "logprobs":null, "finish_reason":"stop", "stop_reason":null } ], "usage":{ "prompt_tokens":12, "total_tokens":125, "completion_tokens":113, "prompt_tokens_details":null }, "prompt_logprobs":null, "kv_transfer_params":null }
As a result, the response comes back in chat format. Unless a separate --chat-template option is specified, the template applied here is the chat_template defined in the model's tokenizer_config.json.
Qwen/Qwen3-0.6B/tokenizer_config.json
... "chat_template":"{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0].role == 'system' %}\n {{- messages[0].content + '\\n\\n' }}\n {%- endif %}\n {{- \"# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0].role == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0].content + '<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n {%- set index = (messages|length - 1) - loop.index0 %}\n {%- if ns.multi_step_tool and message.role == \"user\" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}\n {%- set ns.multi_step_tool = false %}\n {%- set ns.last_query_index = index %}\n {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n {%- if message.content is string %}\n {%- set content = message.content %}\n {%- else %}\n {%- set content = '' %}\n {%- endif %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) %}\n {{- '<|im_start|>' + message.role + '\\n' + content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {%- set reasoning_content = '' %}\n {%- if message.reasoning_content is string %}\n {%- set reasoning_content = message.reasoning_content %}\n {%- else %}\n {%- if '</think>' in content %}\n {%- set reasoning_content = content.split('</think>')[0].rstrip('\\n').split('<think>')[-1].lstrip('\\n') %}\n {%- set content = content.split('</think>')[-1].lstrip('\\n') %}\n {%- endif %}\n {%- endif %}\n {%- if loop.index0 > ns.last_query_index %}\n {%- if loop.last or (not loop.last and reasoning_content) %}\n {{- '<|im_start|>' + message.role + '\\n<think>\\n' + reasoning_content.strip('\\n') + '\\n</think>\\n\\n' + content.lstrip('\\n') }}\n {%- else %}\n {{- '<|im_start|>' + message.role + '\\n' + content }}\n {%- endif %}\n {%- else %}\n {{- '<|im_start|>' + message.role + '\\n' + content }}\n {%- endif %}\n {%- if message.tool_calls %}\n {%- for tool_call in message.tool_calls %}\n {%- if (loop.first and content) or (not loop.first) %}\n {{- '\\n' }}\n {%- endif %}\n {%- if tool_call.function %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {%- if tool_call.arguments is string %}\n {{- tool_call.arguments }}\n {%- else %}\n {{- tool_call.arguments | tojson }}\n {%- endif %}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {%- endif %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if loop.first or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n {%- if 
enable_thinking is defined and enable_thinking is false %}\n {{- '<think>\\n\\n</think>\\n\\n' }}\n {%- endif %}\n{%- endif %}", ...
Chat template testing can be performed as follows:
>>> import transformers
>>> tokenizer = transformers.AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
>>> messages = [
...     { "role": "system", "content": "You are a helpful assistant." },
...     { "role": "user", "content": "What is the capital of France?" },
...     { "role": "assistant", "content": "The capital of France is Paris." },
...     { "role": "user", "content": "Tell me more about it." }
... ]
>>> print(tokenizer.apply_chat_template(messages, tokenize=False))
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>
<|im_start|>user
Tell me more about it.<|im_end|>
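For generation, vLLM's chat endpoint renders the template with add_generation_prompt=True by default, which, per the template above, appends the assistant header so the model knows to start a new turn. Continuing the same session:

>>> print(tokenizer.apply_chat_template(messages, tokenize=False,
...                                     add_generation_prompt=True))
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>
<|im_start|>user
Tell me more about it.<|im_end|>
<|im_start|>assistant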
Request/Response Schema of /v1/chat/completions
Now that I understand the fundamental differences between the endpoints, let me examine the detailed structure of the /v1/chat/completions request and response schemas. Understanding these schemas is crucial for effective API integration and troubleshooting, as they define the contract between client applications and vLLM’s serving infrastructure.
My analysis here is based on vLLM’s source code implementation, providing insights into both OpenAI-compatible fields and vLLM-specific extensions that enhance functionality beyond the standard API specification.
Request Schema
The ChatCompletionRequest class in vLLM implements the complete OpenAI Chat Completions API specification while adding several vLLM-specific extensions for advanced sampling and optimization features.
The schema is carefully organized to match the official OpenAI API documentation order, ensuring maximum compatibility with existing OpenAI client libraries and tools.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| model | Optional[str] | ❌ | None | Model name to use (made optional by vllm-project/vllm#13568) |
| frequency_penalty | Optional[float] | ❌ | 0.0 | Frequency-based token penalty (-2.0 ~ 2.0) |
| logit_bias | Optional[dict[str, float]] | ❌ | None | Bias for specific tokens' logits |
| logprobs | Optional[bool] | ❌ | False | Whether to return log probabilities |
| top_logprobs | Optional[int] | ❌ | 0 | Number of top log probabilities to return (0-20) |
| max_tokens | Optional[int] | ❌ | None | Maximum number of tokens to generate |
| n | Optional[int] | ❌ | 1 | Number of completions to generate |
| presence_penalty | Optional[float] | ❌ | 0.0 | Presence-based token penalty (-2.0 ~ 2.0) |
| response_format | Optional[AnyResponseFormat] | ❌ | None | Response format specification (JSON mode) |
| seed | Optional[int] | ❌ | None | Seed for reproducible output |
| stop | Optional[Union[str, list[str]]] | ❌ | [] | Stop strings for generation |
| stream | Optional[bool] | ❌ | False | Whether to stream responses |
| temperature | Optional[float] | ❌ | None | Sampling temperature (0.0 ~ 2.0) |
| top_p | Optional[float] | ❌ | None | Nucleus sampling probability |
| tools | Optional[list[ChatCompletionToolsParam]] | ❌ | None | Function call tool definitions |
| tool_choice | Optional[Union[Literal, NamedToolChoice]] | ❌ | "none" | Tool selection strategy |
| user | Optional[str] | ❌ | None | User identifier |
| best_of | Optional[int] | ❌ | None | Number of generations to select best from |
| use_beam_search | bool | ❌ | False | Whether to use beam search |
| top_k | Optional[int] | ❌ | None | Consider only top-k tokens |
| min_p | Optional[float] | ❌ | None | Minimum probability threshold |
| repetition_penalty | Optional[float] | ❌ | None | Repetition penalty |
| min_tokens | int | ❌ | 0 | Minimum number of tokens to generate |
| skip_special_tokens | bool | ❌ | True | Whether to skip special tokens in output |
| spaces_between_special_tokens | bool | ❌ | True | Whether to add spaces between special tokens |
| truncate_prompt_tokens | Optional[int] | ❌ | None | Truncate prompt to specified token count |
| prompt_logprobs | Optional[int] | ❌ | None | Number of prompt log probabilities to return |
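Because the vLLM-specific fields above live directly on ChatCompletionRequest, they can be sent in the request body alongside the standard OpenAI fields; with the official openai Python client they would go through extra_body. A sketch with arbitrary values:

import requests

payload = {
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    # standard OpenAI fields
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 128,
    # vLLM extensions beyond the OpenAI spec (values are arbitrary examples)
    "top_k": 20,
    "min_p": 0.05,
    "repetition_penalty": 1.05,
    "min_tokens": 8,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])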
Message Object
The message object structure supports both simple text conversations and complex multimodal interactions. vLLM extends the standard OpenAI message format to support custom roles and enhanced tool integration.
...
class CustomChatCompletionMessageParam(TypedDict, total=False):
    """Enables custom roles in the Chat Completion API."""

    role: Required[str]
    """The role of the message's author."""

    content: Union[str, list[ChatCompletionContentPartParam]]
    """The contents of the message."""

    name: str
    """An optional name for the participant.

    Provides the model information to differentiate between participants of
    the same role.
    """

    tool_call_id: Optional[str]
    """Tool call that this message is responding to."""

    tool_calls: Optional[Iterable[ChatCompletionMessageToolCallParam]]
    """The tool calls generated by the model, such as function calls."""
Response Schema
The response schema follows the OpenAI specification closely while incorporating vLLM-specific enhancements for advanced use cases like KV caching optimization and detailed logging.
| Field | Type | Description |
|---|---|---|
| object | str | Object type (chat.completion or chat.completion.chunk) |
| created | int | Creation time represented as Unix timestamp |
| model | str | Model name used |
| choices | list[ChatCompletionResponseChoice] | Array of generated completion choices |
| usage | UsageInfo | Token usage information |
| prompt_logprobs | Optional[list[Optional[dict[int, Logprob]]]] | Prompt log probability information |
| kv_transfer_params | Optional[dict[str, Any]] | KVTransfer parameters |
Choice Object
Each choice represents a single completion generated by the model. The choice object contains the actual generated content along with metadata about the generation process.
vllm/entrypoints/openai/protocol.py
...
class ChatCompletionResponseChoice(OpenAIBaseModel):
    index: int
    message: ChatMessage
    logprobs: Optional[ChatCompletionLogProbs] = None
    # per OpenAI spec this is the default
    finish_reason: Optional[str] = "stop"
    # not part of the OpenAI spec but included in vLLM for legacy reasons
    stop_reason: Optional[Union[int, str]] = None
...
The stop_reason field is a vLLM legacy field outside the OpenAI spec; it provides information similar to finish_reason.
Usage Object
The usage object provides detailed token consumption metrics, essential for billing, monitoring, and optimization purposes.
vllm/entrypoints/openai/protocol.py
class UsageInfo(OpenAIBaseModel):
    prompt_tokens: int = 0
    total_tokens: int = 0
    completion_tokens: Optional[int] = 0
    prompt_tokens_details: Optional[PromptTokenUsageInfo] = None
| Field | Type | Description |
|---|---|---|
| prompt_tokens | int | Number of tokens used in prompt |
| total_tokens | int | Total tokens (prompt + completion) |
| completion_tokens | Optional[int] | Number of tokens generated in completion |
| prompt_tokens_details | Optional[PromptTokenUsageInfo] | Detailed prompt token usage information |
Router
vLLM’s OpenAI-compatible server is built on FastAPI, providing a robust and high-performance web framework for serving LLM requests. When a user sends a POST request to /v1/chat/completions, FastAPI’s routing system directs the request to the following function, which serves as the main entry point for chat completion requests.
I can see that the handler is retrieved through the chat() function, which returns the openai_serving_chat instance that was registered in app.state during server initialization, as sketched below.
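The route handler itself is not reproduced here; in vllm/entrypoints/openai/api_server.py it follows roughly the pattern below. This is an abridged sketch written from memory, not a verbatim copy, and details vary by version:

from typing import Optional

from fastapi import APIRouter, Request
from fastapi.responses import JSONResponse, StreamingResponse

from vllm.entrypoints.openai.protocol import (ChatCompletionRequest,
                                              ChatCompletionResponse,
                                              ErrorResponse)
from vllm.entrypoints.openai.serving_chat import OpenAIServingChat

router = APIRouter()


def chat(request: Request) -> Optional[OpenAIServingChat]:
    # The handler instance is stored on app.state by init_app_state()
    # during server startup.
    return request.app.state.openai_serving_chat


@router.post("/v1/chat/completions")
async def create_chat_completion(request: ChatCompletionRequest,
                                 raw_request: Request):
    handler = chat(raw_request)

    generator = await handler.create_chat_completion(request, raw_request)

    if isinstance(generator, ErrorResponse):
        return JSONResponse(content=generator.model_dump(),
                            status_code=generator.code)
    if isinstance(generator, ChatCompletionResponse):
        return JSONResponse(content=generator.model_dump())

    # Streaming case: an async generator of "data: ..." SSE strings.
    return StreamingResponse(content=generator, media_type="text/event-stream")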
The Request object is a class included in the Starlette framework, and it inherits the app property from its parent class HTTPConnection. This design provides access to the application state and configuration throughout the request lifecycle.
starlette/requests.py
...
class Request(HTTPConnection):
    ...
The app property provides access to the FastAPI application instance, while scope contains ASGI (Asynchronous Server Gateway Interface) information about the current request. This architecture follows the ASGI specification, enabling efficient handling of asynchronous web requests.
starlette/requests.py
...
class HTTPConnection(Mapping[str, Any]):
    """
    A base class for incoming HTTP connections, that is used to provide
    any functionality that is common to both `Request` and `WebSocket`.
    """
    ...
    @property
    def app(self) -> Any:
        return self.scope["app"]
...
Application State Initialization
Looking at the initialization of state.openai_serving_chat, it takes place in the init_app_state() function during server startup, ensuring that all necessary components are ready before any request is handled.
The app.state mechanism can be tested with the following example. This demonstrates how FastAPI’s application state works in practice and how components are shared across request handlers.
from random import random
from typing import Optional

import uvicorn
import uvloop
from fastapi import FastAPI, Request
from fastapi.datastructures import State
from loguru import logger
from pydantic import BaseModel
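The rest of the example was omitted above; a reconstruction along the following lines (hypothetical, not the author's original code) produces logs like the ones shown next:

# (continuing the imports above; this body is a reconstruction written to
# reproduce the logged behavior, not the original example)


class ChatCompletionRequest(BaseModel):
    model: Optional[str] = None
    messages: list[dict] = []


class OpenAIServingChat:
    def __init__(self) -> None:
        logger.info("Init: OpenAIServingChat")

    async def create_chat_completion(self, request: ChatCompletionRequest) -> dict:
        logger.info("Run: OpenAIServingChat.create_chat_completion")
        return {"id": f"chatcmpl-{random()}", "object": "chat.completion", "choices": []}


def init_app_state(state: State) -> None:
    # Mimics vLLM: the handler is registered on app.state before startup.
    state.openai_serving_chat = OpenAIServingChat()


app = FastAPI()
init_app_state(app.state)


@app.post("/v1/chat/completions")
async def create_chat_completion(request: ChatCompletionRequest, raw_request: Request):
    logger.info(f"raw_request={raw_request}")
    handler = raw_request.app.state.openai_serving_chat
    return await handler.create_chat_completion(request)


if __name__ == "__main__":
    uvloop.install()
    uvicorn.run(app, host="0.0.0.0", port=8000)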
Examining the server logs reveals the initialization sequence: the OpenAIServingChat instance is initialized before FastAPI starts running. When a request arrives, the handler is retrieved from request.app.state.openai_serving_chat and executed.
This pattern demonstrates FastAPI’s application lifecycle management, where:
Initialization Phase: Critical components are set up during server startup
Request Phase: Pre-initialized components are accessed through the application state
Processing Phase: The actual request handling occurs with the retrieved handler
2025-06-16 23:38:46.972 | INFO | __main__:__init__:16 - Init: OpenAIServingChat
INFO: Started server process [52024]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
2025-06-16 23:38:49.021 | INFO | __main__:create_chat_completion:38 - raw_request=<starlette.requests.Request object at 0x105a80a50>
2025-06-16 23:38:49.021 | INFO | __main__:create_chat_completion:19 - Run: OpenAIServingChat.create_chat_completion
INFO: 127.0.0.1:61279 - "POST /v1/chat/completions HTTP/1.1" 200 OK
vllm/entrypoints/openai/serving_chat.py
...
class OpenAIServingChat(OpenAIServing):
    ...
    async def create_chat_completion(
        self,
        request: ChatCompletionRequest,
        raw_request: Optional[Request] = None,
    ) -> Union[AsyncGenerator[str, None], ChatCompletionResponse,
               ErrorResponse]:
        """
        Chat Completion API similar to OpenAI's API.

        See https://platform.openai.com/docs/api-reference/chat/create
        for the API specification. This API mimics the OpenAI
        Chat Completion API.
        """
        ...
Chat Completion Processing Pipeline
As I observed in the router’s create_chat_completion() function above, all preprocessing, LLM inference, and postprocessing for /v1/chat/completions requests are performed within the following method.
vllm/entrypoints/openai/serving_chat.py
...
class OpenAIServingChat(OpenAIServing):
    ...
    async def create_chat_completion(
        self,
        request: ChatCompletionRequest,
        raw_request: Optional[Request] = None,
    ) -> Union[AsyncGenerator[str, None], ChatCompletionResponse,
               ErrorResponse]:
        """
        Chat Completion API similar to OpenAI's API.

        See https://platform.openai.com/docs/api-reference/chat/create
        for the API specification. This API mimics the OpenAI
        Chat Completion API.
        """
        ...
How does the complete processing flow work? Let’s examine the step-by-step process:
vllm/entrypoints/openai/serving_chat.py
...
class OpenAIServingChat(OpenAIServing):
    ...
    async def create_chat_completion(
        ...
        error_check_ret = await self._check_model(request)
        if error_check_ret is not None:
            logger.error("Error with model %s", error_check_ret)
            return error_check_ret
        ...
Model Validation: The OpenAIServing._check_model() method validates that the request’s "model" name is correctly configured.
vllm/entrypoints/openai/serving_chat.py
...
class OpenAIServingChat(OpenAIServing):
    ...
    async def create_chat_completion(
        ...
        # If the engine is dead, raise the engine's DEAD_ERROR.
        # This is required for the streaming case, where we return a
        # success status before we actually start generating text :).
        if self.engine_client.errored:
            raise self.engine_client.dead_error
        ...
Engine Health Check: If the engine process has died, its dead_error is raised immediately. As the comment explains, this matters for streaming requests, where a success status is returned before text generation actually begins.
Mistral Tokenizer Handling: As of v0.9.0.1, there are Pydantic-related issues with MistralTokenizer (vllm-project/vllm#9951, pydantic/pydantic#9467, pydantic/pydantic#9541) that require the special handling shown below.
vllm/entrypoints/openai/serving_chat.py
...
class OpenAIServingChat(OpenAIServing):
    ...
    async def create_chat_completion(
        ...
        try:
            ...
            if isinstance(tokenizer, MistralTokenizer):
                # because of issues with pydantic we need to potentially
                # re-serialize the tool_calls field of the request
                # for more info: see comment in `maybe_serialize_tool_calls`
                maybe_serialize_tool_calls(request)
                truncate_tool_call_ids(request)
                validate_request_params(request)
        ...
Tool Configuration: When the request’s tool_choice is "auto", it undergoes validation and generates tool_dicts.
Content Format and Conversation Setup: Prepares resolved_content_format (the content format for chat templates, derived from the tools and model configuration), conversation (the parsed conversation messages, including multimodal data handling), and mm_data_future (a future for asynchronous multimodal data processing), then merges the user-specified chat_template_kwargs into _chat_template_kwargs, the internal chat template configuration dictionary.
Chat Template Application: The request_prompt is obtained based on the tokenizer type: models using MistralTokenizer go through the apply_mistral_chat_template() function, while all other models use the apply_hf_chat_template() function to generate the request_prompt.
Process tool parsing if enabled: When a tool parser is configured and the tool choice is not "none", the system determines whether tool parsing should be performed. If tools are being used, the request is adjusted through the tool parser to handle function calling capabilities. This step ensures that the model can correctly interpret and respond to tool-related requests.
vllm/entrypoints/openai/serving_engine.py
...
class OpenAIServing:
    ...
    async def _preprocess_chat(
        ...
        # tool parsing is done only if a tool_parser has been set and if
        # tool_choice is not "none" (if tool_choice is "none" but a tool_parser
        # is set, we want to prevent parsing a tool_call hallucinated by the LLM
        should_parse_tools = tool_parser is not None and (hasattr(
            request, "tool_choice") and request.tool_choice != "none")

        if should_parse_tools:
            if not isinstance(request, ChatCompletionRequest):
                msg = "Tool usage is only supported for Chat Completions API"
                raise NotImplementedError(msg)

            request = tool_parser(tokenizer).adjust_request(  # type: ignore
                request=request)
        ...
Tokenize the request prompt: Convert the string-based prompt into token format for model processing. For string prompts, the system uses asynchronous tokenization with optional prompt truncation and special token handling through the OpenAIServing._tokenize_prompt_input_async() method, which performs tokenization in a thread pool to prevent blocking the main event loop. For MistralTokenizer, token IDs are already provided, so the system creates a TextTokensPrompt object containing both the decoded text and the token IDs.
...
class OpenAIServing:
    ...
    async def _preprocess_chat(
        ...
        if isinstance(request_prompt, str):
            prompt_inputs = await self._tokenize_prompt_input_async(
                request,
                tokenizer,
                request_prompt,
                truncate_prompt_tokens=truncate_prompt_tokens,
                add_special_tokens=add_special_tokens,
            )
        else:
            # For MistralTokenizer
            assert is_list_of(request_prompt, int), (
                "Prompt has to be either a string or a list of token ids")
            prompt_inputs = TextTokensPrompt(
                prompt=tokenizer.decode(request_prompt),
                prompt_token_ids=request_prompt)
        ...
Create the engine prompt: Construct the final EngineTokensPrompt object that will be passed to the inference engine. This includes the tokenized prompt, multimodal data (if present), multimodal processor kwargs, and cache salt for caching optimization. The function returns the processed conversation, request prompt, and engine prompt for the next stage of processing.
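As a rough mental model of the structure just described, the engine prompt can be pictured as a TypedDict along these lines. This is a hypothetical simplification, not vLLM's actual definition; see vLLM's own input TypedDicts for the authoritative fields:

from typing import Any, TypedDict

from typing_extensions import NotRequired


class EngineTokensPromptSketch(TypedDict):
    """Hypothetical simplification of the engine prompt fields described above."""

    prompt_token_ids: list[int]
    multi_modal_data: NotRequired[Any]                # only for multimodal requests
    mm_processor_kwargs: NotRequired[dict[str, Any]]  # multimodal processor kwargs
    cache_salt: NotRequired[str]                      # cache salt for caching optimization


# Illustrative instance for a text-only request (token ids are made up).
engine_prompt: EngineTokensPromptSketch = {"prompt_token_ids": [151644, 872, 198]}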
Inference is performed through the OpenAIServingChat(OpenAIServing).engine_client.generate() method. In this document, I’m using AsyncLLM(EngineClient) as the engine_client, so let me examine the AsyncLLM(EngineClient).generate() method.
Engine Client
Initialize the output handler: the AsyncLLM(EngineClient)._run_output_handler() method is called to create and store the AsyncLLM(EngineClient).output_handler background task.
...
class AsyncLLM(EngineClient):
    ...
    async def generate(
        self,
        prompt: PromptType,
        sampling_params: SamplingParams,
        request_id: str,
        lora_request: Optional[LoRARequest] = None,
        trace_headers: Optional[Mapping[str, str]] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
        priority: int = 0,
    ) -> AsyncGenerator[RequestOutput, None]:
        """
        Main function called by the API server to kick off a request
            * 1) Making an AsyncStream corresponding to the Request.
            * 2) Processing the Input.
            * 3) Adding the Request to the Detokenizer.
            * 4) Adding the Request to the EngineCore (separate process).

        A separate output_handler loop runs in a background AsyncIO task,
        pulling outputs from EngineCore and putting them into the
        per-request AsyncStream.

        The caller of generate() iterates the returned AsyncGenerator,
        returning the RequestOutput back to the caller.
        """
        try:
            # We start the output_handler on the first call to generate() so
            # we can call __init__ before the event loop, which enables us
            # to handle startup failure gracefully in the OpenAI server.
            self._run_output_handler()
            ...
The output_handler executes in the following order:
Pull EngineCoreOutputs from the EngineCore: Continuously polls the engine core for outputs using await engine_core.get_output_async() and processes them in chunks to avoid blocking the event loop.
Process EngineCoreOutputs: Each output chunk is processed through output_processor.process_outputs() which converts raw engine outputs into formatted request outputs and pushes them to appropriate async streams.
Handle request aborts: Processes any requests that need to be aborted due to stop strings or other completion conditions via await engine_core.abort_requests_async().
Performance logging: Records scheduler statistics and iteration metrics for monitoring and debugging purposes.
...
class AsyncLLM(EngineClient):
    ...
    def _run_output_handler(self):
        """Background loop: pulls from EngineCore and pushes to AsyncStreams."""

        if self.output_handler is not None:
            return

        # Ensure that the task doesn't have a circular ref back to the AsyncLLM
        # object, or else it won't be garbage collected and cleaned up properly.
        engine_core = self.engine_core
        output_processor = self.output_processor
        log_stats = self.log_stats
        stat_loggers = self.stat_loggers if log_stats else None

        async def output_handler():
            try:
                while True:
                    # 1) Pull EngineCoreOutputs from the EngineCore.
                    outputs = await engine_core.get_output_async()
                    num_outputs = len(outputs.outputs)

                    iteration_stats = IterationStats() if (
                        log_stats and num_outputs) else None

                    # Split outputs into chunks of at most
                    # VLLM_V1_OUTPUT_PROC_CHUNK_SIZE, so that we don't block the
                    # event loop for too long.
                    if num_outputs <= VLLM_V1_OUTPUT_PROC_CHUNK_SIZE:
                        slices = (outputs.outputs, )
                    else:
                        slices = np.array_split(
                            outputs.outputs,
                            cdiv(num_outputs, VLLM_V1_OUTPUT_PROC_CHUNK_SIZE))

                    for i, outputs_slice in enumerate(slices):
                        # 2) Process EngineCoreOutputs.
                        processed_outputs = output_processor.process_outputs(
                            outputs_slice, outputs.timestamp, iteration_stats)
                        # NOTE: RequestOutputs are pushed to their queues.
                        assert not processed_outputs.request_outputs

                        # Allow other asyncio tasks to run between chunks
                        if i + 1 < len(slices):
                            await asyncio.sleep(0)

                    # 3) Abort any reqs that finished due to stop strings.
                    await engine_core.abort_requests_async(
                        processed_outputs.reqs_to_abort)

                    # 4) Logging.
                    # TODO(rob): make into a coroutine and launch it in
                    # background thread once Prometheus overhead is non-trivial.
                    if stat_loggers:
                        assert outputs.scheduler_stats is not None
                        AsyncLLM._record_stats(
                            stat_loggers[outputs.engine_index],
                            scheduler_stats=outputs.scheduler_stats,
                            iteration_stats=iteration_stats,
                        )
            except Exception as e:
                logger.exception("AsyncLLM output_handler failed.")
                output_processor.propagate_error(e)
AsyncLLM(EngineClient).add_request() operates as follows:
Process input and create request: Converts the input prompt and parameters into an internal request object using self.processor.process_inputs(), which handles tokenization, parameter validation, and request formatting.
Send request to core engine: The AsyncLLM(EngineClient)._add_request() method calls the AsyncMPClient(MPClient).add_request_async() method to send an EngineCoreRequestType.ADD request to the core engine, enabling asynchronous communication between the client and the engine process for efficient request queuing and processing.
Process the request through the busy loop: The submitted request is picked up by EngineCoreProc via the busy loop shown below and handed to the EngineCoreProc(EngineCore).scheduler.
...
class EngineCore:
    """Inner loop of vLLM's Engine."""
    ...
    def run_busy_loop(self):
        """Core busy loop of the EngineCore."""

        # Loop until process is sent a SIGINT or SIGTERM
        while True:
            # 1) Poll the input queue until there is work to do.
            self._process_input_queue()
            # 2) Step the engine core and return the outputs.
            self._process_engine_step()

    def _process_input_queue(self):
        """Exits when an engine step needs to be performed."""

        waited = False
        while not self.engines_running and not (self.scheduler.has_requests()):
            if logger.isEnabledFor(DEBUG) and self.input_queue.empty():
                logger.debug("EngineCore waiting for work.")
                waited = True
            req = self.input_queue.get()
            self._handle_client_request(*req)

        if waited:
            logger.debug("EngineCore loop active.")

        # Handle any more client requests.
        while not self.input_queue.empty():
            req = self.input_queue.get_nowait()
            self._handle_client_request(*req)

    def _process_engine_step(self):
        """Called only when there are unfinished local requests."""

        # Step the engine core.
        outputs = self.step_fn()
        # Put EngineCoreOutputs into the output queue.
        if outputs is not None:
            self.output_queue.put_nowait(outputs)
...
class EngineCore:
    """Inner loop of vLLM's Engine."""
    ...
    def add_request(self, request: EngineCoreRequest):
        """Add request to the scheduler."""

        if request.mm_hashes is not None:
            # Here, if hash exists for a multimodal input, then it will be
            # fetched from the cache, else it will be added to the cache.
            # Note that the cache here is mirrored with the client cache, so
            # anything that has a hash must have a HIT cache entry here
            # as well.
            assert request.mm_inputs is not None
            request.mm_inputs = self.mm_input_cache_server.get_and_update_p1(
                request.mm_inputs, request.mm_hashes)

        ...  # (the EngineCoreRequest is converted into an internal Request object, req)

        if req.kv_transfer_params is not None and (
                not self.scheduler.get_kv_connector()):
            logger.warning("Got kv_transfer_params, but no KVConnector found. "
                           "Disabling KVTransfer for this request.")

        self.scheduler.add_request(req)
    ...
Within the busy loop, each engine step proceeds as follows:
Determine the step function and run scheduling: Based on the EngineCoreProc(EngineCore).model_executor.max_concurrent_batches value, EngineCoreProc(EngineCore).step_fn is set to one of the two methods below; when EngineCoreProc(EngineCore)._process_engine_step() runs, the chosen step function internally calls the Scheduler(SchedulerInterface).schedule() method.
Scheduling logic: The scheduler determines which requests to process next based on factors like priority, available resources, sequence length, and batching constraints. It creates batched sequences for efficient GPU utilization and manages the transition of requests between different states (waiting, running, swapped).
...
class EngineCore:
    ...
    def step(self) -> EngineCoreOutputs:
        """Schedule, execute, and make output."""

        # Check for any requests remaining in the scheduler - unfinished,
        # or finished and not yet removed from the batch.
        if not self.scheduler.has_requests():
            return EngineCoreOutputs(
                outputs=[],
                scheduler_stats=self.scheduler.make_stats(),
            )
        scheduler_output = self.scheduler.schedule()
        model_output = self.execute_model(scheduler_output)
        engine_core_outputs = self.scheduler.update_from_output(
            scheduler_output, model_output)  # type: ignore

        return engine_core_outputs

    def step_with_batch_queue(self) -> Optional[EngineCoreOutputs]:
        """Schedule and execute batches with the batch queue.
        Note that if nothing to output in this step, None is returned.

        The execution flow is as follows:
        1. Try to schedule a new batch if the batch queue is not full.
           If a new batch is scheduled, directly return an empty engine core
           output. In other words, fulfilling the batch queue has a higher
           priority than getting model outputs.
        2. If there is no new scheduled batch, meaning that the batch queue
           is full or no other requests can be scheduled, we block until the
           first batch in the job queue is finished.
        3. Update the scheduler from the output.
        """
        assert self.batch_queue is not None

        engine_core_outputs = None
        scheduler_output = None
        # Try to schedule a new batch if the batch queue is not full, but
        # the scheduler may return an empty batch if all requests are scheduled.
        # Note that this is not blocking.
        if not self.batch_queue.full():
            scheduler_output = self.scheduler.schedule()
            if scheduler_output.total_num_scheduled_tokens > 0:
                future = self.model_executor.execute_model(scheduler_output)
                self.batch_queue.put_nowait(
                    (future, scheduler_output))  # type: ignore

        scheduled_batch = (scheduler_output is not None
                           and scheduler_output.total_num_scheduled_tokens > 0)

        # If no more requests can be scheduled and the job queue is not empty,
        # block until the first batch in the job queue is finished.
        # TODO(comaniac): Ideally we should peek the first batch in the
        # job queue to check if it's finished before scheduling a new batch,
        # but peeking the first element in a queue is not thread-safe,
        # so we need more work.
        if not scheduled_batch and not self.batch_queue.empty():
            future, scheduler_output = self.batch_queue.get_nowait()
            # Blocking until the first result is available.
            model_output = future.result()
            self.batch_queue.task_done()
            engine_core_outputs = self.scheduler.update_from_output(
                scheduler_output, model_output)

        return engine_core_outputs
    ...
Executor
Execute model with scheduler output: The EngineCoreProc(EngineCore).model_executor.execute_model() method is executed using the SchedulerOutput (which contains batched sequences, execution metadata, and resource allocation information) from the Scheduler(SchedulerInterface).schedule() method output.
vllm/v1/engine/core.py
...
class EngineCore:
    ...
    def execute_model(self, scheduler_output: SchedulerOutput):
        try:
            return self.model_executor.execute_model(scheduler_output)
        except BaseException as err:
            # NOTE: This method is exception-free
            dump_engine_exception(self.vllm_config, scheduler_output,
                                  self.scheduler.make_stats())
            # Re-raise exception
            raise err
...
Send model inference request: The model inference request is sent through the UniProcExecutor(UniProcExecutorV0, Executor).collective_rpc() method.
Execute model inference: The Worker(WorkerBase) that receives the request executes the execute_model() method and performs actual model inference through the GPUModelRunner(LoRAModelRunnerMixin).execute_model() method.
...
class AsyncLLM(EngineClient):
    ...
    async def generate(
        ...
        try:
            ...
            # The output_handler task pushes items into the queue.
            # This task pulls from the queue and yields to caller.
            finished = False
            while not finished:
                # Note: drain queue without await if possible (avoids
                # task switching under load which helps performance).
                out = q.get_nowait() or await q.get()

                # Note: both OutputProcessor and EngineCore handle their
                # own request cleanup based on finished.
                finished = out.finished
                yield out
        ...
Postprocessing
The process of preparing the response that users will receive is very complex, so the code for this section has been excluded.
Buffered Response
Method Initialization
The method accepts parameters including ChatCompletionRequest, AsyncIterator[RequestOutput], request metadata, etc.
Records the current timestamp with created_time = int(time.time())
Initializes final_res: Optional[RequestOutput] = None to store the final result
Result Generation Loop
Iterates through result_generator using async for res in result_generator:
Continuously updates final_res = res to get the final output
Handles exceptions:
asyncio.CancelledError: Returns error response for client disconnection
ValueError: Returns error response with the exception message
Streaming Response
Creates final choice with appropriate finish_reason
Sets finish_reason_sent[i] = True
Chunk Creation and Yielding
Creates ChatCompletionStreamResponse chunk
Adds continuous usage stats if requested
Yields formatted chunk: f"data: {data}\n\n"
Final Usage Statistics
If include_usage is True:
Calculates total completion tokens
Creates UsageInfo with final statistics
Adds prompt token details if enabled
Yields final usage chunk
Metadata and Error Handling
Sets request_metadata.final_usage_info with aggregate usage
Exception Handling: Catches all exceptions and yields error response
Final Response: Yields "data: [DONE]\n\n" to signal completion
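To make the streaming contract concrete, here is a minimal client sketch (using the requests library and the same endpoint and model as the earlier examples; it is not taken from the article's source) that consumes chunks in exactly the format described above:

import json

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Hello, World!"}],
        "stream": True,
        "stream_options": {"include_usage": True},
    },
    stream=True,
)

for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data: "):
        continue                      # skip blank keep-alive lines
    payload = line[len("data: "):]
    if payload == "[DONE]":           # final sentinel described above
        break
    chunk = json.loads(payload)       # a ChatCompletionStreamResponse chunk
    if chunk["choices"]:              # the usage-only chunk has no choices
        delta = chunk["choices"][0]["delta"].get("content") or ""
        print(delta, end="", flush=True)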
Conclusion
This comprehensive analysis of vLLM’s /v1/chat/completions endpoint reveals the sophisticated architecture powering OpenAI-compatible inference serving. The journey from a simple HTTP request to a complete chat response involves multiple layers of abstraction, each meticulously optimized for performance, scalability, and reliability.
Below is a sequence diagram summarizing this article:
sequenceDiagram
participant Client
participant FastAPI
participant OpenAIServingChat as OpenAIServingChat(OpenAIServing)
participant AsyncLLM as AsyncLLM(EngineClient)
participant AsyncMPClient as AsyncMPClient(MPClient)
participant ZMQ as ZeroMQ
participant EngineCoreProc as EngineCoreProc(EngineCore)
participant Scheduler as Scheduler(SchedulerInterface)
participant UniProcExecutor as UniProcExecutor(UniProcExecutorV0|Executor)
participant Worker as Worker(WorkerBase)
participant GPUModelRunner as GPUModelRunner(LoRAModelRunnerMixin)
participant OutputProcessor
EngineCoreProc-->>EngineCoreProc: run_busy_loop()
Client->>FastAPI: POST /v1/chat/completions
FastAPI->>OpenAIServingChat: create_chat_completion(ChatCompletionRequest)
OpenAIServingChat->>OpenAIServingChat: _check_model, _preprocess_chat, etc.
OpenAIServingChat->>AsyncLLM: generate()
AsyncLLM->>AsyncMPClient: add_request(EngineCoreRequest)
AsyncMPClient->>ZMQ: add_request_async(EngineCoreRequest)
EngineCoreProc->>ZMQ: _handle_client_request(EngineCoreRequestType)
ZMQ-->>EngineCoreProc: add_request(EngineCoreRequest)
EngineCoreProc->>Scheduler: add_request(Request)
rect rgb(255,128,128)
note over EngineCoreProc: step_fn()
EngineCoreProc->>Scheduler: schedule()
Scheduler-->>EngineCoreProc: SchedulerOutput
EngineCoreProc->>UniProcExecutor: execute_model(SchedulerOutput)
UniProcExecutor->>Worker: collective_rpc("execute_model")
Worker->>GPUModelRunner: execute_model(SchedulerOutput)
GPUModelRunner-->>Worker: ModelRunnerOutput | IntermediateTensors
Worker-->>UniProcExecutor: ModelRunnerOutput
UniProcExecutor-->>EngineCoreProc: ModelRunnerOutput
EngineCoreProc->>Scheduler: update_from_output(SchedulerOutput, ModelRunnerOutput)
Scheduler->>EngineCoreProc: EngineCoreOutputs
end
EngineCoreProc-->>EngineCoreProc: put_nowait(EngineCoreOutputs)
EngineCoreProc->>ZMQ: process_output_socket()
rect rgb(128,128,255)
note over AsyncLLM: output_handler()
AsyncLLM->>AsyncMPClient: get_output_async()
AsyncMPClient->>ZMQ: process_outputs_socket()
ZMQ-->>AsyncLLM: EngineCoreOutputs
AsyncLLM->>OutputProcessor: process_outputs()
OutputProcessor-->>AsyncLLM: OutputProcessorOutput
end
AsyncLLM-->>OpenAIServingChat: AsyncGenerator[RequestOutput, None]
OpenAIServingChat-->>FastAPI: ChatCompletionResponse / AsyncGenerator
FastAPI-->>Client: JSONResponse / StreamingResponse
The structure turned out to be much more complex than I expected, making this article quite lengthy with many parts omitted. In future articles, I’ll take a closer look at core components like EngineCoreProc(EngineCore), Scheduler(SchedulerInterface), and GPUModelRunner(LoRAModelRunnerMixin).