Code Review: Deep Dive into vLLM's Architecture and Implementation Analysis of OpenAI-Compatible Serving (2/2)

Posted on 2025-06-20 In 4. MLOps

Introduction

In the previous article, I explored why vLLM is gaining popularity and the process of setting up an OpenAI-compatible server when using vllm serve.
While the first article focused on the architectural foundations and server initialization process, in this article, I want to dive deeper into the runtime behavior and request processing pipeline.

The /v1/chat/completions endpoint has become the de facto standard for conversational AI applications, powering everything from customer service chatbots to sophisticated AI assistants.
Unlike the legacy /v1/completions endpoint, which operates on simple text completion, the chat completions endpoint provides structured message handling, role-based conversations, and built-in context management.

Through this deep dive, I’ll walk you through:

Endpoint Comparison: Detailed comparison between /v1/completions and /v1/chat/completions
Request Processing: Step-by-step breakdown of how chat messages are preprocessed and transformed
Chat Template System: How vLLM applies model-specific chat templates to structure conversations
Internal Pipeline: Deep dive into the inference process, from message parsing to response generation
Performance Considerations: Understanding token efficiency and memory management in chat contexts

By examining vLLM’s implementation of the OpenAI-compatible chat completions endpoint, I’ll uncover the sophisticated engineering that enables high-performance conversational AI serving while maintaining full API compatibility.

Code Review: Deep Dive into vLLM's Architecture and Implementation Analysis of OpenAI-Compatible Serving (1/2)

Posted on 2025-06-13 In 4. MLOps

Introduction

vLLM $_[$$_{1}$$_,$$_{2}$$_]$ is a fast and easy-to-use library for LLM inference and serving.
Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry. $_[$$_{3}$$_]$

The rapid advancement of Large Language Models (LLMs) has brought efficient model serving and inference optimization to the forefront of MLOps concerns.
In response to these challenges, vLLM has emerged as a leading solution, garnering significant attention with 49.2k stars on GitHub as of June 9, 2025.
As demonstrated in the star history graph below, vLLM has established itself as the most prominent LLM serving framework among various competing solutions.

A particularly noteworthy aspect is the standardized API interface provided by OpenAI’s GPT series.
With countless developers already building applications based on this API specification, ensuring compatibility has become crucial for any LLM serving solution.
This article provides a comprehensive analysis of vLLM’s core technological foundations and examines the internal implementation processes that enable OpenAI-compatible server deployment when executing the vllm serve command.

System Design Interview Volume 2 (9)

Posted on 2025-05-26 In 3. DevOps

결제 시스템

결제 (payment) system: 금전적 가치의 이전을 통해 금융 거래를 정산하는 데 사용되는 모든 system

1단계: 문제 이해 및 설계 범위 확정

기능 요구사항
- 대금 수신 (pay-in) 흐름: 결제 system이 판매자를 대신하여 고객으로부터 대금 수령
- 대금 정산 (pay-out) 흐름: 결제 system이 전 세계의 판매자에게 제품 판매 대금 송금
비기능 요구사항
- 신뢰성 및 내결함성: 결제 실패는 신중하게 처리
- 내부 service (결제 system, 회계 system)와 외부 service (결제 servcice 제공업체) 간의 조정 process: System 간의 결제 정보가 일치하는지 비동기적으로 확인
개략적인 규모 추정
- 하루에 100만 건의 transaction 처리
- $1,000,000/10^5=10TPS$
- 10TPS는 일반적 database로 문제 없이 처리 가능하기 때문에 처리 대역폭 대신 결제 transaction의 정확한 처리에 초점

System Design Interview Volume 2 (8)

Posted on 2025-05-19 In 3. DevOps

실시간 게임 순위표

순위표: Leaderboard of an online mobile game

1단계: 문제 이해 및 설계 범위 확정

기능 요구사항
- 순위표에 상위 10명의 player 표시
- 특정 사용자의 순위 표시
- 어떤 사용자보다 4순위 위와 아래에 있는 사용자 표시
비기능 요구사항
- 점수 update는 실시간으로 순위표에 반영
- 일반적인 확장성, 가용성 및 안정성 요구사항
개략적 규모 추정
- Game을 하는 사용자가 24시간 동안 고르게 분포 가정
  - DAU가 5,000,000명인 경우 초당 평균 50명 game play
  - $\because\frac{5,000,000DAU}{10^5sec}\simeq50$
- 하지만 그렇게 균등한 경우는 존재하지 않고 북미 지역 기준 저녁 시간이 peak 시간대일 가능성이 높음
  - 최대 부하는 평균의 5배라 가정
  - $\therefore$ 초당 최대 250명의 사용자를 감당할 수 있어야 함
- 사용자 점수 획득 QPS
  - 한 사용자가 하루 평균 10개 game play 가정
  - $\therefore 50\times10\times5=2,500$
- 상위 10명 순위표 가져오기 QPS
  - 각 사용자가 하루에 한 번 game을 열고 상위 10명 순위표는 사용자가 처음 게임을 열 때만 표시한다고 가정
  - 초당 평균 50명이 game play하기 때문에 QPS는 약 50

System Design Interview Volume 2 (7)

Posted on 2025-05-13 In 3. DevOps

S3와 유사한 객체 저장소

Amazon S3 (Simple Storage Service): RESTful API 기반 interface로 이용 가능한 객체 저장소

2006년 6월: S3 service 시작
2010년: Versioning 기능, bucket policy, multipart upload 기능 제공
2011년: Server 측 암호화, 여러 객체 삭제, 객체 만료 등 지원
2013년: S3에 약 2조 개의 객체 저장
2014년 ~ 2015년: Life cycle policy, event notification, cross-region replication 등 기능 도입
2021년: S3에 약 100조 개 이상의 객체 저장