
Container Service for Kubernetes:Gateway with Inference Extension

Last Updated: Sep 16, 2025

Gateway with Inference Extension is an enhanced component based on the Kubernetes Gateway API and its Inference Extension specification. It provides Layer 4 and Layer 7 routing in Kubernetes and intelligent load balancing for large language model (LLM) inference scenarios. This topic describes the Gateway with Inference Extension component, its usage, and its release notes.

Component information

Gateway with Inference Extension is built on the Envoy Gateway project. It is compatible with Gateway API features, integrates the Gateway API Inference Extension, and is primarily used to provide load balancing and routing for LLM inference services.
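As a sketch of how the component is typically used (all resource names below are placeholders, and the exact API versions may differ between component releases), an HTTPRoute can send LLM traffic to an InferencePool instead of a regular Service:

```yaml
# Illustrative only: names, labels, and ports are placeholders.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-pool
spec:
  selector:
    app: vllm-server        # pods that run the inference backend
  targetPortNumber: 8000    # port the model server listens on
  extensionRef:
    name: vllm-pool-epp     # endpoint picker that performs the intelligent load balancing
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway   # a Gateway managed by this component
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-pool
```

Unlike a Service backend, an InferencePool lets the gateway pick endpoints based on inference-specific signals (for example, prefix cache locality) rather than plain round-robin.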

Usage notes

The Gateway with Inference Extension component depends on the custom resource definitions (CRDs) provided by the Gateway API component. Before you install this component, make sure that the Gateway API component is installed in your cluster. For more information, see Install components.

Release notes

May 2025

Version number: v1.4.0-aliyun.1

Modification time: May 27, 2025

Changes:

  • Supports Gateway API v1.3.0.

  • Inference extension enhancements:

    • Supports multiple inference service frameworks, such as vLLM, SGLang, and TensorRT-LLM.

    • Supports prefix-aware load balancing.

    • Supports routing for inference services based on model names.

    • Supports request queuing and priority scheduling for inference.

  • Supports observability for generative AI requests.

  • Supports global rate limiting.

  • Supports token-based global rate limiting for generative AI requests.

  • Supports adding Secret content to specified request headers.

Impact: Upgrading from an earlier version restarts the gateway pod. Perform the upgrade during off-peak hours.
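In the upstream Gateway API Inference Extension, model-name-based routing is expressed through InferenceModel objects that map a requested model name to a pool. The sketch below assumes that model (names and the API version are illustrative placeholders):

```yaml
# Illustrative only: names are placeholders.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-chat
spec:
  modelName: qwen-chat    # matched against the "model" field of OpenAI-style requests
  criticality: Critical   # hint used for request queuing and priority scheduling
  poolRef:
    name: vllm-pool       # InferencePool that serves this model
```

Requests whose body names a different model can be mapped to other pools by creating additional InferenceModel objects.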

April 2025

Version number: v1.3.0-aliyun.2

Modification time: May 7, 2025

Changes:

  • Supports ACS clusters.

  • Inference extension enhancements: supports referencing InferencePool resources in HTTPRoute, as well as weight-based routing, traffic mirroring, and circuit breaking at the InferencePool level.

  • Supports prefix-aware load balancing.

Impact: Upgrading from an earlier version restarts the gateway pod. Perform the upgrade during off-peak hours.
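Weight-based routing at the InferencePool level follows the standard Gateway API weighted backendRefs pattern; a hedged sketch (pool and gateway names are placeholders, and the API versions may differ by release) might look like:

```yaml
# Illustrative only: splits traffic 90/10 across two pools.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-canary
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-pool-stable
      weight: 90            # ~90% of requests
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-pool-canary
      weight: 10            # ~10% of requests
```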

March 2025

Version number: v1.3.0-aliyun.1

Modification time: March 12, 2025

Changes:

  • Supports Gateway API v1.2.

  • Supports Inference Extension, which provides intelligent load balancing for LLM inference scenarios.

Impact: This upgrade does not affect your services.