TensorRT-LLM
TensorRT-LLM is NVIDIA's open-source LLM inference framework, released in late 2023, that compiles transformer models into optimized TensorRT engines built from highly tuned CUDA kernels, targeting the lowest possible latency on NVIDIA GPUs. The framework supports advanced optimizations including in-flight batching (NVIDIA's equivalent of continuous batching), paged attention, speculative decoding, INT8 SmoothQuant, FP8 (on Hopper and Blackwell), and FlashAttention. TensorRT-LLM delivers lower single-request latency than vLLM on equivalent hardware in most benchmarks, at the cost of more complex deployment: models must be compiled to TensorRT engines for each specific GPU type and configuration. The framework is the foundation of NVIDIA's NIM (NVIDIA Inference Microservices) packaging and is used by enterprise customers including Snowflake, Cisco, ServiceNow, and many others. TensorRT-LLM is the natural choice when minimum latency on NVIDIA hardware matters most. AI governance teams pair TensorRT-LLM with governance layers like Centralpoint for token metering, audit logging, and policy enforcement.
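For teams evaluating the framework, the snippet below is a minimal sketch of the high-level Python LLM API shipped in recent TensorRT-LLM releases; it pulls a Hugging Face checkpoint and builds a TensorRT engine for the local GPU at construction time. The model name and sampling settings are illustrative, not recommendations.

```python
from tensorrt_llm import LLM, SamplingParams

def main():
    prompts = [
        "Explain in one sentence what a TensorRT engine is.",
        "The capital of France is",
    ]
    # Illustrative sampling settings; tune for your workload.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Constructing LLM downloads the checkpoint and compiles a TensorRT
    # engine for the GPU it runs on (per-GPU compilation is why engines
    # are not portable across device types).
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    # Requests are scheduled with in-flight (continuous) batching.
    for output in llm.generate(prompts, sampling_params):
        print(f"Prompt: {output.prompt!r}")
        print(f"Completion: {output.outputs[0].text!r}")

if __name__ == "__main__":
    main()
```

Production deployments typically build the engine once per GPU type ahead of time and serve it behind an API server rather than compiling at startup.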
TensorRT-LLM endpoints in Centralpoint: Centralpoint sits in front of TensorRT-LLM endpoints alongside vLLM, cloud APIs, and other inference backends in a model-agnostic stack. The platform meters tokens, keeps prompts local, supports generative and embedding models, and deploys chatbots through one line of JavaScript with full audit trails.
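As a rough illustration of that gateway pattern, the sketch below sends a request to a TensorRT-LLM backend through an OpenAI-compatible gateway using the standard openai Python client. The gateway URL, API key, and model name are placeholders rather than documented Centralpoint values, and the backend side assumes a server exposing the OpenAI-compatible API (recent TensorRT-LLM releases include a trtllm-serve command for this).

```python
from openai import OpenAI

# Placeholder gateway URL and key: the governance layer's real endpoint and
# auth scheme are not documented here, so these values are illustrative only.
client = OpenAI(
    base_url="https://gateway.example.com/v1",
    api_key="YOUR_GATEWAY_KEY",
)

# A standard OpenAI chat-completions call; a gateway in this position can
# meter tokens and write an audit-log entry before forwarding the request
# to the TensorRT-LLM server behind it.
response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # whatever name the backend registers
    messages=[{"role": "user", "content": "Summarize the deployment runbook."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```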
Related Keywords:
TensorRT-LLM