High Performance Inferencing for LLMs
Dr. Ravishankar Ravindran: "High Performance Inferencing for LLMs"
Inferencing has become ubiquitous across cloud, regional, edge, and device environments, powering a wide spectrum of AI use cases spanning vision, language, and traditional machine learning applications. In recent years, Large Language Models (LLMs), initially developed for natural language tasks, have expanded to multimodal applications including vision, speech, reasoning, and planning, each demanding distinct service-level objectives (SLOs). Achieving high-performance inferencing for such diverse workloads requires both model-level and system-level optimizations.
This talk focuses on system-level optimization techniques that maximize token throughput, meet user-experience metrics, and improve inference service-provider efficiency. We review several recent innovations including KV caching, Paged/Flash/Radix Attention, Speculative Decoding, P/D (prefill/decode) Disaggregation, and KV Routing, and explain how these mechanisms enhance performance by reducing latency, memory footprint, and compute overhead. These techniques are implemented in leading open-source inference frameworks such as vLLM, SGLang, Hugging Face TGI, and NVIDIA NIM, which form the backbone of large-scale public and private LLM serving platforms.
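To make the first of these techniques concrete, the following is a minimal NumPy sketch of KV caching in autoregressive decoding (illustrative only; the function names and toy single-head attention are our own, not any framework's API). Without a cache, every decode step re-projects keys and values for the entire prefix; with a cache, each token is projected exactly once and appended.

```python
import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector q."""
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

def decode_no_cache(Wq, Wk, Wv, xs):
    """Naive decoding: re-project K/V for the entire prefix at every step."""
    outs = []
    for t in range(1, len(xs) + 1):
        prefix = np.stack(xs[:t])
        K, V = prefix @ Wk, prefix @ Wv        # O(t) projections, every step
        outs.append(attend(xs[t - 1] @ Wq, K, V))
    return outs

def decode_with_cache(Wq, Wk, Wv, xs):
    """KV caching: each token's K/V is projected once and appended."""
    K_cache, V_cache, outs = [], [], []
    for x in xs:
        K_cache.append(x @ Wk)                 # one new projection per step
        V_cache.append(x @ Wv)
        outs.append(attend(x @ Wq, np.stack(K_cache), np.stack(V_cache)))
    return outs
```

Both paths produce identical outputs; the cached path simply trades memory (the growing K/V tensors) for compute, which is exactly the trade-off that Paged Attention then manages by storing the cache in fixed-size blocks.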
The use of GPU training, inference, and analysis clusters with Multi-Instance GPUs (MIG), and of federated models with QML applications, has now become practical.
Attendees will gain a practical understanding of the challenges in delivering scalable, low-latency LLM inference, and of the architectural and algorithmic innovations driving next-generation high-performance inference systems.
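Speculative decoding, mentioned above, can likewise be sketched in a few lines. This is a toy greedy version (our own illustration, not a framework implementation); `draft_next` and `target_next` are hypothetical stand-ins for a cheap draft model and the full target model, each mapping a token context to the next token.

```python
def speculative_decode(draft_next, target_next, prompt, k=4, max_new=16):
    """Toy greedy speculative decoding.

    A cheap draft model proposes k tokens; the target model verifies them
    left to right, keeping the longest agreeing prefix and substituting its
    own token at the first mismatch. The output is token-for-token identical
    to greedy decoding with the target model alone; the speedup on real
    hardware comes from verifying all k proposals in one batched pass.
    """
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        ctx, proposal = list(out), []
        for _ in range(k):                     # draft proposes k tokens
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        for t in proposal:                     # target verifies each one
            if target_next(out) == t:
                out.append(t)                  # accepted "free" token
            else:
                out.append(target_next(out))   # mismatch: target's token
                break
        else:
            out.append(target_next(out))       # all accepted: bonus token
    return out[:len(prompt) + max_new]
```

Because every appended token is always what the target model would have emitted at that position, the output distribution is unchanged; only the number of expensive target-model invocations per generated token drops when the draft agrees often.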
Date and Time
Location
Hosts
Registration
eMerging Open Tech Foundation (eOTF) is a private education and skills-development entity registered with the Ministry of Corporate Affairs (MCA.gov.in) to support Bharat's (India's) march to global leadership. This Professional Continuing Education program is organized in collaboration with the INGR Quantum IT and INGR AIML working groups.
- Co-sponsored by eMerging Open Tech Foundation
- Survey: Fill out the survey
Speakers
Dr. Ravishankar Ravindran (eOTF)
Biography:
Ravishankar Ravindran has over 24 years of experience contributing to advanced data networking products and research. He currently serves in an advisory role to eOTF as Technical Director. Prior to this, as Telco Architect at F5, he led the system architecture and design for F5’s Telco Cloud platform, supporting 5G vRAN and Core workloads. His work included active participation in standards development, particularly in the O-RAN Alliance's Working Group 6 (Cloud Architecture and Orchestration), and contributions to the Nephio project under Linux Foundation Networking (LFN), focusing on Kubernetes-based domain orchestration for Telco use cases spanning RAN, Core, and Transport networks.
Previously, as Chief Architect at Corning Inc., he focused on the architecture and design of disaggregated RAN (CU/DU) for third-party cloud platforms and contributed to the integration of vDU with third-party O-RUs based on O-RAN’s Open Fronthaul (O-FH) specifications, with a focus on the M-plane. Prior to that, he served as Chief Architect at Sterlite Technologies (STL), where he was responsible for the end-to-end design of multi-tier RAN Intelligent Controllers (RICs), aimed at optimizing large-scale RAN systems through xApps such as Mobile Load Balancing, Traffic Steering, and Dynamic Spectrum Sharing.
Before this, he led the Future and Network Theory Lab at Futurewei (Huawei Technologies) as a Principal Researcher, focusing on efficient networking for cloud robotics, autonomous vehicles, and drone systems. His research emphasized next-generation networking requirements, including information-centric networking (ICN), software-defined networking (SDN), and network virtualization—particularly addressing challenges in mobility, content distribution, and content-centric routing protocols.
Prior to this role, he was part of the CTO Office at Nortel, where he was a member of the Advanced Technology Group, working on research areas such as control plane routing protocols for IP/(G)MPLS, L2/L3 Virtualization services, scheduling problems in 4G wireless, and end-to-end QoE/QoS engineering for multimedia services. He later served as a Technology Advisor at Avaya.
Ravindran has been an active contributor to numerous standardization bodies including the IETF, ITU, ATIS, the O-RAN Alliance, and LFN’s Nephio. He participated in the ITU’s Focus Group on IMT-2020, helping to define early standards for 5G. He holds a Ph.D. in Electrical Engineering from Carleton University, has served as an editor for the Springer Photonic Network Communications (PNET) journal, and has been part of technical program committees for top-tier conferences. Ravindran is a (co-)inventor on over 90 granted and filed U.S. patents (with additional patents pending) and has authored over 50 peer-reviewed papers in IEEE and ACM venues. His research and patents have been cited over 4,000 times, according to his Google Scholar profile (https://scholar.google.com/).
Agenda
- Introduction to INGR with AIML, QIT & SVQAC working groups (Prakash Ramchandran) - 10 min
- High Performance Inferencing for LLMs, by Dr. Ravishankar Ravindran (Tech. Director, eOTF - Advisory) - 60 min
- Q&A - 20 min