High Performance Inferencing for LLMs

#LLM #HPC #Inferencing #Optimization

Dr. Ravishankar Ravindran: "High Performance Inferencing for LLMs"


Inferencing has become ubiquitous across cloud, regional, edge, and device environments, powering a wide spectrum of AI use cases spanning vision, language, and traditional machine learning applications. In recent years, Large Language Models (LLMs), initially developed for natural language tasks, have expanded to multimodal applications including vision, speech, reasoning, and planning, each demanding distinct service-level objectives (SLOs). Achieving high-performance inferencing for such diverse workloads requires both model-level and system-level optimizations.

This talk focuses on system-level optimization techniques that maximize token throughput, meet user-experience metrics, and improve inference service-provider efficiency. We review several recent innovations, including KV caching, Paged/Flash/Radix Attention, Speculative Decoding, prefill/decode (P/D) Disaggregation, and KV Routing, and explain how these mechanisms enhance performance by reducing latency, memory footprint, and compute overhead. These techniques are implemented in leading open-source inference frameworks such as vLLM, SGLang, Hugging Face TGI, and NVIDIA NIM, which form the backbone of large-scale public and private LLM serving platforms.
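
To make the KV-caching idea concrete, the toy single-head sketch below (plain NumPy, with invented weights and dimensions) shows why a cache helps: each decode step projects only the newest token into keys and values and appends to the cache, instead of re-projecting the entire prefix. Production engines keep this cache in paged GPU memory, which is what PagedAttention manages.

    # Toy KV-cache sketch for autoregressive decoding (illustrative only).
    import numpy as np

    def attention(q, K, V):
        # Standard scaled dot-product attention for a single query vector.
        scores = K @ q / np.sqrt(q.shape[-1])    # (t,)
        weights = np.exp(scores - scores.max())  # softmax, numerically stable
        weights /= weights.sum()
        return weights @ V                       # (d,)

    d = 64
    rng = np.random.default_rng(0)
    # Hypothetical projection matrices; a real model has learned weights.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

    K_cache, V_cache = [], []
    x = rng.standard_normal(d)                   # embedding of the current token
    for step in range(8):
        # Without a cache, every step would re-project K and V for ALL
        # previous tokens; with the cache, each step adds O(1) projections.
        K_cache.append(x @ Wk)
        V_cache.append(x @ Wv)
        out = attention(x @ Wq, np.stack(K_cache), np.stack(V_cache))
        x = out                                  # stand-in for the next token's embedding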

The use of GPU training, inference, and analysis clusters with Multi-Instance GPUs (MIG), and of federated models with quantum machine learning (QML) applications, has now become practical.

Attendees will gain a practical understanding of the challenges in delivering scalable, low-latency LLM inference, and of the architectural and algorithmic innovations driving next-generation high-performance inference systems.
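
As a concrete illustration of the frameworks named above, the minimal vLLM offline-generation sketch below shows the user-facing side of such an engine; the model name and sampling values are placeholders, and optimizations such as PagedAttention and continuous batching are applied transparently under the hood.

    # Minimal offline-generation sketch with vLLM (placeholder model/settings).
    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")    # any Hugging Face-compatible model id
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    prompts = ["Explain KV caching in one sentence."]
    for output in llm.generate(prompts, params):
        print(output.prompt, "->", output.outputs[0].text)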

 



  Date and Time

  Location

  Hosts

  Registration




  • eMerging Open Tech Foundation is a private education and skills development entity registered with the Ministry of Corporate Affairs (MCA.gov.in), supporting Bharat's (India's) march to global leadership. This professional continuing education program is organized in collaboration with the INGR Quantum IT and INGR AIML working groups.

  • Co-sponsored by eMerging Open Tech Foundation
  • Survey: Fill out the survey
  • Starts 20 October 2025 07:00 AM UTC
  • Ends 01 November 2025 07:00 AM UTC
  • No Admission Charge


  Speakers

Dr. Ravishankar Ravindran of eOTF

Topic:

High Performance Inferencing for LLMs


Biography:


Ravishankar Ravindran has over 24 years of experience contributing to advanced data networking products and research. He is an accomplished professional, currently serving in an advisory role to eOTF as Technical Director. Prior to this, as Telco Architect at F5, he led the system architecture and design for F5’s Telco Cloud platform, supporting 5G vRAN and Core workloads. His work included active participation in standards development, particularly in the O-RAN Alliance's Working Group 6 (Cloud Architecture and Orchestration), and contributions to the Nephio project under Linux Foundation Networking (LFN), focusing on Kubernetes-based domain orchestration for Telco use cases spanning RAN, Core, and Transport networks.
Previously, as Chief Architect at Corning Inc., he focused on the architecture and design of disaggregated RAN (CU/DU) for third-party cloud platforms and contributed to the integration of vDU with third-party O-RUs based on O-RAN’s Open Fronthaul (O-FH) specifications, with a focus on the M-plane. Prior to that, he served as Chief Architect at Sterlite Technologies (STL), where he was responsible for the end-to-end design of multi-tier RAN Intelligent Controllers (RICs), aimed at optimizing large-scale RAN systems through xApps such as Mobile Load Balancing, Traffic Steering, and Dynamic Spectrum Sharing.
Before this, he led the Future and Network Theory Lab at Futurewei (Huawei Technologies) as a Principal Researcher, focusing on efficient networking for cloud robotics, autonomous vehicles, and drone systems. His research emphasized next-generation networking requirements, including information-centric networking (ICN), software-defined networking (SDN), and network virtualization—particularly addressing challenges in mobility, content distribution, and content-centric routing protocols.
Prior to this role, he was part of the CTO Office at Nortel, where he was a member of the Advanced Technology Group, working on research areas such as control plane routing protocols for IP/(G)MPLS, L2/L3 Virtualization services, scheduling problems in 4G wireless, and end-to-end QoE/QoS engineering for multimedia services. He later served as a Technology Advisor at Avaya.
Ravindran has been an active contributor to numerous standardization bodies including the IETF, ITU, ATIS, the O-RAN Alliance, and LFN’s Nephio. He participated in the ITU’s Focus Group on IMT-2020, helping to define early standards for 5G. He holds a Ph.D. in Electrical Engineering from Carleton University, has served as an editor for the Springer Photonic Network Communications (PNET) journal, and has been part of technical program committees for top-tier conferences. Ravindran is a (co-)inventor on over 90 granted and filed U.S. patents (with additional patents pending) and has authored over 50 peer-reviewed papers in IEEE and ACM venues. His research and patents have been cited over 4,000 times, according to his Google Scholar profile (https://scholar.google.com/citations?user=v_8yeKYAAAAJ&hl=en).





Agenda

  1. Introduction to INGR with the AIML, QIT, and SVQAC working groups (Prakash Ramchandran) - 10 min
  2. High Performance Inferencing for LLMs - Dr. Ravishankar Ravindran (Tech. Director, eOTF - Advisory) - 60 min
  3. Q&A - 20 min