BEGIN:VCALENDAR
VERSION:2.0
PRODID:IEEE vTools.Events//EN
CALSCALE:GREGORIAN
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
BEGIN:DAYLIGHT
DTSTART:20250309T020000
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
RRULE:FREQ=YEARLY;BYDAY=2SU;BYMONTH=3
TZNAME:PDT
END:DAYLIGHT
BEGIN:STANDARD
DTSTART:20251102T020000
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
RRULE:FREQ=YEARLY;BYDAY=1SU;BYMONTH=11
TZNAME:PST
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260124T044656Z
UID:09A5B7BD-9F27-4688-A6FA-3E6246AA8758
DTSTART;TZID=America/Los_Angeles:20251101T090000
DTEND;TZID=America/Los_Angeles:20251101T103000
DESCRIPTION:Inferencing has become ubiquitous across cloud\, regional\, edg
 e\, and device environments\, powering a wide spectrum of AI use cases spa
 nning vision\, language\, and traditional machine learning applications. I
 n recent years\, Large Language Models (LLMs)\, initially developed for na
 tural language tasks\, have expanded to multimodal applications including 
 vision\, speech\, reasoning\, and planning\, each demanding distinct serv
 ice-level objectives (SLOs). Achieving high-performance inferencing for s
 uch diverse workloads requires both model-level and system-level optimiz
 ations.\n\nThis talk focuses on system-level optimization techniques tha
 t maximize token throughput\, meet user-experience metrics\, and improv
 e inference service-provider efficiency. We review several recent innova
 tions\, including KV caching\, Paged/Flash/Radix Attention\, Speculativ
 e Decoding\, P/D Disaggregation\, and KV Routing\, and explain how thes
 e mechanisms enhance performance by reducing latency\, memory footprint
 \, and compute overhead. These techniques are implemented in leading ope
 n-source inference frameworks such as vLLM\, SGLang\, Hugging Face TGI
 \, and NVIDIA NIM\, which form the backbone of large-scale public and pr
 ivate LLM serving platforms.\n\nThe use of GPU Training\, Inference\, an
 d Analysis clusters with Multi-Instance GPUs (MIG)\, and Federated Model
 s with QML applications\, has now become practical.\n\nAttendees will ga
 in a practical understanding of the challenges in delivering scalable
 \, low-latency LLM inference\, and of the architectural and algorithmic i
 nnovations driving next-generation high-performance inference systems.\n
 \nCo-sponsored by: eMerging Open Tech Foundation\n\nSpeaker(s): Ravishan
 kar Ravindran\n\nAgenda:\n- Introduction to INGR with AIML & QIT & SVQA
 C working groups (Prakash Ramchandran) - 10 mts\n- High Performance Infe
 rencing for LLMs - By Dr. Ravishankar Ravindran (Tech. Director\, eOTF - A
 dvisory) - 60 mts\n- Q&A - 20 mts\n\nVirtual: https://events.vtools.ieee
 .org/m/508671
LOCATION:Virtual: https://events.vtools.ieee.org/m/508671
ORGANIZER:mailto:c.polk@comsoc.org
SEQUENCE:155
SUMMARY:High Performance Inferencing for LLMs
URL;VALUE=URI:https://events.vtools.ieee.org/m/508671
X-ALT-DESC:Description: &lt;br /&gt;&lt;p dir=&quot;ltr&quot;&gt;Inferencing has become ubiquitou
 s across cloud\, regional\, edge\, and device environments\, powering a wi
 de spectrum of AI use cases spanning vision\, language\, and traditional m
 achine learning applications. In recent years\, Large Language Models (LLM
 s)\, initially developed for natural language tasks\, have expanded to mul
 timodal applications including vision\, speech\, reasoning\, and plann
 ing\, each demanding distinct service-level objectives (SLOs). Achievin
 g high-performance inferencing for such diverse workloads requires bot
 h model-level and system-level optimizations.&lt;/p&gt;\n&lt;p dir=&quot;
 ltr&quot;&gt;This talk focuses on system-level optimization techniques t
 hat maximize token throughput\, meet user-experience metrics\, and impr
 ove inference service-provider efficiency. We review several recent inn
 ovations\, including KV caching\, Paged/Flash/Radix Attention\, Specula
 tive Decoding\, P/D Disaggregation\, and KV Routing\, and explain how t
 hese mechanisms enhance performance by reducing latency\, memory footpr
 int\, and compute overhead. These techniques are implemented in leadin
 g open-source inference frameworks such as vLLM\, SGLang\, Hugging Fac
 e TGI\, and NVIDIA NIM\, which form the backbone of large-scale publi
 c and private LLM serving platforms.&lt;/p&gt;\n&lt;p dir=&quot;ltr&quot;
 &gt;The use of GPU Training\, Inference\, and Analysis clusters with Mu
 lti-Instance GPUs (MIG)\, and Federated Models with QML applications
 \, has now become practical.&lt;/p&gt;\n&lt;p dir=&quot;ltr&quot;&gt;Att
 endees will gain a practical understanding of the challenges in deliver
 ing scalable\, low-latency LLM inference\, and of the architectural an
 d algorithmic innovations driving next-generation high-performance infe
 rence systems.&lt;/p&gt;\n&lt;p&gt;Agenda:&lt;/p&gt;\n&lt;ol&gt;\n&lt;li
 &gt;Introduction to INGR with AIML &amp;amp\; QIT &amp;amp\; SVQAC work
 ing groups (Prakash Ramchandran) - 10 mts&lt;/li&gt;\n&lt;li&gt;High Per
 formance Inferencing for LLMs - By Dr. Ravishankar Ravindran (Tech. Dir
 ector\, eOTF - Advisory) - 60 mts&lt;/li&gt;\n&lt;li&gt;Q&amp;amp\;A - 2
 0 mts&lt;/li&gt;\n&lt;/ol&gt;
END:VEVENT
END:VCALENDAR