BEGIN:VCALENDAR
VERSION:2.0
PRODID:IEEE vTools.Events//EN
CALSCALE:GREGORIAN
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
BEGIN:DAYLIGHT
DTSTART:20260308T030000
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
RRULE:FREQ=YEARLY;BYDAY=2SU;BYMONTH=3
TZNAME:PDT
END:DAYLIGHT
BEGIN:STANDARD
DTSTART:20251102T010000
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
RRULE:FREQ=YEARLY;BYDAY=1SU;BYMONTH=11
TZNAME:PST
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20251231T215718Z
UID:1065F907-B515-4AFD-AFCF-4A7524B8F039
DTSTART;TZID=America/Los_Angeles:20251211T180000
DTEND;TZID=America/Los_Angeles:20251211T190000
DESCRIPTION:Inferencing has become ubiquitous across cloud\, regional
 \, edge\, and device environments\, powering a wide spectrum of AI use cas
 es spanning vision\, language\, and traditional machine learning applicati
 ons. In recent years\, Large Language Models (LLMs)\, initially developed 
 for natural language tasks\, have expanded to multimodal applications incl
 uding vision\, speech\, reasoning\, and planning\, each demanding
  distinct servi
 ce-level objectives (SLOs). Achieving high-performance inferencing for suc
 h diverse workloads requires both model-level and system-level optimizatio
 ns.\n\nThis talk focuses on system-level optimization techniques that maxi
 mize token throughput\, meet user-experience metrics\, and improve
  inference ser
 vice-provider efficiency. We review several recent innovations including K
 V caching\, Paged/Flash/Radix Attention\, Speculative Decoding\, P/D Disag
 gregation\, KV Routing and Parallelism\, and explain how these mechanisms 
 enhance performance by reducing latency\, memory footprint\, and compute o
 verhead. These techniques are implemented in leading open-source inference
  frameworks such as vLLM\, SGLang\, Hugging Face TGI\, and NVIDIA's Tens
 orRT-LLM\, which form the backbone of large-scale public and private LLM s
 erving platforms.\n\nAttendees will gain a practical understanding of the 
 challenges in delivering scalable\, low-latency LLM inference\, and of the
  architectural and algorithmic innovations driving next-generation high-pe
 rformance inference systems.\n\nVirtual: https://events.vtools.ieee.org/m/
 516797
LOCATION:Virtual: https://events.vtools.ieee.org/m/516797
ORGANIZER:mailto:westphal@ieee.org
SEQUENCE:38
SUMMARY:High Performance Inferencing for LLMs
URL;VALUE=URI:https://events.vtools.ieee.org/m/516797
X-ALT-DESC:Description: <br /><p><img src="https://events.vtools.ieee.org/
 vtools_ui/media/display/68b5b524-5125-432c-af53-a8a9f76faa3a" alt=""
  width="600" height="338"></p>\n<p>Inferencing has become ubiquitous
  across cloud\, regional\, edge\, and device environments\, powering a
  wide spectrum of AI use cases spanning vision\, language\, and
  traditional machine learning applications. In recent years\, Large
  Language Models (LLMs)\, initially developed for natural language
  tasks\, have expanded to multimodal applications including vision\,
  speech\, reasoning\, and planning\, each demanding distinct
  service-level objectives (SLOs). Achieving high-performance
  inferencing for such diverse workloads requires both model-level and
  system-level optimizations.</p>\n<p>This talk focuses on system-level
  optimization techniques that maximize token throughput\, meet
  user-experience metrics\, and improve inference service-provider
  efficiency. We review several recent innovations including KV
  caching\, Paged/Flash/Radix Attention\, Speculative Decoding\, P/D
  Disaggregation\, KV Routing and Parallelism\, and explain how these
  mechanisms enhance performance by reducing latency\, memory
  footprint\, and compute overhead. These techniques are implemented in
  leading open-source inference frameworks such as vLLM\, SGLang\,
  Hugging Face TGI\, and NVIDIA&rsquo\;s TensorRT-LLM\, which form the
  backbone of large-scale public and private LLM serving
  platforms.</p>\n<p>Attendees will gain a practical understanding of
  the challenges in delivering scalable\, low-latency LLM inference\,
  and of the architectural and algorithmic innovations driving
  next-generation high-performance inference systems.</p>
END:VEVENT
END:VCALENDAR

