LIVE ON THURS, 19th SEPT 2024

Bringing Observability to Complex AI Platforms and Models

  • 8:30 PM (IST)
  • 11:00 AM (EDT)
  • 8:00 AM (PDT)
  • 45 minutes

About the webinar

With 99% of Fortune 500 companies and 31% of SMBs already using AI, and many more planning to adopt it, failures could trigger cascading effects. Organizations therefore need to monitor the health of their AI platforms closely. However, monitoring AI platforms and models is far more complex than traditional monitoring. From workloads and the ML lifecycle to infrastructure and hardware, every component is built differently and poses unique challenges such as multi-stage pipelines, dynamic workloads, model drift, and hardware temperature management.

Observability tools like Nvidia DCGM, OpenTelemetry, and Prometheus can help you implement an observability stack. However, identifying the influential metrics, logs, and traces is key to uncovering the factors that determine AI platforms’ health. By shedding light on GPU utilization, performance, model drift, and LLM accuracy, we can optimize GPU resource sharing and extend the lifespan of AI platforms.

Join the webinar to learn how to overcome these observability challenges and effectively monitor AI platforms and models deployed on Kubernetes. We will also share our AI Stack deployed on Kubernetes, showcasing how open source observability tools like Prometheus, Grafana, and Nvidia DCGM can monitor real-time metrics from our GPU clusters, inference servers, and embedding servers.

What to expect

  • How to monitor AI platforms and models: Discover and overcome the observability challenges of complex AI platforms and implement comprehensive AI monitoring solutions.
  • Essential metrics and data: Find the crucial metrics, logs, and traces necessary for effective AI monitoring.
  • Longevity of AI platform: Add more years to your AI platform by preventing model drift and maintaining accuracy and peak performance through continuous, proactive monitoring.
  • GPU utilization and resource sharing: Use observability to discover GPU utilization patterns and improve resource sharing.
  • Hands-on demo: See a live demonstration of our AI Stack on Kubernetes and the observability solution we implemented.
  • Actionable insights by experts: Get actionable advice on implementing a comprehensive AI monitoring solution to overcome challenges like complex multi-stage pipelines, dynamic workloads, model drift, and hardware temperature management.

Who should attend this webinar?

  • AI/ML engineers: Professionals who deploy and maintain AI models, looking to learn best practices for monitoring AI platforms and preventing performance issues.
  • DevOps and SREs: Teams responsible for infrastructure management, seeking insights on optimizing GPU usage and monitoring AI workloads on Kubernetes.
  • AI Platform Teams: Engineers managing AI platforms who want to overcome observability challenges like dynamic workloads, multi-stage pipelines, and model drift.
  • Cloud and AI Solution Architects: Architects who want to integrate observability tools like Prometheus, Nvidia DCGM, and Grafana into AI platform architectures for comprehensive monitoring.

Meet the Speakers

Atulpriya Sharma
Sr. Dev Advocate @ InfraCloud
Host

Manual tester turned developer advocate. Atul talks about Cloud Native, Kubernetes, AI & MLOps to help other developers and organizations adopt cloud native. He is also a CNCF Ambassador and the organizer of CNCF Hyderabad.

Aman Juneja
Principal Solutions Engineer @ InfraCloud
Speaker

Aman specializes in AI Cloud solutions and cloud native design, bringing extensive expertise in containerization, microservices, and serverless computing. His current focus lies in exploring AI Cloud technologies and developing AI applications using cloud native architectures.

Vishal Biyani
CTO & Founder @ InfraCloud
Speaker

Vishal is an engineer who loves helping companies transform their business by using technology and coaching people. He is a contributor to Fission (Fast and Simple Serverless Functions for Kubernetes) and is the organizer of the “Pune Kubernetes & CNCF Meetup”.