Cloud Native Computing Foundation / Kubernetes Batch + HPC Day EU 2022

These are all the meetings we have in "Kubernetes Batch + HPC Day EU 2022" (part of the organization "Cloud Native Computing Foundation"). Click into individual meeting pages to watch the recording and search or read the transcript.

19 May 2022

Apache YuniKorn: A Kubernetes Scheduler Plugin for Batch Workloads - Wilfred Spiegelenburg, Cloudera & Craig Condit, Cloudera

Kubernetes has historically focused on service-type workloads. Stateful workloads have also become better supported in recent releases, but batch scheduling continues to lag in Kubernetes core. To better support batch scheduling, several alternative schedulers have been created, including Apache YuniKorn, which has a growing community and is utilised by several large organisations such as Alibaba, Apple, and Cloudera. Over the past few years, Apache YuniKorn has matured into a highly performant, flexible workload scheduler. Recently, we have enhanced Apache YuniKorn with a new execution mode that allows its full power and flexibility to be deployed as a set of plugins to the default Kubernetes scheduler, letting service and batch workloads coexist seamlessly. This session will dive into using Apache YuniKorn to schedule batch workloads, leveraging advanced options such as workload queueing and quota sharing without affecting traditional non-batch Kubernetes workloads.
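A minimal sketch of what queue placement looks like from the workload side, assuming YuniKorn's commonly documented "applicationId" and "queue" pod labels; the queue name root.batch, the pod, and the image are hypothetical. In the plugin mode described above, the explicit schedulerName would normally be dropped, since the default scheduler binary itself carries the YuniKorn plugins.

import sys
import yaml  # PyYAML

# Pod manifest labeled for a YuniKorn queue (hypothetical names throughout).
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "batch-task-1",
        "labels": {
            "applicationId": "batch-app-0001",  # groups pods into one application
            "queue": "root.batch",              # hypothetical YuniKorn queue
        },
    },
    "spec": {
        "schedulerName": "yunikorn",  # standalone mode only; omit in plugin mode
        "containers": [{
            "name": "worker",
            "image": "busybox",
            "command": ["sh", "-c", "echo working; sleep 30"],
            "resources": {"requests": {"cpu": "1", "memory": "512Mi"}},
        }],
    },
}

# Emit YAML suitable for `kubectl apply -f -`.
yaml.safe_dump(pod, sys.stdout, sort_keys=False)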
  • 4 participants
  • 30 minutes
apache
server
patch
workloads
batches
api
setups
premise
manage
sharing

19 May 2022

Closing - Aldo Culquicondor, Kubernetes Batch + HPC Day Program Committee Member
  • 1 participant
  • 3 minutes
kubernetes
forums
discussions
session
welcoming
contributors
reception
thank
users
googlers

19 May 2022

Efficient Deep Learning Training with Ludwig AutoML, Ray, and Nodeless Kubernetes - Anne Marie Holler, Elotl & Travis Addair, Predibase

Deep Learning (DL) has been successfully applied to many fields, including computer vision, natural language, business, and science. The open-source platforms Ray and Ludwig make DL accessible to diverse users by reducing complexity barriers to training, scaling, deploying, and serving DL models. However, DL’s cost and operational overhead present significant challenges. DL model dev/test/tuning requires intermittent use of substantial GPU resources, which cloud vendors are well positioned to provide, though at non-trivial prices. Given the expense, managing GPU resources is critical to the practical use of DL. This talk describes running Ray and Ludwig on cloud Kubernetes clusters, using Nodeless K8s to add right-sized GPU resources when they are needed and to remove them when not. Experiments comparing the cost and operational overhead of using Nodeless K8s versus running directly on EC2 show sizable improvements in efficiency and usability.
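A rough sketch of what driving such a run looks like in code, based on Ludwig's AutoML entry point of that era; the exact auto_train signature varies across Ludwig releases, and the dataset path, target column, and time budget below are hypothetical.

import ray
from ludwig.automl import auto_train

# Connect to the existing Ray cluster; on Nodeless K8s, the cluster's GPU
# nodes are provisioned on demand as Ray requests resources.
ray.init(address="auto")

results = auto_train(
    dataset="s3://my-bucket/train.csv",  # hypothetical dataset location
    target="label",                      # hypothetical target column
    time_limit_s=3600,                   # cap the GPU time budget at one hour
)

# Inspect the outcome; result attribute names vary by Ludwig version.
print(results)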
  • 5 participants
  • 28 minutes
kubernetes
advanced
tensorflow
gpu
automation
cloud
resources
ray
scalability
workloads

19 May 2022

Fast Data on-Ramp with Apache Pulsar on K8 - Timothy Spann, StreamNative

As the Apache Pulsar community grows, more and more connectors are being added. To enhance the availability of sources and sinks and to make use of the greater Apache streaming community, joining forces between Apache NiFi and Apache Pulsar is a perfect fit.

Apache NiFi also adds the benefits of ELT, ETL, data crunching, transformation, validation, and batch data processing. Once data is ready to be an event, NiFi can launch it into Pulsar at light speed. I will walk through how to get started, cover some use cases and demos, answer questions, and discuss the benefits to the ecosystem.

https://www.datainmotion.dev/
https://github.com/tspannhw
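For a sense of the Pulsar side of this on-ramp, here is a minimal producer sketch using the pulsar-client Python library; the service URL and topic are placeholders, and in the talk's setup a NiFi processor would be publishing instead of hand-written code.

import pulsar

# Connect to a Pulsar broker (placeholder URL) and create a producer on a
# placeholder topic.
client = pulsar.Client("pulsar://localhost:6650")
producer = client.create_producer("persistent://public/default/events")

# Publish one prepared event; NiFi's Pulsar connector does the same at scale
# after its ELT/ETL and validation steps.
producer.send("sensor reading ready for streaming".encode("utf-8"))

client.close()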
  • 2 participants
  • 14 minutes
microservices
api
kubernetes
workflows
computing
data
apps
infrastructures
client
pod

19 May 2022

Get More Computing Power by Helping the OS Scheduler - Antti Kervinen, Intel & Alexander Kanevskiy, Intel

When Linux schedules a thread on a CPU core, there is no guarantee which memories the thread will access. If the workload is lucky, the thread will use data that is already in CPU caches or in memory that is close to the CPU core. If not, millions of memory operations need to travel a longer way to reach physical memory. While this may sound too low-level to control, you can easily help the scheduler when running Kubernetes workloads, and it makes a big difference! Antti and Sasha show how to get a lot more computing power out of your CPUs by adding CRI Resource Manager (CRI-RM) to your Kubernetes nodes. CRI-RM affects process scheduling and memory locality by dynamically managing the CPU and memory pinning of all Kubernetes containers on the node. In case studies, CRI-RM has delivered major improvements in database and AI training performance without any workload-specific configuration or changes to upstream Kubernetes components.
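CRI-RM applies its own node-side policies, but the workload shape that pinning mechanisms place most predictably is a Guaranteed-QoS pod with whole-CPU requests. Below is a minimal sketch of one (image and sizes hypothetical); this illustrates the pod side, not CRI-RM's own configuration.

import sys
import yaml  # PyYAML

# Requests equal to limits, with integer CPUs, give the pod Guaranteed QoS:
# the shape that node-level CPU/memory pinning can place most predictably.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "pinned-worker"},
    "spec": {
        "containers": [{
            "name": "compute",
            "image": "example/training:latest",  # hypothetical image
            "resources": {
                "requests": {"cpu": "4", "memory": "8Gi"},
                "limits": {"cpu": "4", "memory": "8Gi"},
            },
        }],
    },
}

# Emit YAML suitable for `kubectl apply -f -`.
yaml.safe_dump(pod, sys.stdout, sort_keys=False)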
  • 3 participants
  • 20 minutes
kubernetes
cpus
computing
processors
cache
protocol
scheduling
workloads
cluster
containers

19 May 2022

How to Handle Fair Scheduling in a Private Academic K8s Infrastructure - Lukas Hejtmanek, Masaryk University & Dalibor Klusacek, CESNET

While the usefulness of container-oriented computing is widely recognized, its adoption in academic environments is not so straightforward. Existing orchestrators like Kubernetes are not primarily designed to support fair execution of (bursty) workloads belonging to various researchers and/or competing projects. While public providers use an efficient pay-per-use model, academic use cases often expect the traditional fair-sharing mechanisms widely available in current HPC installations. This talk will discuss the challenges related to the application of containerized computing within the K8s-operated infrastructure used by various users and research groups at CERIT-SC. Specifically, we will discuss how CERIT-SC guarantees that eligible pods will be executed in a reasonable time frame, making sure that the running pods of other users will eventually free their allocations to guarantee fair use of available resources.
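One generic Kubernetes building block for this kind of policy is pod priority and preemption; the sketch below defines a low-priority class for bursty, preemptible pods. This illustrates the mechanism only, not necessarily CERIT-SC's actual implementation, and the name and value are hypothetical.

import sys
import yaml  # PyYAML

# A low-priority class for opportunistic research pods: the scheduler may
# preempt them when pods of higher-priority classes need the capacity.
priority_class = {
    "apiVersion": "scheduling.k8s.io/v1",
    "kind": "PriorityClass",
    "metadata": {"name": "opportunistic"},  # hypothetical name
    "value": 1000,                          # below whatever default classes exist
    "globalDefault": False,
    "description": "Bursty research pods that may be preempted.",
}

yaml.safe_dump(priority_class, sys.stdout, sort_keys=False)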
  • 1 participant
  • 7 minutes
scheduling
capacity
cpu
allocations
kubernetes
infrastructure
checkpoints
queues
ports
efficiently

19 May 2022

Keynote: High Performance Computing on Google Kubernetes Engine - Maciek Różacki, Google Cloud

Google Kubernetes Engine is already a platform of choice for highly demanding high-performance computing workloads. We will present how we are investing in pushing the capabilities of our product further to maximize users' scientific output with ease, cost efficiency, and industry-leading performance.
  • 1 participant
  • 7 minutes
kubernetes
gcp
users
batch
session
offering
event
cloud
google
thinking

19 May 2022

Kueue: A Kubernetes-native Job Queueing - Abdullah Gharaibeh, Google

Most Kubernetes core components are pod-centric, including the scheduler and cluster autoscaler. This works well for service workloads, where the pods of a service are mostly independent and all services are expected to be running at all times. For batch workloads, however, it does not make sense to focus only on pods: the partial execution of pods from multiple parallel batch jobs may lead to deadlocks where many jobs are simultaneously active while none is able to make sufficient progress to completion, or to start at all. Even for single-pod batch jobs, whether on-prem or in the cloud with autoscaling capabilities, the reality is that clusters have finite capacity: constraints on resource usage exist for quota and cost management (especially true for GPUs), so users want an easy way to fairly and efficiently share resources. Kueue addresses the above limitations, offering the queueing capabilities commonly found in legacy batch schedulers in the most Kubernetes-native way. It is a Kubernetes subproject currently under development at https://github.com/kubernetes-sigs/kueue.
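To make the queueing model concrete, here is a sketch of a batch Job handed to Kueue: it is created suspended and names the local queue it wants quota from. Early Kueue releases read the queue name from the annotation shown below (later releases use a label of the same name); the queue name and job details are hypothetical.

import sys
import yaml  # PyYAML

# A suspended batch/v1 Job pointed at a Kueue queue; Kueue unsuspends it
# once quota in the queue becomes available.
job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {
        "name": "simulation",
        "annotations": {"kueue.x-k8s.io/queue-name": "team-a-queue"},  # hypothetical queue
    },
    "spec": {
        "suspend": True,   # start queued, not running
        "parallelism": 4,
        "completions": 4,
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "sim",
                    "image": "busybox",
                    "command": ["sh", "-c", "echo step done"],
                    "resources": {"requests": {"cpu": "1"}},
                }],
            },
        },
    },
}

yaml.safe_dump(job, sys.stdout, sort_keys=False)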
  • 7 participants
  • 35 minutes
kubernetes
kueue
batching
queue
workloads
job
task
process
handling
execution

19 May 2022

Opening + Welcome - Abdullah Gharaibeh & Ricardo Rocha, Kubernetes Batch + HPC Day Program Committee Members
  • 2 participants
  • 6 minutes
batch
kubernetes
initiative
hosting
cloud
session
presentations
workloads
google
days

19 May 2022

Resource Orchestration of HPC on Kubernetes: Where We Are Now and the Journey Ahead! - Swati Sehgal & Francesco Romani, Red Hat

Kubernetes has become the norm for orchestrating containerized microservice applications in the cloud and enterprise domains; it is, however, not yet widely adopted in HPC. HPC enablement on Kubernetes is still a challenge due to requirements like NUMA-aware scheduling, advanced resource reservation/allocation capabilities, and managing job dependencies and synchronization. Resource managers in the kubelet facilitate the allocation and NUMA alignment of CPU, memory, and devices. The information disconnect between the kubelet and the scheduler, however, is still a gap that needs to be addressed. The scheduler is oblivious to resource availability at the more granular NUMA-zone level, which can lead to suboptimal scheduling decisions that place workloads on nodes where alignment of resources is impossible. Contributors from SIG Node formed a team to address this problem and implement a NUMA-aware scheduler and the related infrastructure. Representing the team, the presenters will walk attendees through the journey of this feature, the challenges encountered, the end-to-end solution, current adoption, its roadmap, and the deployment steps for optimized workload performance.
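The node-side half of the alignment story is the kubelet's resource managers. Below is a sketch of the relevant KubeletConfiguration fields, assuming the standard static CPU manager and single-NUMA-node Topology Manager policies; the scheduler-side NUMA awareness the talk describes builds on top of this.

import sys
import yaml  # PyYAML

# Kubelet configuration fragment enabling NUMA-aligned placement on the node:
# the static CPU manager grants exclusive CPUs to Guaranteed pods, and the
# Topology Manager rejects admissions that would span NUMA zones.
kubelet_config = {
    "apiVersion": "kubelet.config.k8s.io/v1beta1",
    "kind": "KubeletConfiguration",
    "cpuManagerPolicy": "static",
    "topologyManagerPolicy": "single-numa-node",
}

yaml.safe_dump(kubelet_config, sys.stdout, sort_keys=False)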
  • 2 participants
  • 26 minutes
workloads
scheduling
kubernetes
cpu
cluster
capabilities
network
resource
manage
concerns

19 May 2022

Volcano – Cloud Native Batch System for AI, BigData and HPC - William(LeiBo) Wang, Huawei Cloud Computing Co., Ltd

Volcano is a cloud native batch system and the first batch computing project in CNCF. Its major use cases are in the field of high-performance computing (HPC), such as big data, AI, and gene computing. Volcano offers job-based fair-share, priority, preemption, reclaim, and queue management abilities, which are important for HPC users. It has integrated with the computing ecosystem, including spark-operator, flink-operator, Kubeflow, and Cromwell, across the big data, AI, and HPC domains. This year Volcano is also being integrated natively into Spark as a custom batch scheduler, and many new features are being developed by contributors, e.g. co-location, elastic training, vGPU, throughput optimization, and multi-cluster scheduling for HPC users.

The community has helped more than 50 users deploy Volcano in their production environments around the world since it was open-sourced in 2019. William (Leibo) Wang, the tech lead of the Volcano community, will present the latest features, use cases, progress, roadmap, and best practices. He will also show how to accelerate AI training, serving, and big data analysis, and how to improve cluster utilization, based on Volcano and other cloud native projects.
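A sketch of the job-level abilities above in manifest form: a Volcano Job that gang-schedules a group of workers via minAvailable and targets a queue. Field names follow Volcano's batch.volcano.sh/v1alpha1 CRD as commonly documented; the queue, image, and sizes are hypothetical.

import sys
import yaml  # PyYAML

# A Volcano Job: minAvailable asks the scheduler to start the job only when
# all four workers fit, avoiding the partial, deadlock-prone executions that
# pod-by-pod scheduling can produce.
volcano_job = {
    "apiVersion": "batch.volcano.sh/v1alpha1",
    "kind": "Job",
    "metadata": {"name": "distributed-training"},
    "spec": {
        "schedulerName": "volcano",
        "queue": "default",  # hypothetical queue
        "minAvailable": 4,   # gang scheduling: all four workers or none
        "tasks": [{
            "name": "worker",
            "replicas": 4,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": "example/trainer:latest",  # hypothetical image
                        "resources": {"requests": {"cpu": "2", "memory": "4Gi"}},
                    }],
                },
            },
        }],
    },
}

yaml.safe_dump(volcano_job, sys.stdout, sort_keys=False)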
  • 1 participant
  • 24 minutes
kubernetes
cluster
computing
volcano
cloud
workloads
ai
scheduling
gpu
tensorflow