19 May 2022
Apache YuniKorn: A Kubernetes Scheduler Plugin for Batch Workloads - Wilfred Spiegelenburg, Cloudera & Craig Condit, Cloudera
Kubernetes has historically focused on service-type workloads. Stateful workloads have also become better supported in recent releases. Batch scheduling continues to lag in Kubernetes core. To better support batch scheduling, several alternative schedulers have been created, including Apache YuniKorn, which has a growing community and is utilised by several large organisations such as Alibaba, Apple, and Cloudera. Over the past few years, Apache YuniKorn has matured into a highly performant, flexible workload scheduler. Recently, we have enhanced Apache YuniKorn with a new execution mode which allows its full power and flexibility to be deployed as a set of plugins to the default Kubernetes scheduler, letting service and batch workloads coexist seamlessly. This session will dive into using Apache YuniKorn to schedule batch workloads, leveraging advanced options such as workload queueing and quota sharing without affecting traditional non-batch Kubernetes workloads.
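As a rough illustration of the queueing and quota-sharing model the abstract describes, a batch pod is mapped to a YuniKorn queue via pod labels. This is a sketch only: the label keys `applicationId` and `queue`, the queue name `root.sandbox`, and the plugin-mode behaviour noted in the comments are assumptions based on common YuniKorn examples, not this talk's material.

```python
# Sketch: build a pod manifest that asks YuniKorn to place a batch pod
# into a specific queue. Label keys and queue names are assumptions.
def batch_pod(name: str, app_id: str, queue: str) -> dict:
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "labels": {
                "applicationId": app_id,  # groups pods into one application
                "queue": queue,           # target YuniKorn queue
            },
        },
        "spec": {
            # In plugin mode the default scheduler binary embeds YuniKorn,
            # so non-batch pods are untouched; in standalone mode you would
            # also set "schedulerName": "yunikorn" here.
            "containers": [{"name": "main", "image": "busybox",
                            "command": ["sleep", "60"]}],
        },
    }

pod = batch_pod("job-pod-0", "spark-app-1", "root.sandbox")
```

Pods without these labels would fall through to ordinary scheduling, which is the point of the plugin mode: batch and service workloads share one scheduler.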
- 4 participants
- 30 minutes
19 May 2022
Closing - Aldo Culquicondor, Kubernetes Batch + HPC Day Program Committee Member
- 1 participant
- 3 minutes
19 May 2022
Efficient Deep Learning Training with Ludwig AutoML, Ray, and Nodeless Kubernetes - Anne Marie Holler, Elotl & Travis Addair, Predibase
Deep Learning (DL) has been successfully applied to many fields, including computer vision, natural language, business, and science. The open-source platforms Ray and Ludwig make DL accessible to diverse users by reducing complexity barriers to training, scaling, deploying, and serving DL models. However, DL's cost and operational overhead present significant challenges. DL model dev/test/tuning requires intermittent use of substantial GPU resources, which cloud vendors are well-positioned to provide, though at non-trivial prices. Given the expense, managing GPU resources is critical to the practical use of DL. This talk describes running Ray and Ludwig on cloud Kubernetes clusters, using Nodeless K8s to add right-sized GPU resources when they are needed and to remove them when not. Experiments comparing the cost and operational overhead of using Nodeless K8s vs running directly on EC2 show sizable improvements in efficiency and usability.
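The "add GPUs when needed, remove when not" flow hinges on a pod declaring an explicit GPU request: while such a pod is Pending, a nodeless provisioner can add a right-sized GPU node, and remove it once the pod completes. A minimal sketch, assuming the standard `nvidia.com/gpu` extended resource; the image name is hypothetical:

```python
# Sketch: a pod whose GPU request is the signal a "nodeless" provisioner
# reacts to. While the pod is Pending, a right-sized GPU node is added;
# when the pod finishes and the node drains, the node is removed again.
def training_pod(name: str, gpus: int) -> dict:
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "trainer",
                "image": "example/ludwig-ray-gpu",  # hypothetical image
                "resources": {
                    # GPUs must appear under limits; quantities serialize
                    # as strings in the manifest.
                    "limits": {"nvidia.com/gpu": str(gpus)},
                },
            }],
        },
    }

pod = training_pod("ludwig-train-0", 4)
```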
- 5 participants
- 28 minutes
19 May 2022
Fast Data on-Ramp with Apache Pulsar on K8 - Timothy Spann, StreamNative
As the Apache Pulsar community grows, more and more connectors will be added. To enhance the availability of sources and sinks, and to make use of the greater Apache streaming community, joining forces between Apache NiFi and Apache Pulsar is a perfect fit.
Apache NiFi also adds the benefits of ELT, ETL, data crunching, transformation, validation, and batch data processing. Once data is ready to be an event, NiFi can launch it into Pulsar at light speed. I will walk through how to get started, present some use cases and demos, answer questions, and cover the benefits to the ecosystem.
https://www.datainmotion.dev/
https://github.com/tspannhw
- 2 participants
- 14 minutes
19 May 2022
Get More Computing Power by Helping the OS Scheduler - Antti Kervinen, Intel & Alexander Kanevskiy, Intel
When Linux schedules a thread on a CPU core, there is no guarantee which memories the thread will access. If the workload is lucky, the thread will use data that is already in CPU caches or in a memory that is close to the CPU core. If not, millions of memory operations need to travel a longer way to reach physical memory. Although this may sound too low-level to be controllable or to make a difference, you can easily help the scheduler when running Kubernetes workloads, and make a big difference! Antti and Sasha show how to get a lot more computing power out of your CPUs by adding CRI Resource Manager (CRI-RM) to your Kubernetes nodes. CRI-RM affects process scheduling and memory locality by dynamically managing the CPU and memory pinning of all Kubernetes containers on the node. In case studies, CRI-RM has delivered major improvements in database and AI training performance without any workload-specific configuration or changes to upstream Kubernetes components.
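A toy model of what pinning buys you: keep each container's CPU allocation inside a single NUMA node so its memory allocations stay local. This is a simplified sketch of the idea, not CRI-RM's actual policy code, and the "most-free-node-first" placement heuristic is an assumption:

```python
# Toy NUMA-pinning model: assign each container's CPU request to a single
# NUMA node with enough free cores, so its memory accesses stay local
# instead of crossing to a remote node.
def pin(containers, numa_free):
    # containers: {name: cpus requested}; numa_free: {numa_node: free cores}
    placement = {}
    for name, cpus in sorted(containers.items(), key=lambda kv: -kv[1]):
        # place biggest containers first, on the node with the most free cores
        node = max(numa_free, key=numa_free.get)
        if numa_free[node] < cpus:
            raise RuntimeError(f"no single NUMA node can hold {name}")
        placement[name] = node
        numa_free[node] -= cpus
    return placement

# two NUMA nodes with 8 free cores each
print(pin({"db": 6, "cache": 4, "batch": 3}, {0: 8, 1: 8}))
# → {'db': 0, 'cache': 1, 'batch': 1}
```

Without pinning, the OS scheduler is free to migrate these threads across nodes, which is exactly the cross-node memory traffic the talk is about avoiding.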
- 3 participants
- 20 minutes
19 May 2022
How to Handle Fair Scheduling in a Private Academic K8s infrastructure - Lukas Hejtmanek, Masaryk University & Dalibor Klusacek, CESNET
While the usefulness of container-oriented computing is widely recognized, its adoption in academic environments is not so straightforward. Existing orchestrators like Kubernetes are not primarily designed to support fair execution of (bursty) workloads belonging to various researchers and/or competing projects. While public providers use an efficient pay-per-use model, academic use cases often expect the traditional fair-sharing mechanisms widely available in current HPC installations. This talk will discuss the challenges related to the application of containerized computing within the K8s-operated infrastructure used by various users and research groups in the CERIT-SC infrastructure. Specifically, we will discuss how CERIT-SC guarantees that eligible pods will be executed in a reasonable time frame, making sure that running pods of other users will eventually free their allocations to guarantee fair use of available resources.
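The fair-sharing mechanism referenced above can be sketched in its classic HPC form: order users by how much they have recently consumed relative to their entitlement, so bursty users cannot starve the rest. A minimal sketch of the general technique, not CERIT-SC's implementation:

```python
# Minimal fair-share sketch: start the next pod from the user whose recent
# usage is smallest relative to their share. Heavy recent users sink in
# priority; under-served users rise, regardless of submission bursts.
def next_user(usage, shares):
    # usage: resource-hours consumed per user; shares: relative entitlement
    return min(shares, key=lambda u: usage.get(u, 0.0) / shares[u])

usage = {"alice": 120.0, "bob": 30.0, "carol": 90.0}
shares = {"alice": 1.0, "bob": 1.0, "carol": 2.0}  # carol's project has 2x share
print(next_user(usage, shares))  # bob (30/1 < 90/2 < 120/1)
```

Real fair-share implementations typically also decay usage over time so old consumption is gradually forgotten; the ratio above is the core of the ordering.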
- 1 participant
- 7 minutes
19 May 2022
Keynote: High Performance Computing on Google Kubernetes Engine - Maciek Różacki, Google Cloud
Google Kubernetes Engine is already a platform of choice for highly demanding high-performance computing workloads. We will present how we're investing in pushing the capabilities of our product further to maximize users' scientific output with ease, cost efficiency, and industry-leading performance.
- 1 participant
- 7 minutes
19 May 2022
Kueue: A Kubernetes-native Job Queueing - Abdullah Gharaibeh, Google
Most Kubernetes core components are pod-centric, including the scheduler and cluster autoscaler. This works well for service workloads, where the pods of a service are mostly independent and all services are expected to be running at all times. However, for batch workloads, it does not make sense to focus only on pods: partial execution of pods from multiple parallel batch jobs may lead to deadlocks, where many jobs are simultaneously active yet none can make sufficient progress to completion, or even start at all. Even for single-pod batch jobs, whether on-prem or in the cloud with autoscaling capabilities, the reality is that clusters have finite capacity: constraints on resource usage exist for quota and cost management (especially true for GPUs), so users want an easy way to share resources fairly and efficiently. Kueue addresses the above limitations, offering the queueing capabilities that commonly exist in legacy batch schedulers in the most Kubernetes-native way. It is a Kubernetes subproject currently under development at https://github.com/kubernetes-sigs/kueue.
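The deadlock argument above is essentially an argument for all-or-nothing job admission against a quota: a job is admitted only when all of its pods fit, so no quota is tied up by partially running jobs. A simplified best-effort sketch of that idea (illustrative only, not Kueue's real API or admission policy):

```python
# Sketch of all-or-nothing job admission against a queue quota: a job is
# admitted only if ALL its pods fit at once, so partially running jobs
# never pin down quota that other jobs need to finish.
def admit(jobs, quota_cpus):
    # jobs: list of (name, pod_count, cpus_per_pod), in queue order
    admitted, free = [], quota_cpus
    for name, pods, cpus_per_pod in jobs:
        need = pods * cpus_per_pod
        if need <= free:          # whole job or nothing
            admitted.append(name)
            free -= need
    return admitted

queue = [("train-a", 4, 8), ("train-b", 2, 8), ("etl", 8, 1)]
print(admit(queue, quota_cpus=48))  # ['train-a', 'train-b']
```

This sketch skips over jobs that do not fit; a strict FIFO queue would instead block behind the head job, which is one of the policy choices a system like Kueue has to expose.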
- 7 participants
- 35 minutes
19 May 2022
Opening + Welcome - Abdullah Gharaibeh & Ricardo Rocha, Kubernetes Batch + HPC Day Program Committee Members
- 2 participants
- 6 minutes
19 May 2022
Resource Orchestration of HPC on Kubernetes: Where We Are Now and the Journey Ahead! - Swati Sehgal & Francesco Romani, Red Hat
Kubernetes has become the norm for orchestrating containerized microservice applications in the cloud and enterprise domains; it is, however, not yet widely adopted in HPC. HPC enablement on Kubernetes is still a challenge due to requirements like NUMA-aware scheduling, advanced resource reservation/allocation capabilities, and managing job dependencies and synchronization. Resource managers in the kubelet facilitate the allocation and NUMA alignment of CPU, memory, and devices. The information disconnect between the kubelet and the scheduler, however, is still a gap that needs to be addressed. The scheduler is oblivious to resource availability at the more granular NUMA-zone level, which can lead to suboptimal scheduling decisions placing workloads on nodes where alignment of resources is impossible. Contributors from sig-node formed a team to address this problem and implement a NUMA-aware scheduler and the related infrastructure. Representing the team, the presenters will walk attendees through the journey of this feature, the challenges encountered, the end-to-end solution, current adoption, its roadmap, and the deployment steps for optimized workload performance.
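The gap described above becomes concrete in the filter step: a node can have enough free resources in aggregate while no single NUMA zone can satisfy the pod, so the kubelet's alignment fails only after placement. A sketch of a NUMA-aware node filter (illustrative of the idea, not the actual scheduler plugin code):

```python
# Sketch of a NUMA-aware filter: reject nodes where no single NUMA zone
# can satisfy the pod's request, so kubelet-side NUMA alignment cannot
# fail after the scheduler has already placed the pod.
def numa_fits(pod_cpus, pod_gpus, node_zones):
    # node_zones: list of (free_cpus, free_gpus) per NUMA zone
    return any(c >= pod_cpus and g >= pod_gpus for c, g in node_zones)

nodes = {
    # node-1 has 6 CPUs free in total, but split 3+3 across zones
    "node-1": [(3, 1), (3, 0)],
    # node-2 has 4 CPUs and a GPU free together in one zone
    "node-2": [(4, 1), (0, 0)],
}
feasible = [n for n, zones in nodes.items() if numa_fits(4, 1, zones)]
print(feasible)  # ['node-2']
```

A naive aggregate check would accept node-1 here and the pod would later fail (or run misaligned); surfacing per-zone availability to the scheduler is what closes the disconnect.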
- 2 participants
- 26 minutes
19 May 2022
Volcano – Cloud Native Batch System for AI, BigData and HPC - William(LeiBo) Wang, Huawei Cloud Computing Co., Ltd
Volcano is a cloud native batch system and the first batch computing project in CNCF. Its major use cases are in the field of high-performance computing (HPC), such as big data, AI, and gene computing. Volcano offers job-based fair-share, priority, preemption, reclaim, and queue management abilities, which are important for HPC users. It has integrated with the computing ecosystem, including spark-operator, flink-operator, Kubeflow, and Cromwell, across the big data, AI, and HPC computing domains. This year Volcano is also being integrated into Spark natively as its custom batch scheduler, and many new features are being developed by contributors, e.g. co-location, elastic training, vGPU, throughput optimization, and multi-cluster scheduling for HPC users.
The community has helped more than 50 users deploy Volcano in their production environments around the world since it was open-sourced in 2019. William (Leibo) Wang, the tech lead of the Volcano community, will present the latest features, use cases, progress, roadmap, and best practices. He will also show how to accelerate AI training, serving, and big data analysis, and how to improve cluster utilization based on Volcano and other cloud native projects.
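The job-based scheduling abilities listed above rest on gang scheduling: a job's pods are bound only when a minimum number of them can run at once, so distributed training never starts with a partial worker set. A sketch in the spirit of Volcano's minAvailable setting (a simplified model, not Volcano's code):

```python
# Gang-scheduling sketch: bind a job's pods only when at least
# min_available of them can run simultaneously; otherwise keep the whole
# gang queued rather than starting a partial, deadlock-prone worker set.
def gang_schedule(job_pods, min_available, free_slots):
    placeable = min(job_pods, free_slots)
    if placeable < min_available:
        return 0          # bind nothing; the gang waits as a unit
    return placeable      # bind the gang (possibly elastic above the minimum)

print(gang_schedule(job_pods=8, min_available=8, free_slots=5))   # 0
print(gang_schedule(job_pods=8, min_available=8, free_slots=10))  # 8
```

Setting min_available below job_pods gives the elastic behaviour mentioned in the abstract: the job starts once the minimum gang fits and scales up as slots free.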
- 1 participant
- 24 minutes