Cloud Native Computing Foundation Kubernetes Batch + HPC Day NA 2022 Open Meetings

1 Nov 2022

Don’t miss out! Join us at our upcoming event: KubeCon + CloudNativeCon Europe 2023 in Amsterdam, The Netherlands from April 17-21. Learn more at https://kubecon.io. The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.

Beyond Experimental: Spark on Kubernetes - Weiwei Yang, Apple

Apache Spark on Kubernetes takes advantage of containers and the large, rapidly growing Kubernetes ecosystem to maximize data processing capability in the cloud. However, running the combination in a large-scale production environment is not effortless. Challenges at scale, DevOps complexity, multi-cluster management, job scheduling, and autoscaling are all roadblocks that can quickly derail the effort. In this session, Bowen Li and Weiwei Yang will share their insights on leveraging open source technology such as Apache YuniKorn, the Spark K8s operator, and cloud primitives to evolve ML data infrastructure in the cloud, including considerations for multi-tenancy, observability, scalability, and cost-effectiveness.

7 participants
30 minutes

workflows

kubernetes

workloads

backend

provisioning

batch

server

resourcing

cloud

platform

1 Nov 2022

Coordinate Workloads Colocation: QoS-Oriented Scheduling Enhancement on K8s - Zuowei Zhang & Tao Li, Alibaba Cloud

Kubernetes provides well-defined QoS classes for pods: Guaranteed, Burstable, and BestEffort. Users can colocate workloads of different QoS classes to overcommit resources and improve cluster utilization. However, as scale increases and workloads diversify, some limitations are becoming more apparent:
· Lower-QoS pods are easily throttled or killed once a node runs out of resources
· The noisy neighbor problem affects the performance of latency-sensitive applications
· Local hot spots affect the global
We implement Koordinator on top of Kubernetes, with several add-ons that provide QoS-oriented scheduling enhancements:
· Definition of sub-QoS classes for complex workloads in co-location scenarios, compatible with the existing Kubernetes QoS semantics
· Use of dynamic node and pod metrics to provide a more reliable model for resource overcommitment, including resource usage profiles and micro metrics such as CPU scheduling and memory allocation latency
· Fine-grained resource orchestration and isolation mechanisms on the node to solve the noisy neighbor problem and improve the efficiency of latency-sensitive workloads and batch jobs
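
As background for the three standard QoS classes mentioned in this abstract, here is a minimal sketch of how Kubernetes derives a pod's class from its containers' resource requests and limits. It is an illustration only, not code from the talk or from Koordinator. The function name `qos_class` and the dictionary layout are assumptions made for this example, and quantities are compared as plain strings, so `"1"` and `"1000m"` would be treated as different values.

```python
from typing import Dict, List

# Hypothetical container shape: {"requests": {...}, "limits": {...}}
Container = Dict[str, Dict[str, str]]


def qos_class(containers: List[Container]) -> str:
    """Return the QoS class for a pod (simplified: string comparison only)."""
    any_set = False      # does any container set a request or a limit?
    guaranteed = True    # stays True only if every container qualifies
    for c in containers:
        limits = c.get("limits", {})
        # When a request is omitted but a limit is set, the request
        # defaults to the limit.
        requests = {**limits, **c.get("requests", {})}
        if requests or limits:
            any_set = True
        for res in ("cpu", "memory"):
            # Guaranteed needs a limit for both resources, equal to the request.
            if res not in limits or requests.get(res) != limits[res]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    print(qos_class([{"limits": {"cpu": "1", "memory": "1Gi"}}]))
    print(qos_class([{"requests": {"cpu": "500m"}}]))
    print(qos_class([{}]))
```

A pod is Guaranteed only when every container qualifies, which is why a single container without limits moves the whole pod to Burstable.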

2 participants
32 minutes

workloads

efficiency

utilization

capacity

scheduling

computing

kubernetes

data

services

centers

1 Nov 2022

Managed Kubernetes — Next Gen Academic Infrastructure? - Viktória Spišaková & Lukáš Hejtmánek, Masaryk University

2 participants
31 minutes

kubernetes

infrastructure

infrastructures

computing

researchers

institute

capacity

cvmfs

administrator

storage

28 Oct 2022

6 participants
13 minutes

benchmarking

benchmarks

toolkits

cpu

optimized

workloads

hpc

batch

monitoring

microservice

28 Oct 2022

Building Armada – Running Batch Jobs at Massive Scale on Kubernetes - Jamie Poole, G-Research

Thousands of GPUs. Hundreds of thousands of CPUs. Learn how (and why!) G-Research designed and built Armada - a system to enable massive throughput of batch jobs running on Kubernetes. In this session you’ll hear how we use large scale batch compute on Kubernetes to spot patterns in financial markets and predict the future. Armada enables us to schedule millions of batch jobs across many clusters and tens of thousands of nodes, getting optimum utilisation of our hardware to enable our researchers to run the latest machine-learning and advanced data science techniques across vast datasets. We’ll cover the architecture and approach of Armada, challenges and techniques for running Kubernetes at scale and some war stories and lessons learned along the way.

8 participants
35 minutes

armada

kubernetes

platforms

research

tooling

operationally

infrastructure

gan

docker

ai

28 Oct 2022

Hybrid Cloud Bursting Electronic Design Analysis Optical Proximity Correction (OPC) Flows to Public Cloud Managed Kubernetes Services - Derren Dunn, IBM & Gaurav Singh, Red Hat

Everyone is aware of the adage, “time is money”. No industry is more aware of this than the semiconductor industry in which time to market delays can cost billions of dollars. To do anything in this business, one must do 3 things: 1) design chips, 2) transfer design shapes to a photolithography mask, and 3) fabricate designs. In this talk, we will explore the transfer of design shapes to masks using OPC. OPC is an embarrassingly parallel high performance computing workload that is typically run on Linux clusters. Historically, semiconductor manufacturers have finite on-prem compute resources. To address compute limitations, we discuss methods to enable OPC hybrid cloud bursting using managed Kubernetes services. We present scaling OPC workloads from 1,000 pods to 10,000+ pods using managed Kubernetes services. Also, we explore benefits of using managed Kubernetes services in terms of performance, set-up, and the use of autoscalers to control costs at the job level.

6 participants
35 minutes

workflows

workflow

batch

processes

workloads

provisioning

jobs

customers

computational

servers

28 Oct 2022

Keynote: Fair Share - What Shared Responsibility Means for Managed Kubernetes Clusters - Mickey Boxell, Principal Product Manager, Oracle

Managed Kubernetes offerings provide users with a simple way to automatically deploy, scale, and manage Kubernetes: generally everything you need to quickly deploy a production-ready Kubernetes cluster. However, as a cluster operator, you are responsible for more than simply deploying containerized business logic. When you adopt a product with a managed life cycle, you need to know exactly what to plan for: more specifically, where does your responsibility end and where does the provider’s responsibility start? When new Kubernetes versions are released, is it your responsibility as an operator to update your control plane? How about your worker nodes? The nature of Kubernetes as a tool with a control plane and a data plane further complicates things. For example, users generally are not responsible for managing and operating control plane components including kube-apiserver, kube-controller-manager, kube-scheduler, or etcd. Worker nodes, however, generally exist in user tenancies, and because nodes execute private code and store sensitive data, providers’ access to them is limited. This talk will explain the support boundaries and shared responsibility of a managed Kubernetes service through the eyes of a cloud provider. It will advise users where to look for information about the parts of their system in need of care and feeding and those that can be comfortably trusted to a knowledgeable provider.

1 participant
8 minutes

shared

responsibility

responsibilities

kubernetes

managed

providers

workloads

server

cloud

deploying

28 Oct 2022

Lightning Talk: CNCF Batch Working Group Update - Alex Scammon, G-Research

This talk will present an update from the CNCF Batch System Initiative Working Group, a newly created group set up to discuss batch scheduling at the end-user level. It will focus on how the users and operators of today’s batch workloads currently interact with the many cloud-native batch-related projects such as Volcano, Armada, MCAD, YuniKorn, Slurm, and HTCondor. Ideally, the hope is to provide some rough guidance and information for the CNCF community on these higher-level batch scheduling approaches, since the landscape remains fairly opaque.

This presentation will discuss what the working group has worked on so far, what it's hoping to achieve, and (crucially!) how this working group differs from (but is closely related to!) the Kubernetes Batch working group.

3 participants
13 minutes

batches

batch

collective

scheduling

clusters

bunch

initiatives

context

conversation

currently

28 Oct 2022

Lightning Talk: Evolving the Job API to Become Defacto Standard for Batch Workloads - Aldo Culquicondor, Google

From Kubernetes 1.0 until recently, the Kubernetes Job API had a rather limited feature set. Meanwhile, multiple batch-oriented frameworks were developed, each re-implementing its own job API and controller to manage pods, with its own advantages and limitations, leading to fragmentation of the ecosystem. Starting in Kubernetes 1.21, contributors to SIG Apps decided to evolve the Job API, implement common patterns, and increase its scalability, so that it can become the de facto standard for running batch workloads or for building specialized frameworks on top of it. They introduced features such as indexed Jobs, suspended Jobs, tracking with finalizers, and failure policies. In this talk, Aldo will walk you through all these efforts, the challenges the contributors faced when implementing them, and the remaining opportunities they have identified to make the API more comprehensive and useful for a wider range of applications.
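
To make two of the features named above concrete, the following is a minimal sketch of an indexed Job that is created in a suspended state, built as a plain Python dictionary and printed as JSON. It is not taken from the talk. The helper names `indexed_job_manifest` and `shard_for_this_pod`, the Job name, and the image are hypothetical. The manifest fields (`completionMode`, `completions`, `parallelism`, `suspend`) and the `JOB_COMPLETION_INDEX` environment variable are the ones the `batch/v1` Job API documents for indexed Jobs.

```python
import json
import os


def indexed_job_manifest(name, image, completions, parallelism, suspend=False):
    """Build a batch/v1 Job manifest using Indexed completion mode."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completionMode": "Indexed",   # each pod gets a stable index
            "completions": completions,    # indexes 0 .. completions-1
            "parallelism": parallelism,    # pods running at the same time
            "suspend": suspend,            # True: create the Job without starting pods
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": "worker", "image": image}],
                }
            },
        },
    }


def shard_for_this_pod(items, completions):
    """Inside a pod: pick the work items that belong to this pod's index."""
    index = int(os.environ.get("JOB_COMPLETION_INDEX", "0"))
    return items[index::completions]


if __name__ == "__main__":
    # Hypothetical name and image, for illustration only.
    job = indexed_job_manifest("demo-job", "example.com/worker:latest", 4, 2, suspend=True)
    print(json.dumps(job, indent=2))
```

The printed JSON could be passed to `kubectl apply -f -`; setting `suspend` back to false afterwards lets the Job controller start creating pods.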

8 participants
29 minutes

kubernetes

workflows

bots

api

parallelism

application

process

controller

problems

cumbersome

28 Oct 2022

Lightning Talk: Fluence: Approaching a Converged Computing Environment - Daniel Milroy, Lawrence Livermore National Laboratory & Claudia Misale, IBM T.J. Watson Research Center

Adoption of cloud technologies by high performance computing (HPC) is accelerating, and HPC users want their applications to perform well everywhere. While container orchestration provides resiliency, elasticity, and declarative management, it is not designed to enable app performance like HPC schedulers. In particular, Kube-scheduler is not suited to scheduling emerging HPC workflows that require pods placed advantageously. In response to interest in scheduling flexibility, the K8s community developed the Scheduling Framework to integrate new policies and schedulers. KubeFlux, a Scheduling Framework plugin based on the Fluxion open-source HPC scheduler, provides HPC scheduling capability in K8s. We detail our improvements to the MPI Operator and demonstrate its scalability to 16,384 ranks. With the improved operator we compare the performance of HPC benchmark apps scheduled by Kube-scheduler and KubeFlux. We conclude that KubeFlux makes pod placements that enable much higher app performance than Kube-scheduler. KubeFlux is an example of the rich capability that can be added to K8s and paves the way to converged computing environments with the best capabilities of HPC and cloud.

6 participants
15 minutes

workflow

simulations

research

mpi

manage

benchmarks

hpc

bioinformatics

cluster

strain

28 Oct 2022

Make Kubernetes Networking Ready for World Class AI and HPC Workloads - Sunyanan Choochotkaew, IBM & Gaurav Singh, Red Hat

While use of Kubernetes for various services is growing rapidly, it is still behind in the world of HPC and AI clusters. Part of the reason is the lack of support for advanced features such as the multiple 100G networks available in HPC/AI systems. The vast majority of AI systems in hyperscalers such as IBM Cloud, AWS, Azure, and Oracle Cloud come with two to eight 100G network interfaces on the A100 GPU nodes. By default in Kubernetes, however, a pod has only one network interface, while attaching multiple interfaces is often a requirement in these scenarios. Multus unlocks the potential of multi-networking in Kubernetes, but there are still challenges in usability, manageability, and scalability. We present Multi-NIC CNI, a new open-source project, to democratize multiple-interface capability for everyone. This CNI saves users from concerns about environment heterogeneity and from having to acquire CNI-specific knowledge. This talk will introduce the architecture, use cases, and performance of the CNI, then show how beneficial it is for HPC/AI. We will demonstrate the CNI on a large-scale GPU cluster consisting of over 1400 GPUs and two 100G network interfaces that we built in IBM Cloud.

6 participants
27 minutes

workloads

microservice

servers

cloud

providers

gpus

ec2

throughput

virtual

efficiencies

28 Oct 2022

Panel Discussion: Fragmentation of the Batch Ecosystem in Kubernetes, Challenges and Solutions - Moderated by Abdullah Gharaibeh, Google; Diana J. Arroyo, IBM; Wilfred Spiegelenburg, Cloudera; Daniel Milroy, Lawrence Livermore National Laboratory; Albin S

Kubernetes has historically focused on service-type workloads: support for load balancing, rolling updates, spreading, and autoscaling are a few examples of features the community built for service workloads. While support for batch workloads lagged in Kubernetes core, recent progress has been made to make Kubernetes a native home for batch workloads, including major feature and scalability enhancements to the Job API and the establishment of the batch working group. This panel will discuss what is still missing in core k8s for batch support, what functionality we need to push upstream, and what should continue to be loosely defined so that we don't impose specific semantics on how batch jobs should run on k8s.

16 participants
53 minutes

kubernetes

microservices

providers

workflows

discussion

batch

capabilities

workshops

interfaces

moderating

28 Oct 2022

Propagating Programming Paradigms: Lifting High Performance Compute Into K8s - Marlow Weston, Intel

Users don’t care where items run, just IF they run and how long they need to wait for completion. We should be building systems where ONLY the hardware, network, storage, and security engineers worry about how to maximally leverage underlying hardware for performance. Increasingly popular AI/ML and traditional HPC workloads have many similarities. It is historically difficult for the users to deploy their workloads in HPC environments. Kubernetes, meanwhile, has focused on simplifying the cognitive load for the users at a cost to both performance and sustainability. We show how to lift paradigms from HPC to make more performant Kubernetes clusters. We give a history of where HPC has come up short in abstracting hardware away from the user. We highlight current Kubernetes projects that do aid in performance in a cloud-native fashion and will go over continuing gaps. We will show how to improve Kubernetes to optimize both performance and sustainability without added pain to the user.

9 participants
24 minutes

hpc

cloud

processors

users

thinking

pod

workloads

complexity

mpi

infrastructure

28 Oct 2022

Welcome + Opening Remarks - Ricardo Rocha, CERN

1 participant
8 minutes

batch

hpcn

conferences

cloud

scheduling

event

today

oracle

initiative

devops
