Cloud Native Computing Foundation / Kubernetes Batch + HPC Day NA 2022

These are all the meetings we have in "Kubernetes Batch + H…" (part of the organization "Cloud Native Computi…"). Click into individual meeting pages to watch the recording and search or read the transcript.

1 Nov 2022

Don’t miss out! Join us at our upcoming event: KubeCon + CloudNativeCon Europe 2023 in Amsterdam, The Netherlands from April 17-21. Learn more at https://kubecon.io. The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.

Beyond Experimental: Spark on Kubernetes - Weiwei Yang, Apple

Apache Spark on Kubernetes takes advantage of containers and the large, rapidly growing Kubernetes ecosystem to maximize data processing capability in the cloud. However, running a large-scale production environment is not an effortless combination. Challenges at scale, dev-ops complexity, multi-cluster management, job scheduling, and autoscaling are all roadblocks that can quickly derail the effort. In this session, Bowen Li and Weiwei Yang will share their insights on leveraging open source technology such as Apache YuniKorn, the Spark K8s operator, and cloud primitives to evolve ML data infrastructure in the cloud, including considerations for multi-tenancy, observability, scalability, and cost-effectiveness.
  • 7 participants
  • 30 minutes
workflows
kubernetes
workloads
backend
provisioning
batch
server
resourcing
cloud
platform

1 Nov 2022

Coordinate Workloads Colocation: QoS-Oriented Scheduling Enhancement on K8s - Zuowei Zhang & Tao Li, Alibaba Cloud

Kubernetes defines three QoS classes for pods: Guaranteed, Burstable, and BestEffort. Users can colocate workloads of different QoS classes to overcommit resources and improve cluster utilization. However, as scale increases and workloads diversify, some limitations become more pronounced:
· Lower-QoS pods are easily throttled or killed once a node runs out of resources
· The noisy neighbor problem affects the performance of latency-sensitive applications
· Local hot spots affect the global
We implemented Koordinator on top of Kubernetes, with several add-ons that provide QoS-oriented scheduling enhancements:
· Definition of sub-QoS classes for complex workloads in co-location scenarios, compatible with the existing Kubernetes QoS semantics
· Use of dynamic node and pod metrics to provide a more reliable model for resource overcommitment, including resource usage profiles and micro metrics such as CPU scheduling and memory allocation latency
· Fine-grained resource orchestration and isolation mechanisms on the node to solve the noisy neighbor problem and improve the efficiency of latency-sensitive workloads and batch jobs
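As a rough illustration of the three upstream QoS classes this abstract builds on, the sketch below classifies a pod from its containers' CPU and memory requests and limits. It is not part of the talk or of Koordinator. The function name `qos_class` and the dictionary layout are hypothetical, and quantities are compared as plain strings, which is a simplification (Kubernetes compares parsed quantities, so "500m" and "0.5" would be equal there).

```python
from typing import Dict, List

# Hypothetical input shape: one dict per container, e.g.
# {"requests": {"cpu": "1"}, "limits": {"cpu": "1", "memory": "1Gi"}}
Container = Dict[str, Dict[str, str]]


def qos_class(containers: List[Container]) -> str:
    """Return the QoS class name for a pod with the given containers."""
    any_set = False        # does any container set any request or limit?
    all_guaranteed = True  # does every container pin cpu and memory?

    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is given, the request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                all_guaranteed = False

    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"
```

A pod whose single container sets only a CPU request comes out as Burstable under this sketch, which is the class most exposed to the throttling described in the first limitation.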
  • 2 participants
  • 32 minutes
workloads
efficiency
utilization
capacity
scheduling
computing
kubernetes
data
services
centers

1 Nov 2022

Managed Kubernetes — Next Gen Academic Infrastructure? - Viktória Spišaková & Lukáš Hejtmánek, Masaryk University
  • 2 participants
  • 31 minutes
kubernetes
infrastructure
infrastructures
computing
researchers
institute
capacity
cvmfs
administrator
storage

28 Oct 2022

  • 6 participants
  • 13 minutes
benchmarking
benchmarks
toolkits
cpu
optimized
workloads
hpc
batch
monitoring
microservice

28 Oct 2022

Building Armada – Running Batch Jobs at Massive Scale on Kubernetes - Jamie Poole, G-Research

Thousands of GPUs. Hundreds of thousands of CPUs. Learn how (and why!) G-Research designed and built Armada - a system to enable massive throughput of batch jobs running on Kubernetes. In this session you’ll hear how we use large scale batch compute on Kubernetes to spot patterns in financial markets and predict the future. Armada enables us to schedule millions of batch jobs across many clusters and tens of thousands of nodes, getting optimum utilisation of our hardware to enable our researchers to run the latest machine-learning and advanced data science techniques across vast datasets. We’ll cover the architecture and approach of Armada, challenges and techniques for running Kubernetes at scale and some war stories and lessons learned along the way.
  • 8 participants
  • 35 minutes
armada
kubernetes
platforms
research
tooling
operationally
infrastructure
gan
docker
ai

28 Oct 2022

Hybrid Cloud Bursting Electronic Design Analysis Optical Proximity Correction (OPC) Flows to Public Cloud Managed Kubernetes Services - Derren Dunn, IBM & Gaurav Singh, Red Hat

Everyone is aware of the adage, “time is money”. No industry is more aware of this than the semiconductor industry in which time to market delays can cost billions of dollars. To do anything in this business, one must do 3 things: 1) design chips, 2) transfer design shapes to a photolithography mask, and 3) fabricate designs. In this talk, we will explore the transfer of design shapes to masks using OPC. OPC is an embarrassingly parallel high performance computing workload that is typically run on Linux clusters. Historically, semiconductor manufacturers have finite on-prem compute resources. To address compute limitations, we discuss methods to enable OPC hybrid cloud bursting using managed Kubernetes services. We present scaling OPC workloads from 1,000 pods to 10,000+ pods using managed Kubernetes services. Also, we explore benefits of using managed Kubernetes services in terms of performance, set-up, and the use of autoscalers to control costs at the job level.
  • 6 participants
  • 35 minutes
workflows
workflow
batch
processes
workloads
provisioning
jobs
customers
computational
servers

28 Oct 2022

Keynote: Fair Share - What Shared Responsibility Means for Managed Kubernetes Clusters - Mickey Boxell, Principal Product Manager, Oracle

Managed Kubernetes offerings provide users with a simple way to automatically deploy, scale, and manage Kubernetes - generally everything you need to quickly deploy a production-ready Kubernetes cluster. However, as a cluster operator, you are responsible for more than simply deploying containerized business logic. When you adopt a product with a managed life cycle you need to know what exactly to plan for: more specifically, where does your responsibility end and where does the provider’s responsibility start? When new Kubernetes versions are released, is it your responsibility as an operator to update your control plane? How about your worker nodes? The nature of Kubernetes as a tool with a control plane and a data plane further complicates things. For example, users generally are not responsible for managing and operating control plane components including kube-apiserver, kube-controller-manager, kube-scheduler, or etcd, but worker nodes generally exist in user tenancies and, because nodes execute private code and store sensitive data, providers’ access is limited. This talk will explain the support boundaries and shared responsibility of a managed Kubernetes service through the eyes of a cloud provider. It will advise users where to look for information about the parts of their system in need of care and feeding and those that can be comfortably trusted to a knowledgeable provider.
  • 1 participant
  • 8 minutes
shared
responsibility
responsibilities
kubernetes
managed
providers
workloads
server
cloud
deploying

28 Oct 2022

Lightning Talk: CNCF Batch Working Group Update - Alex Scammon, G-Research

This talk will present an update from the CNCF Batch System Initiative Working Group, a newly created group set up to discuss batch scheduling at the end-user level. It will focus on how the users and operators of today’s batch workloads currently interact with the many cloud-native batch-related projects like Volcano, Armada, MCAD, Yunikorn, Slurm, HTCondor, etc. Ideally, the hope is to provide some rough guidance and information for the CNCF community on these higher-level batch scheduling approaches, since the landscape remains fairly opaque.

This presentation will discuss what the working group has worked on so far, what it's hoping to achieve, and (crucially!) how this working group is different (but closely related!) to the Kubernetes Batch working group.
  • 3 participants
  • 13 minutes
batches
batch
collective
scheduling
clusters
bunch
initiatives
context
conversation
currently

28 Oct 2022

Lightning Talk: Evolving the Job API to Become Defacto Standard for Batch Workloads - Aldo Culquicondor, Google

From Kubernetes 1.0 until recently, the Kubernetes Job API had a rather limited feature set. Meanwhile, multiple batch-oriented frameworks were developed, each re-implementing its own job API and controller to manage pods, with its own advantages and limitations, leading to fragmentation of the ecosystem. Starting in Kubernetes 1.21, contributors to SIG Apps decided to evolve the Job API, implement common patterns, and increase its scalability, so that it can become the de facto standard for running batch workloads or for building specialized frameworks on top of. They introduced features such as indexed Jobs, suspended Jobs, tracking with finalizers, failure policies, etc. In this talk, Aldo will walk you through all these efforts, the challenges that we faced when implementing them, and the remaining opportunities we have identified to make the API more comprehensive and useful for a wider range of applications.
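To make two of the features named above concrete, here is a minimal sketch that builds a `batch/v1` Job manifest combining indexed completion with a suspended start. It is an illustration rather than material from the talk. The helper name `indexed_job`, the container name `worker`, and the shell command are hypothetical, while the fields `completionMode`, `completions`, `parallelism`, and `suspend`, and the `JOB_COMPLETION_INDEX` environment variable, come from the Job API.

```python
import json
from typing import Any, Dict


def indexed_job(name: str, image: str, completions: int,
                parallelism: int, suspend: bool = True) -> Dict[str, Any]:
    """Build a Job manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Indexed mode gives each pod a completion index 0..completions-1.
            "completionMode": "Indexed",
            "completions": completions,
            "parallelism": parallelism,
            # A suspended Job creates no pods until suspend is set to false,
            # which lets a higher-level queueing component decide when to start it.
            "suspend": suspend,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "worker",
                        "image": image,
                        # The index is exposed to the container as an env var.
                        "command": ["sh", "-c",
                                    "echo processing shard $JOB_COMPLETION_INDEX"],
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    # JSON is accepted wherever a YAML manifest is, e.g. by `kubectl apply -f -`.
    print(json.dumps(indexed_job("shards", "busybox", 5, 2), indent=2))
```

Building the manifest as data keeps the sketch free of any client library; whether a real workflow would use a client or a static manifest file is left open here.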
  • 8 participants
  • 29 minutes
kubernetes
workflows
bots
api
parallelism
application
process
controller
problems
cumbersome

28 Oct 2022

Lightning Talk: Fluence: Approaching a Converged Computing Environment - Daniel Milroy, Lawrence Livermore National Laboratory & Claudia Misale, IBM T.J. Watson Research Center

Adoption of cloud technologies by high performance computing (HPC) is accelerating, and HPC users want their applications to perform well everywhere. While container orchestration provides resiliency, elasticity, and declarative management, it is not designed to enable app performance like HPC schedulers. In particular, Kube-scheduler is not suited to scheduling emerging HPC workflows that require pods placed advantageously. In response to interest in scheduling flexibility, the K8s community developed the Scheduling Framework to integrate new policies and schedulers. KubeFlux, a Scheduling Framework plugin based on the Fluxion open-source HPC scheduler, provides HPC scheduling capability in K8s. We detail our improvements to the MPI Operator and demonstrate its scalability to 16,384 ranks. With the improved operator we compare the performance of HPC benchmark apps scheduled by Kube-scheduler and KubeFlux. We conclude that KubeFlux makes pod placements that enable much higher app performance than Kube-scheduler. KubeFlux is an example of the rich capability that can be added to K8s and paves the way to converged computing environments with the best capabilities of HPC and cloud.
  • 6 participants
  • 15 minutes
workflow
simulations
research
mpi
manage
benchmarks
hpc
bioinformatics
cluster
strain

28 Oct 2022

Make Kubernetes Networking Ready for World Class AI and HPC Workloads - Sunyanan Choochotkaew, IBM & Gaurav Singh, Red Hat

While use of Kubernetes for various services is growing rapidly, it still lags behind in the world of HPC and AI clusters. Part of the reason is the lack of support for advanced features like the multiple 100G networks available in HPC/AI systems. The vast majority of AI systems in hyperscalers such as IBM Cloud, AWS, Azure, and Oracle Cloud come with two to eight 100G network interfaces on the A100 GPU nodes. However, by default in Kubernetes a pod has only one network interface, while attaching multiple interfaces is often a requirement in these scenarios. Multus unlocks the potential of multi-networking in Kubernetes, but there are still challenges in usability, manageability, and scalability. We present Multi-NIC CNI, a new open-source project, to democratize multiple-interface capability for everyone. This CNI saves users from concerns about environment heterogeneity and from having to acquire CNI-specific knowledge. This talk will introduce the architecture, use cases, and performance of the CNI, then show how beneficial it is for HPC/AI. We will demonstrate the CNI on a large-scale GPU cluster consisting of over 1400 GPUs and two 100G network interfaces that we built in IBM Cloud.
  • 6 participants
  • 27 minutes
workloads
microservice
servers
cloud
providers
gpus
ec2
throughput
virtual
efficiencies

28 Oct 2022

Panel Discussion: Fragmentation of the Batch Ecosystem in Kubernetes, Challenges and Solutions - Moderated by Abdullah Gharaibeh, Google; Diana J. Arroyo, IBM; Wilfred Spiegelenburg, Cloudera; Daniel Milroy, Lawrence Livermore National Laboratory; Albin S

Kubernetes historically focused on service-type workloads: support for load balancing, rolling updates, spreading, and autoscaling are a few examples of features the community built for service workloads. While support for batch workloads lagged in Kubernetes core, recent progress has been made to make Kubernetes a native home for batch workloads, including major feature and scalability enhancements to the Job API and the establishment of the Batch Working Group. This panel will discuss what is still missing in core k8s for batch support, what functionality we need to push upstream, and what should continue to be loosely defined so that we don't impose specific semantics on how batch jobs should run on k8s.
  • 16 participants
  • 53 minutes
kubernetes
microservices
providers
workflows
discussion
batch
capabilities
workshops
interfaces
moderating

28 Oct 2022

Propagating Programming Paradigms: Lifting High Performance Compute Into K8s - Marlow Weston, Intel

Users don’t care where items run, just IF they run and how long they need to wait for completion. We should be building systems where ONLY the hardware, network, storage, and security engineers worry about how to maximally leverage underlying hardware for performance. Increasingly popular AI/ML and traditional HPC workloads have many similarities. It is historically difficult for the users to deploy their workloads in HPC environments. Kubernetes, meanwhile, has focused on simplifying the cognitive load for the users at a cost to both performance and sustainability. We show how to lift paradigms from HPC to make more performant Kubernetes clusters. We give a history of where HPC has come up short in abstracting hardware away from the user. We highlight current Kubernetes projects that do aid in performance in a cloud-native fashion and will go over continuing gaps. We will show how to improve Kubernetes to optimize both performance and sustainability without added pain to the user.
  • 9 participants
  • 24 minutes
hpc
cloud
processors
users
thinking
pod
workloads
complexity
mpi
infrastructure

28 Oct 2022

Welcome + Opening Remarks - Ricardo Rocha, CERN
  • 1 participant
  • 8 minutes
batch
hpcn
conferences
cloud
scheduling
event
today
oracle
initiative
devops