From YouTube: Managed Kubernetes — Next Gen Academic Infrastructure? - Viktória Spišaková & Lukáš Hejtmánek
Description
Don’t miss out! Join us at our upcoming event: KubeCon + CloudNativeCon Europe 2023 in Amsterdam, The Netherlands from April 17-21. Learn more at https://kubecon.io. The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.
Managed Kubernetes — Next Gen Academic Infrastructure? - Viktória Spišaková & Lukáš Hejtmánek, Masaryk University
As an example, let me introduce the research and educational infrastructure in the Czech Republic, apart from the actual supercomputing center. We have two main kinds of infrastructure available to scientists: an HPC infrastructure and a Kubernetes infrastructure.
The HPC infrastructure consists of 32,000 CPU cores, it has 15 petabytes of storage capacity, and it is used by three thousand active users. Those users are running about 20,000 jobs every day, and we also have 360 GPUs of various kinds. This infrastructure is based on the PBS Pro batch system. The other one is the Kubernetes infrastructure.
It consists of 2,500 CPU cores. It has about one hundred terabytes of dedicated storage capacity, backed by a flash-only storage array. It is currently used by about 200 users, who are running 1,000 pods every day, and this infrastructure is equipped with 50 GPUs. Some of them are NVIDIA A100s that are yet to be installed, and we will experiment with the MIG technology as well. This Kubernetes infrastructure is based on the Rancher (RKE) distribution.
Speaking of managed Kubernetes, what can you imagine under that term? Basically, it means that a DevOps team manages the infrastructure. We offer tight integration with the rest of our infrastructure, like the HPC, and we aim to offer many components that allow easy deployment of user applications. We have, for instance, several storage classes, such as NFS, Samba, SSHFS, or CVMFS.
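As a minimal sketch of how a user consumes one of these storage classes, a PersistentVolumeClaim only has to name it; the storage class name `sshfs` below is an assumption for illustration, not necessarily the exact name used on our clusters.

```yaml
# Hypothetical example: claim storage from one of the offered storage classes.
# The class name "sshfs" is an assumed illustration.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: home-storage
spec:
  accessModes:
    - ReadWriteMany        # network filesystems typically allow shared access
  storageClassName: sshfs  # e.g. nfs, smb, sshfs, cvmfs
  resources:
    requests:
      storage: 10Gi
```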
We also integrated CephFS, but this storage class uses a special version of the CephFS driver. This driver has been patched so that we are able to change the user ID and group ID that are locally visible, so it does not matter under which user ID the container runs. The patch is public as a pull request upstream, but as far as I know, it is still not merged. We also have a Onedata storage class, and both of these storage classes are implemented as FUSE CSI drivers.
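The idea of the patch can be sketched as a storage class that carries the desired ownership mapping; the `mappedUid`/`mappedGid` parameter names below are purely illustrative assumptions, since the actual patched CephFS CSI driver may expose this differently.

```yaml
# Illustrative sketch only: a storage class whose (patched) CSI driver maps
# the locally visible owner of the mounted volume. Parameter names are assumed.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs-mapped
provisioner: cephfs.csi.ceph.com
parameters:
  clusterID: ceph-cluster   # assumed cluster identifier
  mappedUid: "1000"         # hypothetical: uid shown inside the container
  mappedGid: "1000"         # hypothetical: gid shown inside the container
reclaimPolicy: Delete
```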
We have a workaround so that the CSI driver can be restarted without breaking the mount points.
Next, we have an integration of the DNS system for Ingresses and load balancers. It means that a DNS name is created for such a service, be it an Ingress or a LoadBalancer. We also provide Let's Encrypt certificates, both for Ingresses and also for non-web services.
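A minimal sketch of what this looks like from the user's side, assuming the common ExternalDNS and cert-manager tooling (the hostname, issuer name, and secret name are illustrative assumptions):

```yaml
# Hypothetical Ingress: ExternalDNS creates the DNS record for the host,
# and cert-manager obtains a Let's Encrypt certificate for it.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt   # assumed issuer name
spec:
  tls:
    - hosts: [my-app.example.org]
      secretName: my-app-tls
  rules:
    - host: my-app.example.org    # assumed name under the shared subdomain
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80
```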
We provide a single sign-on service based just on an annotation. If users want to use single sign-on for their application, they just need to add an annotation to the Ingress, and single sign-on is automatically registered and provided.
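As a sketch of the idea, under the assumption of an OAuth2-proxy-style setup behind the NGINX Ingress controller (the auth URLs and hostnames are assumptions, not our exact configuration):

```yaml
# Hypothetical single sign-on via Ingress annotations (NGINX auth subrequest).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-sso
  annotations:
    nginx.ingress.kubernetes.io/auth-url: "https://sso.example.org/oauth2/auth"
    nginx.ingress.kubernetes.io/auth-signin: "https://sso.example.org/oauth2/start"
spec:
  rules:
    - host: my-app.example.org
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80
```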
We also offer shared GPUs. It means that a single GPU can be shared by multiple containers or multiple users, but there are no guarantees about the resources consumed from the GPU. We also have a slightly modified GPU operator from NVIDIA that enforces the GPU allocation.
The users are allowed to run only unprivileged containers, which can be a bit limiting, but on the other hand, we do not force users to use any particular user ID, so they can have any user ID they want.
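A minimal sketch of what such an unprivileged pod looks like; the concrete user ID and image are illustrative, and the arbitrary user ID is exactly the point:

```yaml
# Hypothetical unprivileged pod: no root, no privilege escalation,
# but an arbitrary user ID chosen by the user.
apiVersion: v1
kind: Pod
metadata:
  name: unprivileged-example
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 4242          # any user ID the user wants
  containers:
    - name: app
      image: registry.example.org/my-app:latest   # assumed image
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
```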
Users also cannot install custom resource definitions or any other cluster-scoped resources; these operations are forbidden, and only an administrator can do so. It basically means that the DevOps team has to install such resources, but we want the users not to struggle with maintaining the infrastructure, maintaining Kubernetes, and maintaining all the components that need to be run, so that a user can focus only on their own application or workload and fully utilize the services the DevOps team provides.
However, we do not offer just an infrastructure. We go a bit further and prepared some prefabricated applications, such as JupyterHub and BinderHub, which are, needless to say, famous and very popular. Our JupyterHub offers integration with HPC storage systems via SSHFS, and we also have two special instances of JupyterHub. One is RStudio running inside JupyterHub, so a user can get RStudio in one click, integrated with the HPC storage systems; the other one is AlphaFold on demand.
This application is based on a collaborative Jupyter notebook, and we also integrated the Mol* viewer that allows users to preview the folded protein right in the application. JupyterHub and BinderHub are web applications that have their own login system, but next to those, we prepared other applications that are accessible directly in Rancher as Rancher applications. Those applications are mainly based on remote desktops, and we offer software such as MATLAB, ANSYS, the VMD viewer, or IBM CPLEX.
All these applications are based either on the VNC technology and protocol or on the WebRTC protocol. In the latter case, the user is given a fully 3D-accelerated desktop that is capable of almost anything. We also prepared containers that allow users to SSH into them over the network, and those containers behave much like virtual machines: the user does not have root access in the container, but on the other hand, using some tricks and hacks, the user can install any package or anything in such a container.
So it behaves much like a virtual machine. We also offer some web-based applications, such as code-server or Neo4j, as well as other applications such as a personal MinIO server or a personal Samba server. "Personal services" means that the user can run MinIO or Samba on their own and can connect their local computer to this service via S3 or the very popular Samba protocol, for instance from a Windows system.
Here you can find some examples of our prefabricated applications. On the top left, you can see RStudio running in JupyterHub. Below it, you can see the form for AlphaFold on demand; most of the parameters used by the standard AlphaFold scripts can be filled in there and in the next two screens. On the bottom right, you can see the Mol* viewer that offers a preview of the folded protein, shown above it on the top right.
So now let me reveal some implementation details. First, for remote desktops, our solution is completely unprivileged. It means that none of the participating containers needs privilege escalation or runs as root; everything runs just as a regular user. However, it required a patched X server, and it also requires some minor changes to the NVIDIA GPU operator. As I mentioned, we enforce the GPU allocation, and this enforcement denies sharing a GPU among containers, because NVIDIA_VISIBLE_DEVICES=all is ignored if it is the only request for a GPU.
However, we use a publicly available GPU-sharing workaround, and with this sharing we can share the GPU between the X server container, the desktop container, and the streamer container.
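To make the enforcement concrete, here is a hedged sketch of the two ways a container may ask for a GPU; in our modified setup the environment-variable route alone is ignored, and only the explicit resource request counts (the image name is an assumption):

```yaml
# Illustrative pod: the explicit resource request is honoured; setting only
# NVIDIA_VISIBLE_DEVICES=all without a request is ignored by the modified operator.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  containers:
    - name: cuda-app
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"          # ignored on its own in our setup
      resources:
        limits:
          nvidia.com/gpu: 1     # the enforced way to obtain a GPU
```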
I also mentioned that we offer an integration with the DNS system. However, we have no solution for name conflicts: currently, any user can select any domain name under some specific subdomain.
However, this subdomain is shared among all the users, so name conflicts can arise, and this currently has no solution with the ExternalDNS driver. Also, with Let's Encrypt certificates, there is one problem with the DNS challenge, because we offer to obtain certificates also for the whole subdomain.
That holds for both ExternalDNS and Let's Encrypt certificates, and in this case every user is able to get any certificate in this domain, because there is no real validation of the request, and we are not aware of any possible solution to this problem. Probably one of the solutions could be to create distinct DNS zones for each user or each group of users, but this is currently not implemented.
We decided to use Kubernetes also for sensitive data processing. We set up a small cluster that is dedicated only to sensitive data processing. This cluster is separated from the public cluster; however, this single small cluster is used by all the users that want to process sensitive data. We are working on ISO 27001 certification, which is equivalent to the NIST 800-53 certification.
But, as I have said, the single cluster is shared by distinct users, which brings some isolation challenges, mainly related to the usually single Ingress instance, and also to the Istio instance, which is not multi-tenant by default.
We do not run just a few web applications or remote desktops on our Kubernetes infrastructure; we also run HPC jobs on a pretty regular basis. Currently, we run the HPC jobs via workflow managers. We use two of them: one is Snakemake and the other one is Nextflow. Snakemake is integrated with the Task Execution Service from the GA4GH initiative, and Nextflow is directly integrated with Kubernetes.
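Since Nextflow drives Kubernetes directly by spawning one pod per task, the workflow's service account needs pod-level permissions in its namespace; the following RBAC sketch is an assumption about a typical setup, not our exact manifests:

```yaml
# Hypothetical RBAC for a workflow manager that creates task pods in its namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workflow-runner
  namespace: analysis           # assumed namespace
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "pods/status"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: workflow-runner
  namespace: analysis
subjects:
  - kind: ServiceAccount
    name: nextflow              # assumed service account name
    namespace: analysis
roleRef:
  kind: Role
  name: workflow-runner
  apiGroup: rbac.authorization.k8s.io
```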
As Lukáš already mentioned, there are limitations of HPC in Kubernetes. These limitations are eventually beneficial, because they bring research opportunities for the community. There are plenty of areas where research can be conducted; we started with scheduling challenges because they were the most prominent to us. I would like to present to you some of our research interests, the problems we tackled, the solutions we found, and new areas we would like to scrutinize.
Firstly, I am going to talk about effective resource allocation in Kubernetes. As we all attend the HPC day, I believe the majority of you have at some point asked, answered, discussed, or just come across the question of effective scheduling in any computing environment. Scheduling is an omnipresent topic, because everyone tries to come up with the best scheduling strategy that will accommodate the most jobs on all nodes, where no job will wait too long and cluster usage will be above 90% with no downtimes.
Sadly, this is not the reality, and we all experience a plethora of problems. We come from an academic environment, where computational resources are provided more or less for free to all researchers and academics. This is a very different approach from commercial providers, where you can prepay nodes for a desired time or follow a pay-as-you-use model.
When you pay for compute resources, you naturally don't want to pay providers more than necessary, not to mention if you have specific requests on resources such as graphical cards, whose usage can be really pricey. From the opposite point of view, providers reach very high resource usage because they combine the offered plans in a very smart and efficient way, and they overcommit very much.
However, burstable loads, as we call the applications with unstable resource usage, are not the only case that makes scheduling in Kubernetes hard. We distinguish between two types of these bursty jobs. One are long-running services that are used, say, three times a week for two hours; the second type are computations characterized by dynamic variation, where most of the time the resource usage is low, but for some short time, perhaps a more complex part of the computation, resource consumption spikes.
A third problem is posed by interactive jobs, which are common in HPC, for example when working with software like MATLAB or ANSYS. If an interactive workload is created, the user doesn't want to wait too long until the job moves from the waiting queue to running; they want to work instantly, or within a span of approximately two to three minutes. In Kubernetes, you can set a higher priority on the interactive job, but then you must decide which pod can be terminated.
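A sketch of the priority mechanism mentioned here, assuming illustrative names and values; pods of lower priority are the ones the scheduler may preempt:

```yaml
# Hypothetical priority class for interactive workloads.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: interactive-high
value: 1000000              # higher value = scheduled (and preempting) first
preemptionPolicy: PreemptLowerPriority
description: "Interactive jobs that must start within minutes."
---
apiVersion: v1
kind: Pod
metadata:
  name: matlab-session
spec:
  priorityClassName: interactive-high
  containers:
    - name: matlab
      image: registry.example.org/matlab-desktop:latest   # assumed image
```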
Lastly, the fourth scheduling problem is tied more to academia, where you need to enforce fairness and, at the same time, account everyone for their resource usage. Kubernetes does not implement any built-in accounting or user fairness if we talk about multi-tenant clusters, but these are crucial concepts. Imagine that you have a user who spawns too many interactive jobs: this user will use all of the resources, and a new user might never get to compute.
The good news is that there are some solutions to the problems we posed. One possible solution to the need to reserve resources is described in the manuscript linked below. The solution is based on the existence of small or large (it doesn't matter) jobs that can be evicted easily; maybe they do checkpoints, maybe their inherent logic counts with restarts.
Another, much easier solution would be to create separate clusters, where each cluster is dedicated to accommodating a specific workload type. One more solution is the vertical autoscaler, which should be available from Kubernetes version 1.25; the vertical autoscaler is able to scale resources on a running container.
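A hedged sketch of a VerticalPodAutoscaler object as provided by the autoscaler add-on (the target deployment name and bounds are assumptions; depending on the cluster version, applying a recommendation may still restart the pod):

```yaml
# Illustrative VerticalPodAutoscaler for a bursty deployment.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: bursty-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bursty-app        # assumed workload name
  updatePolicy:
    updateMode: "Auto"      # let the VPA apply its recommendations
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "4"
          memory: 8Gi
```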
Now I will move from effective resource allocation to HPC in Kubernetes. We have been researching the potential of the Kubernetes platform to run big workloads, such as analyses of genomics data using a workflow manager. We asked ourselves two questions: can HPC work in Kubernetes, and will short-living tasks perform better in Kubernetes? We answered those questions by performing several genomics analysis runs on different infrastructures, those being a traditional HPC environment with the OpenPBS batch scheduler and, as the second environment, a Kubernetes cluster. We compared NUMA-aware and non-NUMA-aware Kubernetes environments with a NUMA-aware OpenPBS environment.
From our observations, we can safely state that for Kubernetes to perform as well as, or even better than, a traditional HPC environment, proper NUMA configuration is the most important aspect of success. We have configured just the standard Kubernetes NUMA settings, so no custom solutions or deep system administration work was needed.
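For illustration, the standard knobs amount to kubelet settings along the following lines; the concrete values, especially the reserved memory, are assumptions and must match the node's real topology:

```yaml
# Illustrative kubelet configuration enabling NUMA-aware placement.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static                 # pin exclusive CPUs for Guaranteed pods
memoryManagerPolicy: Static              # allocate memory from the same NUMA node
topologyManagerPolicy: single-numa-node  # reject placements spanning NUMA nodes
reservedMemory:
  - numaNode: 0
    limits:
      memory: 1Gi                        # assumed reservation for system daemons
```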
We also found out that the NUMA memory manager has limitations, because the Kubernetes scheduler does not see the amount of available memory within each NUMA node; it observes only the whole state of the cluster.
It happened to us that many pods were rejected from the cluster due to an UnexpectedAdmissionError, which is unrecoverable. This error is caused by not enough memory on the NUMA node assigned to the pod. It truly happened just because the scheduler thought that enough memory was available overall, but the pod was assigned to a specific NUMA node which didn't have the memory.
Additionally, the time elapsed from a job being scheduled to it running is much shorter in Kubernetes, because container images are cached and therefore start almost immediately, whereas in the OpenPBS environment there is a bit of setup which, with a larger number of jobs, significantly delays the whole computation. As a net effect, runs in Kubernetes were much more stable overall.
In these figures, you can see the graphical interpretation of our results. The upper left picture shows that the average duration of long-running processes of the genomics analysis is the highest for the non-NUMA-aware Kubernetes environment; if we configure NUMA, the time is identical to, or just slightly higher than, OpenPBS. On the other side, the upper right picture shows that if we compare short-living tasks, Kubernetes, whether with NUMA or non-NUMA configuration, performs significantly better than OpenPBS.
To sum the infrastructure comparisons up, we just saw that Kubernetes is certainly capable of accommodating HPC workloads, and its performance could improve even more. We found out that the Kubernetes scheduler acts almost as a LIFO (last in, first out) queue, because it does not preserve the queue order, and the implemented exponential back-off just makes more mess in the queue.
We all hear about and really feel the rising prices of electricity, and we listen to the stories about how not deleting emails adds to climate change by keeping the servers on. The majority of cluster providers would agree that there are times when huge clusters are just turned on but not utilized, or utilized with really low effectivity.
A small container will truly be better for a static website than starting a whole virtual machine. Moreover, we can tune the hardware based on CPU usage, and powering the nodes on and off based on true usage is an idea to work on.
We came up with the thought that some cluster nodes could be dedicated to running specific workload types, similar to scavenger jobs or short-lived jobs. If there is a sudden spike in the number of pods of this type, a new node could be dynamically added to the cluster.
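One way to sketch such dedicated nodes, under the assumption of a taint-based setup (the taint key and label below are illustrative): the node repels ordinary pods, and short-lived pods opt in with a toleration and a node selector.

```yaml
# Hypothetical pod pinned to nodes dedicated to short-lived jobs.
# Assumes nodes tainted with: workload-type=short-lived:NoSchedule
# and labelled with:          workload-type: short-lived
apiVersion: v1
kind: Pod
metadata:
  name: short-lived-task
spec:
  nodeSelector:
    workload-type: short-lived
  tolerations:
    - key: workload-type
      operator: Equal
      value: short-lived
      effect: NoSchedule
  containers:
    - name: task
      image: registry.example.org/task:latest   # assumed image
```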
Lastly, I would like to mention the concept of the hybrid cloud, which can be seen as a solution to the scheduling problem as well. The idea is pretty straightforward and is based on connecting the HPC world with the Kubernetes world. The HPC world usually has more resources or better scheduling capabilities, and the Kubernetes world is perfect for other, let's say short-lived, workload types. We are currently working on an implementation of an OpenPBS connector, which would allow moving pods from Kubernetes to the PBS world transparently.