From YouTube: Knative Community Meetup May 2022 - Serverless Research with STeLLAR and vHive with Dmitrii Ustiugov
Description
Serverless clouds boost developer productivity by taking over cloud infrastructure management, allowing developers to focus on their service's business logic. This division of labor opens opportunities for systems researchers to innovate in serverless computing. However, leading serverless providers rely on proprietary infrastructure that is ill-suited for systems research in academia. In this demo, I will talk about vHive, a full-stack open-source ecosystem for benchmarking, experimentation, and innovation in serverless clouds, which is now in use at 25+ universities and companies worldwide.
A
…about this session. So, without further ado, I'm really happy to introduce today's speaker, Dmitrii Ustiugov. Dmitrii received his PhD from the University of Edinburgh and is now joining ETH Zurich as a postdoctoral researcher, and his presentation is "Turbocharging Serverless Research with STeLLAR and vHive". So, Dmitrii, over to you.
B
Thank you, it's great to be here. Can you enable sharing the screen?
B
So serverless allows service developers to focus on writing their code as a set of functions, whereas the cloud providers take care of all the heavy lifting, which is automatic scaling of function instances according to the traffic changes. This division of labor boosts developer productivity, allowing fast time to market for their products. So say that developers need to write a video analytics application with two functions.
B
First,
the
function,
the
first
function,
the
codes
frames
which
comes
from
a
camera
in
video
fragments
from
the
camera,
and
it
invokes
the
second
function
to
recognize
objects
in
the
string.
So
in
serverless,
this
application
can
be
written
with
two
functions
composed
with
simple
function
calls
and
I'm
sure
everyone
is
familiar
with
that.
So
this
flexibility
also
comes
with
pays
your
goal
combined
with
space,
you
go
billing,
where
the
developers
only
pay
for
the
cloud
resources
which
functions
actually
use.
B
So
this
is
why
service
clouds
simplify
the
programming
burden
and
also
on
have
cost
advantages
compared
to
the
conventional
clouds.
B
From
the
serverless
call
provider's
perspective,
the
situation
is
completely
different
and
to
accommodate
changes
in
the
key
invocation
traffic
in
front
of
each
function.
The
provider
scales
the
number
of
instances
of
each
function
on
demand,
so
a
function
can
have
from
zero
to
virtually
infinity
number
of
instances
which
can
change
at
any
time.
B
As a result, providers tend to over-provision the number of instances according to their understanding of the cost-performance trade-off. The second example of a serverless problem is inefficient communication across functions in a service, which is related to their stateless nature: because providers scale instances on demand, instances are not allowed to hold shared state, so that any instance can process any invocation. These are entire research directions, with a lot of papers coming out every year.
B
Our first work in serverless characterizes the state-of-the-art cloud offerings, and it was the first to compare cloud performance across different providers, even with proprietary infrastructure. This was acknowledged by the leading Amazon engineer Marc Brooker even before the paper got published.
B
To analyze the various components of the deep serverless stack, we introduced the vHive ecosystem, for which we had a full-day tutorial at ASPLOS, a top conference in systems, this year. Everything is on YouTube, so you can take a look. And there are a lot of universities which use vHive today for their research studies and also for coursework. At the same time, we have a lot of collaborators and contributors across many companies, and this keeps growing over time, and many of these companies have research groups, even several of them, which use vHive for either evaluating their accelerator products or for understanding serverless workflows.
B
Using
beehive,
we
innovated
in
several
systems,
domains
with
the
first
work
called
reap
snapshots,
which
improves
the
reaction
time
to
traffic
changes
by
accelerating
launching
new
instances
of
functions
by
using
far
functional
working
set
of
air
snapshots.
This
reap
technique
is
already
supported
in
in
the
latest
version
of
aws
firecracker
and
the
second
work
called
expedited.
Data
transfers
or
xdt
introduces
a
cloud,
a
novel,
cross-function
communication
fabric
for
serverless
clouds,
which
enables
data
transfers
at
the
wire
speed,
in
conjunction
with
existing
servers
of
the
scaling
infrastructure.
B
And
this
is
the
work
we
actually
prototype
in
k
native
and
there
is
another
work
which
called,
which
is
called
jukebox,
which
got
recently
published
in
the
top
conference,
which
talks
about
specializing
micro
architecture
for
service
workloads.
B
So
in
this
talk,
I'm
going
to
start
by
presenting
the
performance
analysis
of
the
commercial
clouds
that
we
did
with
two
code
stellar,
which
we
deviced
and
then
I'll
talk
about
the
experimentation
framework
that
we
hive
represents
and
show
you
how
to
address
the
real
world
problems,
whether
it's
a
cold
start
or
serverless
functions,
communication
and
I'll
briefly
touch
about
the
future
work
for
future
directions
of
the
ecosystem.
B
And so, to launch a new function instance, first a function image, for example a container image, has to be retrieved from the storage service.
B
So, to reason about the overall performance of serverless clouds, we devised a method and toolchain to analyze the performance of each of these fundamental components in isolation.
B
With
this
insight,
we
introduce
stellar
an
open
source
framework
for
performance,
analysis
of
serverless
clouds
by
configuring
function,
characteristics
and
the
load
scenarios.
Stellar
can
stress
any
of
these
components
in
isolation,
and
we
showcase
some
of
the
results
by
comparing
three
leading
cloud
providers.
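To make that concrete, here is a minimal sketch of this style of black-box probing in Go; the endpoint URL, the 15-minute idle gap, and the sample count are illustrative assumptions, not STeLLAR's actual code or parameters:

```go
// Cold-start probing sketch: idle long enough that the provider is likely to
// deprovision the instance, then time an end-to-end invocation.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func timeInvocation(url string) (time.Duration, error) {
	start := time.Now()
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return time.Since(start), nil
}

func main() {
	// Hypothetical HTTP-triggered function endpoint.
	url := "https://example.execute-api.us-east-1.amazonaws.com/prod/hello"
	for i := 0; i < 10; i++ {
		time.Sleep(15 * time.Minute) // let the instance go cold (assumed keep-alive window)
		if lat, err := timeInvocation(url); err == nil {
			fmt.Printf("cold-start sample %d: %v\n", i, lat)
		}
	}
}
```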
B
We noticed that Python container deployments are significantly slower and more unpredictable than the others. One possible explanation of this phenomenon is that a Golang program is compiled as a static binary, suggesting that both the zip and the container image comprise the same binaries, which are likely to be stored in the same storage service. Meanwhile, for Python, a container-based deployment shows higher median and tail latencies compared to the corresponding zip deployment. We attribute this behavior to the fact that Python imports modules dynamically, requiring on-demand accesses to multiple distinct files in the function image.
B
When combined with the container-based deployment method, we hypothesize that this results in multiple accesses to the function image storage, since the container runtime splits the image into chunks and loads them on demand. The additional accesses to the image store would explain the high cold-start times and latency variability which we observed for Python container-based deployments. In particular, the combination of the runtime choice and the deployment method can have a severe impact on the overall performance: compared to zip, container-based deployments for Python can significantly increase the response time.
B
With the first function saving the payload to storage and the second function loading the payload from storage, we capture these latencies with timestamps taken by the user code in these functions. On the left chart, one can see median latencies shown with solid lines and tail latencies shown as dashed lines. On the right chart, the CDF lines are for one-megabyte and one-gigabyte payload transfers, respectively.
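A minimal sketch of that timestamping scheme in Go, with the storage client abstracted behind a hypothetical interface (BlobStore and its methods are stand-ins for a real SDK such as an S3 client):

```go
// Transfer-latency measurement sketch: timestamp before the upload and after
// the download; the difference is the storage-based transfer latency.
package main

import (
	"fmt"
	"time"
)

// BlobStore is a hypothetical stand-in for any object storage client.
type BlobStore interface {
	Put(key string, data []byte) error
	Get(key string) ([]byte, error)
}

// measureTransfer mimics the two functions' user code in one process; in the
// real setup the two timestamps are taken in separate function instances.
func measureTransfer(store BlobStore, key string, payload []byte) (time.Duration, error) {
	start := time.Now() // producer-side timestamp
	if err := store.Put(key, payload); err != nil {
		return 0, err
	}
	if _, err := store.Get(key); err != nil {
		return 0, err
	}
	return time.Since(start), nil // consumer-side timestamp
}

func main() {
	fmt.Println("plug in a concrete BlobStore (e.g., an S3 wrapper) to take samples")
}
```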
B
So
clearly,
storage
transfers
are
slow
and
have
very
unpredictable
response
time,
for
example,
for
one
megabyte
transfer
in
google.
This
delay
the
result
in
150
millisecond
median
and
around
six
second
tail
latencies,
which
corresponds
to
more
than
order
of
magnitude
gap
between
the
tail
and
the
median.
B
So the first problem is related to loading a function's initialization state from the storage, which is at the core of the cold-start problem. The second issue related to data movement is the cross-function communication, which has to happen through the storage and which contributes a lot to the median and the tail latency. So, for today's systems, there is huge headroom for improvement on both of these issues, and it's fair to say that today's clouds are bound by data movement. So this is the problem that we targeted first in our research.
C
So the reason, I think, that storage is external is to enable scaling, to keep functions stateless. So did you evaluate actual scaling of functions, not just looking at the latency of execution? Like, what happens when you increase the number of invocations, say, 1000 times: does the storage become a bottleneck there, or does keeping functions stateless actually keep scaling very nice? And also cost, because if you look at it, cost is maybe the most important thing; performance I can buy.
C
You
know
like
dedicated
instances
and
whatnot
and
I
get
very
low
cold
start
and
maybe
dedicated
storage.
But
then
I
will
be
paying
a
lot
and
also
I
cannot
scale
that
much.
B
This
is
an
excellent
question
and
we
actually
did
experiments
with
bursting
vacations
where
we
invoke
the
same
function
like
1000
times
in
a
short
period
of
time,
and
we
actually
see
completely
different
behavior
from
different
providers.
B
The scaling is not like a cold start, because it's rather easy to have a caching layer in front of the storage, but all the providers have different ways to deal with scaling. So this is what we observed. And regarding the cost, this is an excellent question; I'll talk about it a bit later.
B
In
one
of
the
works,
the
cost
of
transmitting
the
femoral
storage
is
comparable
with
the
compute
costs
of
function
execution.
So
we
know
it
is
like
30
to
60
percent
of
the
cost
going
into
the
storage
cost,
and
we
are
talking
about
s3,
which
is
rather
slow,
storage,
so
cost
is
definitely
a
problem.
D
Sorry, I worked on the Google infrastructure; it is fascinating to see this measured from the outside.
B
You
know
we
have
like
this
is
an
overview.
You
know
overview
talk,
so
we
can
go
as
deep
as
you
like.
Actually,
and
we
can
have.
You
know
a
much
deeper
dive
together
with
you
and
your
team.
D
I'm
not
working
at
google
anymore,
but
but
it's
really
interesting
to
see
lambda
actually
has
a
slightly
different
architecture
than
google
cloud.
I
don't
know
what
azure's
internal
architecture
is,
but
I
link
to
a
youtube.
Video
aws
has
a
great
how
lambda
works.
Talk
from
reinvent
2018.
B
Yeah
for
sure
just
to
share
yeah
so,
and
we
also
compared
like
all
three
providers,
so
you
can
actually
see
which
provider
is
stronger,
in
which
sense
so.
E
A
quick
question
I
have
because
you
said
the
communication
happens
through
storage.
I'm
going
to
assume
that
not
all
communications
happens
through
a
storage
like
you
can
just
do
an
hp
call
potentially
to
another
function.
Do
you
measure
those
things.
B
Yeah
we
do
measure
so
the
problem
with
inline
communication
when
you
put
arguments
in
the
http
packet
is
that
it's
limited,
so
you
can
transmit
a
lot
of
data,
so
it's
usually
like
several
megabytes
and
another
problem
is
that
it
really
limits
the
function
execution
model.
So
you
cannot
efficiently
support
scatter
gather
patterns,
for
example,.
B
All
right
cool,
so
stewart
allowed
me
allowed
us
to
pinpoint
and
evaluate
the
data
movement
issues
at
the
component
level.
So
now
we
need
to
dive
deeper
into
this.
So
what
we
did
is
we
tried
to
analyze
the
tools
available
for
service
research
in
like
open
source
academic
environment.
So
what
we
looked
at
is
first,
the
production
servers,
deployment
and
these
systems
feature
complex,
distributed
software
stacks
with
many
proprietary
components
and,
on
the
other
side,
the
two
chains
which
are
available
to
the
academic
researchers.
B
They
are
insufficient.
Often
insufficient
and
academics
often
focus
on
distinct
components
rather
than
complete
systems,
and
also
many
prototypes
rely
on
technologies
like
containers,
while
most
provider
like
at
least
the
biggest
providers,
they
moved
on
to
lightweight
virtualization.
B
So
what
we
had
to
build
is
a
complete
open
source
framework
for
service
research.
The
good
news
is
that
there
are
so
many
companies
which
open
source
their
key
components
and
we
integrated
them
in
a
single
representative
framework
for
serverless
experimentation.
So
we
adopted
k
native
and
kubernetes
as
a
function
as
a
service
programming
model
and
the
orchestration
framework
for
sandboxing
technologies.
B
We
support
firecracker
and
juveniles
and
micro
vms,
along
with
the
vanilla
containers
that
k
native
supports
and
for
life
cycle
of
micro,
vms
and
other
control,
plane
messages
we
support
container
d,
just
like
a
native
and
jpc,
is
used
across
the
whole
stack,
which
comes
with
a
lot
of
advantages,
for
example,
for
metrics
collection
and
profiling.
B
So
what
behind
framework
today
is?
It
is
representative
of
production
clouds,
and
it
includes
only
open
source
reaction
grade
components
and
what
we
are
actively
working
on
and
keep
expanding
is
the
tool
chain
for
holistic
benchmarking
both
end
to
end
and
per
component,
which
includes
a
representative
suit
of
the
workloads
distributed,
tracing
support
and
also
full
system.
Psychoaccurate
simulation
support
in
j5
cpu
simulator.
B
So
we
have
allowed
us
to
innovate
in
three
different
system
subfields,
and
this
talk
will
focus
on
two
works,
only
yeah,
so
the
first
one
is
innovation
in
the
operating
system
and
the
cold
starts,
and
the
second
one
is
actually
the
communication.
B
First
is
the
time
that
the
scheduler
takes
to
select
the
worker
host
to
launch
a
new
instance
and
according
to
aws,
this
is
the
shortest
and
most
predictable
delay.
So
I
actually
exclude
this
from
the
analysis,
even
though
in
key
native,
it's
not
as
it's
not
the
case.
It's
still
quite
long
and
the
second
one
is
the
second
category
is
related
to
loading
the
state
from
the
storage,
and
this
depends
on
the
function
instance.
Isolation,
technology
and
the
third
phase
is
actually
the
the
processing
in
our
experiment.
B
So the snapshot is taken when the MicroVM has a function instance which runs inside and is completely ready to serve invocations. During the restoration, first, the hypervisor loads and restores the state of the virtual machine monitor and the emulated devices. Second, the hypervisor maps the guest memory file into the main memory without populating the memory contents, and this is important. And then the hypervisor resumes the execution of the virtual machine from the point at which the snapshot was taken.
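The second step is the crux: the guest memory file is mapped but not read. A minimal sketch of that step in Go, assuming a Linux host (an illustration of the mechanism, not Firecracker's actual code):

```go
// Lazily map a guest-memory snapshot file: without MAP_POPULATE, no page
// contents are read here; each first touch after the VM resumes triggers a
// page fault that is served from disk.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func mapGuestMemory(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close() // the mapping outlives the file descriptor

	info, err := f.Stat()
	if err != nil {
		return nil, err
	}
	// MAP_PRIVATE gives copy-on-write semantics, so guest writes do not
	// modify the snapshot file on disk.
	return syscall.Mmap(int(f.Fd()), 0, int(info.Size()),
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_PRIVATE)
}

func main() {
	mem, err := mapGuestMemory("guest_mem.snap") // hypothetical snapshot file
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("mapped %d bytes without populating them\n", len(mem))
}
```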
B
On this chart, you can see the measured latency breakdown as pairs of stacked bars for each function type, with the left bars showing the warm invocation latencies and the right bars showing the cold invocation latencies. You can notice that the right bars are much higher than the left bars. In particular, there is a big orange fraction, which is related to establishing a new connection and which is not even visible in the warm case, because it's not there. And second, you can notice that the green part, which is the function processing, increased by an order of magnitude.
B
And
the
deal
is
that
these
functions
are
written
in
python
and
use
quite
a
lot
of
different
functionality
inside
the
guest
printing
system,
for
example,
the
guest
networking
stack
now
we
should
recall
that
firecracker
doesn't
populate
the
guest
memory
with
its
contents
from
the
snapshot
and
instead
of
relies
on
lazy
paging,
which
results
in
a
series
of
page
faults
arising
after
the
function.
It
resumes
its
execution.
B
These
page
faults
are
processed
one
by
one
and
take
a
lot
of
time,
because
many
of
them
require
retrieving
their
contents
from
disk.
That's
why
we
found
that
the
disk
accesses
upon
the
page
faults
dominate
the
whole
cold
start
latency.
We
traced
these
page
folds
and
found
out
that
the
following
the
following
key
observation:
when
functions
execute,
they
touch
almost
the
same
set
of
memory
pages,
which
means
that
the
function
have
stable
working
sets
across
invocation.
B
So
now,
if
you
imagine
that,
like
a
function
which
rotates
an
image,
it
kind
of
makes
sense
that,
whether
it
retains
a
cat
image
or
a
dog
image,
you
will
still
engage
the
same
python
modules.
For
example,
it's
the
same
networking
stack
and
so
on.
So
this
led
to
an
intuitive
solution
to
record
and
prefetch
the
memory
pages
from
this
workings
working
sets
of
for
each
function.
B
So first, the entire working set file is read from the storage in a single I/O operation. Then all these pages are installed eagerly, at once, into the guest memory, and this allows us to avoid the bulk of the page faults, except for the rare accesses to the pages outside of the working set, which are still retrieved from the storage on demand. This way, REAP snapshots accelerate all the cold starts after the first invocation, at the cost of a little extra storage.
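A sketch of that record-and-prefetch step, under the assumption of a packed working-set file whose pages appear in the same order as the recorded offsets (a simplification, not the real vHive/Firecracker implementation):

```go
// REAP-style prefetch sketch: one sequential read of the packed working-set
// file, then an eager copy of every recorded page into the lazily mapped
// guest memory, so those pages never fault.
package reap

import "os"

const pageSize = 4096

// WorkingSet describes the pages recorded during the first invocation.
type WorkingSet struct {
	Offsets []int64 // guest-memory offsets of the recorded pages
	File    string  // packed page contents, in the same order as Offsets
}

// Prefetch installs all recorded pages into guestMem at once.
func (ws *WorkingSet) Prefetch(guestMem []byte) error {
	packed, err := os.ReadFile(ws.File) // the single I/O operation
	if err != nil {
		return err
	}
	for i, off := range ws.Offsets {
		copy(guestMem[off:off+pageSize], packed[int64(i)*pageSize:int64(i+1)*pageSize])
	}
	return nil
}
```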
B
So this plot shows the cold-start latencies of the different functions. Each pair of bars corresponds to a single function type, and in each pair the left bar stands for the vanilla Firecracker snapshots, while the right bar stands for the REAP snapshots. First, one can see that REAP significantly reduces the time of restoring the connection between the function server inside the MicroVM and the rest of the infrastructure, showing the efficiency of prefetching the networking stack and the commonly used and reused code. And second, the function processing fraction is reduced by more than four times on average for all these workloads, and the overall speedups for all these functions are significant as well. So it's a software technique which delivers a multi-fold acceleration, quite impressive, and this all comes with just a small extra file being recorded.
B
So
the
way
it
works
is
that
we
found
that
it's
the
lazy
paging,
which
causes
the
long
series
of
page
faults
and
what
rip
does
it
introduces
selective
eager
pre-folding
by
in
moving
the
working
set
pages
all
inside
the
guest
memory
at
once,
exploring
the
trade-off
of
just
a
little
bit
extra
storage
for
this
working
set
files.
B
To
get
rid
of
the
bulk
of
the
page
faults-
and
this
is
already
supported
in
aws
firecracker-
these
do
support
like
user
page
fault
handling,
which,
however,
they
don't
really
open
source.
The
only
like
one
day,
wirecracker
had
why
the
code
is
available.
B
D
I had a couple questions over in the chat. This was running on vHive, yeah? Were the container images already prefetched locally onto the disk for the timing?
B
Well,
in
this
case,
in
this
case,
it
was
just
loading
the
snapshots
right,
so
the
dev
mappers,
the
shutter,
was
already
full,
so
okay,
this
was.
B
Yeah,
this
is
actually
a
great
question
and
like
in
the
production
system,
this
would
be
another
fraction
as
well.
D
There's
also
some
work
with
container
d
and
crio
to
fetch
docker
images
on
a
sort
of
lazy
basis
on
demand.
Basically,
because
if
you
use
the
default
mechanism,
you
need
to
pull
down
and
unpack
all
the
tars
and
write
them
all
out
to
the
file
system.
Before
you
can
start
your
container
yeah.
B
That's another cold-start hit, yeah, absolutely. So with the guest memory, like, with snapshotting, the devmapper should not be on the critical path, but it can be for, like, you know, parts of the file system which are not memory-mapped, and that can be a big hit, because, like, Docker images are compressed all over the place and across the board. So decompression takes way more time than, you know, mapping and even page faults. So we need to restructure that part.
D
There's the eStargz effort that attempts to restructure that compatibly within the Docker ecosystem, but it sounds like these had already been unpacked into Firecracker's memory-mapped things, yeah.
B
So
this
is
something
that
we're
just
about
to
merge.
Actually,
so
we're
going
to
merge
the
support
for
that,
and
then
we
actually
wanted
to
explore
how
this
extends
to
multi-node
setup
and
that's
very
interesting
as
well,
because
there
is
a
lot
of
cost
radius
performance
trade-offs
all
around
that.
If
you
can
store.
D
…both the memory-mapped things and those prefetch maps in shared storage, you can get a lot of scalability, exactly.
B
Yeah
but
the
most
simple
case
you
just
store
them
together,
basically,
but
obviously
like
we
can
take
a
look
at
how
to
scale
it
better.
E
Yeah, that was my question, about, like: are you moving snapshots across the worker nodes? It's interesting; I'm glad you're thinking about that. One thing I was gonna ask, but then you answered it here: I'm surprised the page faulting seems to be just, like, a Firecracker issue, and, I'm assuming there's a reason, why didn't they introduce the, like, re-hydrating of the guest memory, right? Do you know what that reason is?
B
Well,
it's
actually
it's
not
a
firecracker
specific
issue.
It's
just
related
on.
You
know
the
host
os,
relying
on
lazy
patient
across
the
board
right.
The
way
lazy
patient
is
the
default
policy.
So
if
you.
E
B
Right, yeah, exactly. So if you use gVisor snapshotting, you would probably also have, like, either lazy paging or eager paging, which means moving everything, which is also suboptimal, as we find.
E
Yeah, that's what I thought. I guess, to clarify, what is in the snapshot is probably useful for me to know: are you including the guest memory, or just the, like, just the app, essentially?
B
It's
the
subset
of
guest
memory
pages
and
like
the
like,
the
hypervisor
doesn't
need
to
know
what's
inside,
so
we
actually,
we
are
completely
dumb
and
be
in
the
oblivious
to
what
we
snapshot
in
this
working
set
files.
But
it's
like
it's
gonna,
be
the
modules
and
the
libraries
which
are
touched
upon
the
invocations,
for
example,
not
the
ones
which
are
used
for
which
are
attached
during
boot
time,
for
example,.
D
But
if
you're
using,
for
example,
java
this
wouldn't
capture
the
jitted
code
after
hotspot,
would
it.
B
That's
a
good
question
actually,
so
I
think
this
question
is
how
much
the
applications
of
wolves
right
during
their
lifetime-
and
there
is
nothing
preventing
us
from
recapturing
updating
this
memory
pages
and
so
on,
and
also
if
the
updated
code
resides
in
the
same
exact
guest
memory
pages,
then
it's
going
to
be
reused.
Naturally,.
D
B
That's 100% true, and this is not something that we add to the security problem, like, the security model; this is an inherent problem for snapshots in general, and actually AWS is working on this issue. And there are, like, more issues related to that; there is, for example, ASLR, which is also compromised by default, yep. So there are a bunch of things that need to be fixed, and, as far as I understand, AWS folks already started patching Linux, at least for…
D
…generators. I would love to find a standardized way to do this and to fold that into Knative. But, yeah, ASLR is address space layout randomization.
B
That's
exactly
right.
I
can
just
point
you
to
one
of
the
archive
papers
that
aws
folks
published,
and
they
at
least
enumerate
the
problems
and
the
patches
that
they're
they
were
in
work.
This
was
here
that
would
be
very
interesting
yeah.
Definitely
it
would
be
good
to
actually
somehow
connect
to
the
community
like
to
exchange.
You
know
more
information,
maybe
questions
answered,
and
so
on.
I
see
a
lot
of
like
you,
know,
good
questions
and
reactions.
B
All
right,
so,
let's
talk
about
the
communication
now
so
service
workloads
are
diverse
and
many
of
them
have
a
lot
of
communication.
As
you
know
so,
to
devise
a
solution
for
fast
cross-function
communication,
we
need
to
account
for
special
characteristics
of
service
programming.
Mode
first
functions
are
stateless
and
they
cannot
hold
state
or
share
it
directly
with
other
functions
due
to
scalability.
B
So this leaves functions to communicate only through an external storage in the general case, and there's a fundamental contradiction here, because serverless compute scales purely on demand and it is stateless, but it has to be coupled with this classic, stateful, always-on service, which is storage. And as a result, often the storage people choose, like S3, is slow, because it's not devised for such workloads, and it's also pricey; like, for the workloads we consider, it's up to 70% of the cost.
B
Let's consider the same video analytics service which I presented in the beginning: the first function decodes frames from the video fragments coming from the camera and invokes the second function for object recognition. So in serverless, the deployment of these two functions would look somewhat like this. So first, let's assume that the service has been operating for a while, so there are already active instances for each of these functions, and everything starts with the decoder function producing a frame in which it wants to recognize objects.
B
So now, if you look at the sequence of these actions, it is clear that the real fundamental problem is due to the fact that the decoder function instance, which is the source of the object transfer, doesn't know the destination of this transfer, and that is why an external storage service needs to be on the path.
B
So we introduce a technique called XDT, or expedited data transfers, which enables communication without storage. Again, a decoder function instance produces a frame, and then the user code invokes the second function directly, with the object as an argument, which can be a heavy object. Then the runtime in that decoder function instance keeps this object buffered in the memory of the decoder function, creating an XDT reference.
B
The runtime then replaces the object in the invocation request with the XDT reference and forwards this lightweight packet, with the object stripped, to the scheduler. Like in the baseline case, the scheduler chooses the most appropriate instance of the recognition function and forwards the invocation to the runtime inside that instance, and the runtime there uses the reference to pull the object directly from the source. So this way, XDT achieves direct object-transfer communication, in conjunction with the existing autoscaling and load balancing, which is driven by the scheduler.
B
A few words about the implementation. So the key design principle of XDT is in separating the serverless control and data planes, which we implement in the XDT SDK, which is kind of conceptually similar to the AWS SDK that is used for packaging the functions. So at the transfer source, the SDK offers the same API as the baseline system, an invoke API or a get/put API, transparently extracting all the large objects from the control plane requests. The invocation itself still goes through the scheduler, whereas the heavy objects go through the expedited data plane, and at the destination the SDK reassembles the original invocation with all the objects before passing the control to the user code. So note that the objects stay located inside the memory of the source function instance and add no extra footprint compared to the baseline, and the transparent data plane separation allows cloud providers to choose any protocol.
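A minimal sketch of the source side of that split, with hypothetical types and names (not the actual vHive/XDT SDK): the payload stays buffered at the source, and only a small reference travels through the scheduler, which the destination later uses to pull the payload directly.

```go
// XDT-style control/data plane split, source side.
package xdt

import "sync"

// Ref is the lightweight token that replaces a heavy payload in the
// invocation request (hypothetical format).
type Ref struct {
	SourceAddr string // where the destination pulls the payload from
	ObjectID   string
}

// Buffer keeps outgoing payloads in the source instance's memory.
type Buffer struct {
	mu      sync.Mutex
	objects map[string][]byte
}

// Stash buffers the payload and returns the reference to send through the
// control plane (the scheduler) instead of the payload itself.
func (b *Buffer) Stash(selfAddr, id string, payload []byte) Ref {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.objects == nil {
		b.objects = make(map[string][]byte)
	}
	b.objects[id] = payload
	return Ref{SourceAddr: selfAddr, ObjectID: id}
}

// Take hands the payload to the consumer that dereferenced it and frees the
// buffer slot, so the transfer adds no lasting extra footprint.
func (b *Buffer) Take(id string) ([]byte, bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	p, ok := b.objects[id]
	delete(b.objects, id)
	return p, ok
}
```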
B
We evaluated three workloads, namely video analytics, ensemble model training, and MapReduce, and here I put the latency breakdown for a single request going through each of these workloads as it flows. Each of the workloads, as you can imagine, has several functions, each of which has a compute phase, shown in the orangish colors, and communication phases, shown in the bluish colors. So you can see that, for all the workloads, XDT significantly reduces the communication fraction, accelerating the overall execution for all of these workloads.
B
So,
to
recap:
the
system
design
lessons
here,
the
key
problem,
why
modern
serverless
systems
require
functions
to
communicate.
Certain
external
storage
relates
to
the
fact
that
it
is
the
scanner
that
makes
decisions
for
the
destination
of
the
transfer,
so
exete
delays
the
actual
transfer
until
that
decision
effectively
replacing
eager
storage,
page
transfers
with
lazy
transfer
but
through
a
high
bandwidth,
low
latency
fabric.
B
So
the
main
away
from
this
work
is
that
storage
is
actually
not
required
because
the
data
can
be
buffered
at
the
source
and
quickly
consumed
and
recycled
by
their
destination.
E
Would I want to run this in a serverless fashion, or do I want to run it in, like, a classic, whatever people used before, like Apache Storm, where we have topologies, or Heron is, like, the new version there? Like, is there a functional benefit? I guess, given the use case, if I know I'm going to have a continuous stream of video and I need to do processing, should I use classical approaches, as maybe what I'd call it, versus, like, a serverless approach?
B
Yeah,
this
is
an
excellent
case.
This
is
an
excellent
question
and
the
answer
is
that
in
this
case
it's
we
do
support
complete
skill
to
zero
as
compared
to
classic
approaches.
So
I
just
showed
this
simple.
You
know
warm
everything
case,
but
you
can
imagine
that
cold
starts
can
be
overlapped
with
data
transfer
for
further
efficiency.
B
So
we
deliver
efficiency
similar
to
a
streaming
system.
Wheeze
the
scale
down
capability.
B
But
it's
a
great
noise.
This
is
like
a
common
serverless
systems.
They
are
not,
you
know,
compatible
with
streaming
services,
but
with
xdt,
where
we
decouple
the
data
transfer
from
their
like
transfer
of
references,
xt
references,
then
we
can
steer
xt
reference
through
streaming
engines
like
kafka
and
then
like
scatter
gather,
reconcile
like
join,
and
all
of
it
is
totally
possible.
So
this
is
something
that
we
plan
to
work
on
in
future.
D
I
understand
why
s3
is
an
attractive
target
for
that
kind
of
storage,
but
have
you
looked
at
using
using
less
durable,
but
still
external
storage
services
like
redis
or
aerospike,
or
something
like
that
and
how
that
compares,
because
yeah
s3
does
a
lot
to
make
sure
that
once
you've
handed
the
handed
it
data
before
it
returns,
it's
actually
landed
on
some
disks
somewhere
in
a
way
that
you
won't
ever
lose
it
again.
B
Yeah,
this
is
a
great
question
and
the
answer
is
we
can
exit
you
can
perform
better
than
radis.
We
actually
evaluate.
B
Like
we
have
been
invalidating
with
xct
versus
elastic
cache,
which
that
is
really
is
from
aws
and
we
can
perform
better
because
it's
fewer
network
round
trips
and
we
are
not
bound
by
their
redis
replica
itself,
like
if
you
scale
your
producer
instances,
consumers
to
like
tens,
hundreds
and
so
on
you
can.
You
can
scale
your
bandwidth
together
so
now.
The
question
is:
how
to
you
know,
organize
the
programming
model
so
that
we
can
have
continuous
scaling
of
compute,
and
then
you
get
the
bandwidth
scale
and
automatically
session.
A
Hey, I'll just interject here real quick. Thank you, Dmitrii, so much. I know people are going to have to start leaving, as we've only got about a couple minutes left.
A
And then we will hand it back over to Dmitrii, as well as to Evan and Carlos and any other TOC members or anybody else who has an announcement that they would like to make for today's meeting.
D
And if you think you might be eligible to vote, you can try logging into Elekto, and it will tell you. Also, if you've been doing a lot of stuff with the Knative community and you aren't listed as eligible to vote, which means at least 50 interactions with one of our GitHub projects, let steering know ASAP and they might be able to do an exception, but obviously that would be very last minute, given that voting closes tomorrow.
G
A one-day event happened during KubeCon last week, and the videos are available online already. So if people want to watch those, you can go ahead, and also there are other Knative talks in the main schedule by others.
A
Yeah
I'll
make
sure
that
that
is
included
great.
B
Yeah, we have some more; actually, we had some questions for the Knative, you know, engineers, which can help us in further research.
G
We have some former researchers.
B
Cool
yeah
and
we
would
actually
appreciate
the
connections
to
like
people
who
work
on
this
thing
and
like
many
feedback
over
there.
D
And
you
can
talk
to
me
too,
but
dave
is
the
working
group
lead
and
paul
as
well
isn't
on
his
way
there,
I
think,
on
the
the
serving
side
of
things.
E
A weekly meeting; it alternates. So, for example, I think, like, today we cancelled it because of this meeting; conflicts might shift it around, but it's on the Knative calendar. So there's an alternate, this is where the timing doesn't work out for the EU. So essentially, like, maybe what I'll do is I'll shift the… if you pick a date, I can shift that meeting earlier to accommodate European time.
E
Yeah, because right now it kind of favors, like, one time zone: it favors Japan, which one of our contributors is in, and the other time zones just favor kind of the majority, which is, like, Eastern and West Coast time.
E
If you want to ping me, it's just dprotaso on the Knative Slack. And, like, with the date, I can shift that time to accommodate your team, if you want to have it. So I think, for the majority part, like, a bunch of us are mostly, like, Eastern and West Coast, so maybe, like, a 10 o'clock start is not too extreme for us, or nine, and it might be okay for you.
B
So, yeah, I just want to mention, like, what we're going to work on. What we are working on right now, like, a bit further, is that we are actually looking at the scheduling space, and we are working on a methodology to replay traces from production clouds, like Azure Functions, on, like, small-scale clusters, so that we keep all the characteristics of, you know, cold starts and memory footprints and so on.
B
Another is that we have a benchmark suite that is called vSwarm, with a lot of, like, more than 40 workloads right now, with small functions, like, multi-function workloads and so on. And we also look at simulation of the hardware and plan to extend to emerging clouds, like the edge, but we haven't found Knative-enabled platforms for now; so if there's any feedback from that side, it would be great. And I think there were a few questions that we wanted to ask.
C
Hi, I have one; I was asking it before. You had this diagram with Knative, and then all the other systems that are public cloud serverless, but there are today Knative-API-compatible systems, like Google Cloud Run, and I think there are two versions there, and there is also IBM Code Engine available. So they are all serverless, and you can run, I think, all those workloads on them, but it would be very interesting to find out, you know, how they compare, or whether they do those optimizations.
B
Yeah,
this
is
a
it's
a
great
point,
yeah,
like
obviously,
with
all
these
platforms,
we
have
visibility.
You
wanted
to
like
some
of
the
you
know,
user
level
infrastructure,
some
optimizations,
that
we
do.
They
need
a
further
deeper
visibility
or
modifications
to
like
hypervisor
or
like
q,
proxies,
for
example,
so
xdt
we
basically
implemented
in
two
process
and
the
sdk,
but
some
of
it
is
possible
yeah
good
point.
Thank
you.
D
Why Knative allows container concurrency greater than one: so, Knative evolved out of Google App Engine and also from looking at some other platforms like Cloud Foundry, and in a lot of those they allow you to run a full-fledged application. And, as you noticed, cold start can be a little bit of a pain, and also, if you allow only one request at a time, you have to scale really high in terms of concurrency and also cost, but a lot of the time these functions are doing…
E
And Lambda allows for more than one request at a time, if you configure it, I think. Does it? I think it was something in the…
D
Okay,
so
historically,
aws
lambda
had
a
model
where
the
worker
calls
out
to
the
controller
process
and
it
could
only
have
a
single
basically
ticket
in
flight
at
a
time.
This
allows
you
to
do
some
simplified
programming
models
like
having
globals
and
not
have
to
worry
about
it.
It
sounds
like
maybe
in
the
last
couple
of
years
they
may
have
relaxed
that
somewhat,
but
by
default
they
started
with
one
and
that
gave
a
simpler
programming
model
and
if
you
know,
there's
only
one
request
in
flight
at
a
time.
D
If
you
see
an
outbound
connection,
you
know
it's
associated
with
which
request
is
associated
with,
whereas
if
you
have
more
than
one
container
request
at
a
time-
and
you
see
an
outbound
request-
you
don't
know
which
one
the
request
is
associated
with.
Unless
you
do
some
really
fancy
ugly
stuff
that
google
app
engine
did
early
on,
but
I
don't
recommend
it
to
anyone.
B
D
Potentially, but you can also run into the same problems if you reuse a container for more than one request, and they do that.
D
The next question was about a Knative worker running less than 200 pods, while AWS, their workers will do a thousand or so; it's in that AWS internals talk. So Kubernetes assigns an IP address for each pod, and the default Kubernetes IP address layout restricts you to about 110 pods, which is a kubelet-level flag, max-pods. Theoretically, you could do lots of big workers by changing a bunch of Kubernetes-level flags; in practice,
D
Very
few
people
do
that,
but
it's
due
to
ip
address,
limited
limitations
in
kubernetes
and
yes,
the
kubernetes
scheduler
is
a
bottleneck.
We've
talked
about
what
would
it
look
like
to
change
kubernetes
so
that
you
could
pre-schedule
a
pod
but
not
start
it
yet
and
lease
those
resources
back
to
the
cubelet
for
lower
priority
tasks?
That's
going
to
be
a
major
engineering
effort.
D
My
estimate
would
be
one
and
a
half
to
two
years
for
four
to
five
engineers,
at
least
one
of
whom
is
pretty
senior
and
deep
in
the
kubernetes,
the
kubernetes
ecosystem.
So
yes,
we're
aware.
Yes,
it's
a
blocker,
also
propagating
ip
address.
Decisions
in
kubernetes
also
ends
up
being
part
of
our
slow
path,
because
you
don't
know
what
I
p
address.
D
A
pod
is
going
to
have
until
it's
assigned
to
a
node
and
that
node
cni
has
made
a
decision
about
that
ip
address
and
then
whoever's
upstream,
like
the
activator
that
needs
to
call
needs
to
find
out
what
ip
address
that
pod
decided
on.
So
if
you
could
either
have
the
activator
be
on
the
same
node
so
that
the
latency
for
that
was
near
zero
or
if
you
knew
that
ip
address
in
advance,
which
starts
to
hit
you
up
against
those
ip
addressing
limits.
B
So
yeah,
that's
a
very
good,
very
good
production
points
and
we
are
actually
actively
looking
at
the
scheduling
right
now
and
we
are
trying
to
understand
the
bottlenecks
and
we
are
trying
to
see
which
policy
which
we
should
use
like
from
the
tune.
Theory
perspective
for.
B
F
Yeah, if you come to the working, the serving working group, you'll find all of us at the same time; it's probably the best time to start, and then we can, like, have other meetings started, like, pick it up from there.
B
Yeah,
amazing,
that's
pretty
much
what
I
wanted
to
like
talk
about
and
thanks
a
lot.
I
hope
it
was
interesting
and
we
have
a
lot
of
projects
in
flight
both
in
edinburgh
and
in
the
th,
and
the
beehive
is
getting
a
lot
of
popularity.
So
actually
yeah
join
the
community,
follow
us
and
see
how
we
can
do
knowledge
transfer,
for
example,
both
ways.
D
B
Where to find all these works? Like, in a DM, probably; I need some kind of… you can…
G
Yeah, I think we can close. And thank you, Greg, for hosting this first-time meetup; appreciate it. Yeah, it's my pleasure.