
A

What do you think, should we get started? Yeah, might as well. All right, Chris, did you want to add anything before we get going?

B

No, I just want to make sure you all can hear me. Sorry, I mean... oh yeah.

B

Okay, we're good. Okay.

A

Right, yes, so without further ado. In that case, let's hand over to you, Kris Nova and Michael Ducey, for the Falco incubation review.

C

Okay.

B

I'm going to look at slides on my end, so I'll just let folks know when to bump to the next slide. But yeah, thanks for letting us propose moving to incubation today. What we want to go over is: give folks a quick overview of what Falco is and the problems we're trying to solve, in case you're not familiar with it or in case you want a brief update. Then we're going to go over some metrics that Ducey put together for us and talk about how far we've come since last year,

B

when we did this; talk about folks who are integrating with Falco, that is, building out software that uses Falco, and then of course folks who are using Falco as-is, in just its vanilla form; talk about our roadmap for the upcoming 12 months; and then talk about why we desperately want to move to incubation and why we think we deserve it. Okay, next slide.

B

Okay, so what is Falco? I think the goal that we've agreed upon in our open source community is that we're trying to solve cloud native runtime security. This is security for any cloud native software, and we want to do it at runtime, which is drastically different from some of the other solutions that are available today, as we're doing this essentially as a daemon over time. So we're focusing on Kubernetes intrusion and anomaly detection, and we want to integrate with a wide variety of services for alert collection and correlation.

B

So, a brief history of Falco: we joined the sandbox in October of last year, big shout-out to Brian and Quinton for helping us get into sandbox, and of course the project was started out of Sysdig in May of 2016, so we've been around the block a few times and we've been iterating on the project for a number of years now. Okay, next slide. I think this is Ducey's slide.

B

Michael?

B

He dropped. Well, let's see if we can't get him to rejoin.

C

Okay, sorry about that. Can you hear me?

B

We can hear you.

C

Okay, sorry about that, bad timing, right? It happened right at the beginning of the Falco part, right when it was up. So anyway, let's talk about how Falco works. One of our key things is that we take data from the Linux kernel. We do that either via a kernel module or an eBPF probe, and essentially this is a stream of all the system calls that are going through the hosts that are running the containers, or the nodes in the case of Kubernetes.

C

We also take all of the audit log data from the Kubernetes audit log API, and all of that is sent through these big processing libraries. These are OSS libraries that we borrow from our companion project, sysdig; it's all open source, and these are the main libraries that we borrow from that project. As part of that, we have a rule set that gets applied, and that rule set is basically applied to this event

C

stream that's coming from the orchestrator and from the underlying hosts' kernels, and that basically allows it to check for things like: is a container opening an outbound connection to the internet, is a Node.js container all of a sudden starting to run processes other than node, and things like that. When we detect these suspicious events, we send them off into the alerting engine, and then the alerting engine will forward them to one of those destinations. One of the things that we did do over the course of sandbox was add these destinations. We don't want to get into the business of doing data processing or detection and all those sorts of things; what we want to do is be a generic sensor that focuses on providing a really good data stream, and then that data stream gets processed in some other third-party system. So what we've been trying to focus on is getting the data out of Falco and having generic interfaces to push that data elsewhere.
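For a sense of what those detection rules look like: Falco rules are YAML conditions over fields of the event stream. The following is only a minimal sketch of the Node.js example Ducey described, not one of Falco's shipped rules; it assumes Falco's default macros (`spawned_process`, `container`) are loaded, and the image-matching condition is an illustrative assumption.

```yaml
# Illustrative sketch of a Falco rule, not a shipped rule.
- rule: Unexpected process in Node.js container
  desc: >
    Alert when a container whose image looks like a Node.js image
    starts running a process other than node.
  condition: >
    spawned_process and container
    and container.image.repository contains "node"
    and proc.name != "node"
  output: >
    Unexpected process in Node.js container
    (command=%proc.cmdline container=%container.name image=%container.image.repository)
  priority: WARNING
```

When the condition matches an event in the stream, Falco emits the formatted output to whichever destinations are configured.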

C

This is an example reference architecture; it's something that we actually published ourselves, but we see end users picking it up and using it. We have this example that we built out for Google, where we use Google Pub/Sub and Google Cloud Functions, and then we also have a generic one that's using CNCF projects, so it uses NATS, and then in the case of the serverless functions it uses Kubeless. But basically, what happens here is Falco detects something abnormal, we push it off into the pub/sub service, in this case Amazon SNS, and then you can have a Lambda fire to do different actions. It can enhance the event, which is one of the use cases that the end users we're going to talk about actually do.

C

You can actually have the Lambda take action: kill the offending container, kill the offending pod, isolate it with network policy, other things like that as well, so that you can begin to do your process of incident response, and automate that process of incident response as well.
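To make that response loop concrete, here is a minimal sketch of such a function, assuming the alert arrives as Falco's JSON output via SNS, that the Kubernetes metadata is present in `output_fields`, and that a kubeconfig is bundled with the function. The paths and deployment details are assumptions, not a published Falco integration.

```go
// Hypothetical sketch: a Lambda that receives a Falco alert via SNS
// and deletes the offending pod.
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// falcoAlert models the JSON Falco emits; output_fields carries the
// Kubernetes metadata when kernel events are enriched.
type falcoAlert struct {
	Rule         string                 `json:"rule"`
	Priority     string                 `json:"priority"`
	OutputFields map[string]interface{} `json:"output_fields"`
}

func handler(ctx context.Context, snsEvent events.SNSEvent) error {
	// Assumed: a kubeconfig shipped in the Lambda deployment package.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/var/task/kubeconfig")
	if err != nil {
		return err
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	for _, rec := range snsEvent.Records {
		var alert falcoAlert
		if err := json.Unmarshal([]byte(rec.SNS.Message), &alert); err != nil {
			log.Printf("skipping unparseable alert: %v", err)
			continue
		}
		ns, _ := alert.OutputFields["k8s.ns.name"].(string)
		pod, _ := alert.OutputFields["k8s.pod.name"].(string)
		if ns == "" || pod == "" {
			continue // alert carried no pod metadata
		}
		log.Printf("rule %q fired; deleting pod %s/%s", alert.Rule, ns, pod)
		if err := clientset.CoreV1().Pods(ns).Delete(ctx, pod, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}

func main() { lambda.Start(handler) }
```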

C

Next one.

B

Okay, so on this slide we wanted to focus on the community, as well as folks from the Sysdig side of things, and on how we're structuring the project. The first thing we wanted to do is give a huge shout-out to five particular community members who have stepped up over the past year and have taken complete ownership of small sub-projects and subsystems of the Falco ecosystem. The first one, Thomas, is probably the most active maintainer outside of Sysdig that we see in the community.

B

Leonardo Grasso recently published the Prometheus exporter, which is a way of plugging Falco into Prometheus. We have Luke, Okasha, and Rajeev, all of whom have taken ownership of various projects in the ecosystem. And then on the right side you can see the core Falco team, which is a newly created team: myself, Michael Ducey, and then Leonardo and Lorenzo and Loris Degioanni as well. Also, we wanted to highlight Mark Stemm for all the work he's done over the past few years in bringing Falco to where it is today.

B

Next slide.

C

So we've seen great growth. One thing I want to say is that working with the CNCF has been really great; they've provided a lot of support, and I think the marketing support, even the minimal marketing support that they've given, has helped shine a light on the project and brought in a lot of external people. So I definitely want to speak to the fact that sandbox has helped us tremendously and really has increased our momentum, which is one of the reasons why we want to have the conversation about going to incubation: to keep that momentum going. But you can see some numbers here. One thing I want to point out is we've definitely increased the number of external committers, other people contributing to the project. One area of improvement

C

there, though, is that we need external committers who are working on larger things, and that's one of the reasons why we have those other kind of sub-projects, I would call them, like falcosidekick and the Prometheus exporter and client-go. The hope is we can start pulling more people into the community to work on these kinds of sub-projects; they don't necessarily have to work on the core Falco engine, which is written in C++, which is a little bit of a barrier to entry for some people in the community.

C

Another interesting one that I'll point out is that we're 70% of the way to passing the CII best practices badge. We started to try and get ahead of some of the things that we need to do for incubation, and the CII badge was definitely something that we were looking at trying to make progress on. Next slide.

C

We do maintain our own Slack, and it's shared with the Sysdig OSS projects; we're looking at moving that over into the CNCF Slack.

C

But one thing I like about this slide that I'll just highlight is that there's much more participation on a daily and weekly basis; many more people are active in the channel post-sandbox than there were pre-sandbox. We see that, of course, on our GitHub repositories as well, a lot more activity and a lot more contributions, and we also see it online through Slack involvement. Next slide. And then downloads: we've seen great momentum in the increase in downloads.

C

These numbers are kind of fluctuating as I've been learning how to use Amazon Athena to update those download numbers for the RPM and Debian packages.

C

The growth has been really good in this area. One thing that we wanted to try and figure out if we could highlight in this slide is: are people using Falco more from a container perspective, or are they installing it via Debian or RPM packages? That kind of gives us an idea of the use case, whether people are installing directly on the node or whether they're using Falco in a container environment at all, because one of the challenges that we have right now is we're trying to balance between:

C

do we go full-on Kubernetes and containers, which is kind of our path, while we have some end users who want to use this as a generic host intrusion detection system, not in a Kubernetes environment? So that's one of the things that we're trying to balance all the time. Next slide.

B

So this one is me. A big part of the work that I've been focusing on since recently joining Sysdig two months ago is trying to move our decision-making process completely to open source, and to be as hygienic as possible about how we're calling the shots in Falco and how we're pushing all of the work that we're generating through the open source process, which we are continually iterating over and trying to improve, to make it as friendly and as easy to contribute to as possible.

B

So a big part of this has been making decisions in the open: every decision that we're making, whether it's technical or process-driven, we're recording, we're documenting, and we're being very good about how we're chartering this throughout the ecosystem that we're building.

B

Furthermore, we implemented Prow for all of our repositories, or rather all of our major ones that we're currently actively developing, and this has been instrumental in how we're managing contributions to the project and keeping track of not only our issues but our roadmap as well. And last but not least, we've been working with Chris A. over at the CNCF on how we want to start to migrate about a thousand users from our current Slack channel over to the CNCF official Slack, so we can start contributing even more so in the open.

B

So, just some process and clerical items that we've been focusing on over the past 45 days. Next slide. We're just going to go over the sandbox progress here. Here are some items that we've shipped since sandbox, so this is since a year ago, and Ducey, keep me honest here because you've been much more involved in the process than I have, but I just wanted to highlight a few ones here. Probably the most instrumental one is eBPF support.

B

This is solving the problem for folks who don't necessarily want to load a custom kernel module into the kernel: we're able to pull our kernel events using an eBPF probe instead of having to load a kernel module. So this has been a very exciting part of the work that we're doing upstream.

B

We also hooked up the Kubernetes audit engine to Falco, so we're now able to start enriching what would otherwise be just regular kernel events with Kubernetes metadata, as well as with the Kubernetes audit event stream. Here you can see that we've got a number of other features that we've been working on, and these were all highlighted in the original roadmap that we presented last year. So, exciting progress for the team, and again, just wanted to give a shout-out to everyone who's been a part of making

B

this whole thing happen. Good job to everyone upstream. Next slide.

C

Yeah, and when we went into the sandbox, we laid out a roadmap in our proposal. This is a sampling of that roadmap: here are the things that we promised in that roadmap and the things that we shipped. What I like about this is that it highlights that we have, you know, the engine of the project going: we can define a roadmap, we can ship features from that roadmap, and we're able to have some sort of process that we're following to actually have releases

C

that matter, with useful features for end users. Next slide, and you can just go on to the next slide after that, basically.

B

Great, yep. So here are some integrations that we've been focusing on, both pre-sandbox and post-sandbox. As you can see here, we've grown exponentially, and we've been able to work with various projects in the ecosystem, such as Prometheus and Elasticsearch, as well as Splunk; that's another one that we see commonly with our end users. And I think Ducey is going to go into a little bit more detail on some of the users that we wanted to highlight here.

B

But this just gives you a good overview of where we were versus where we are now, and the progress that we've been able to make over the past 12 months, both the features that we've been shipping as well as the folks who are currently integrating with and using Falco. Yeah.

C

One caveat on these integrations, I put that in the proposal as well: not all integrations are created equal. Some of these are where we've written documentation or blog posts showing you how to get Falco data into one of these tools. Some of these are things where we've actually incorporated code directly into the Falco engine, like containerd and CRI-O.

C

We actually had to make core changes to the Falco code itself, so each one of these is kind of a different integration, and the amount of work and effort that went into each one varied based on the integration. Next slide. One interesting thing that we saw is, so there's one thing to, you know, end users using your tool or your project,

C

but what we have seen is that a couple of companies have actually embedded Falco into products that they're offering. One of them is a company called Ultron, who does IT consulting, and they've created a secure cloud native fabric; they've incorporated Falco to act as runtime compliance and runtime security. They focus on telco workloads as well, so next-generation 5G workloads and things like that, and they actually incorporate a lot of other open source into this platform as well. So, can you go to the next slide,

C

Amy? So they're incorporating things like Clair and Anchore, kube-bench, OpenStack, Istio, and a number of other projects as well, to kind of create this

C

whole secure cloud native fabric, including the container runtime, the cloud native stack, and everything like that as well. So I find it interesting how people are kind of taking the cloud native landscape and building products around it. Next slide. And then another one is Sumo Logic, which is offering up a container intelligence platform. As part of that container intelligence platform,

C

they integrate Prometheus, Fluent Bit, and Falco, and then they have applications with pre-canned dashboards for you, so that you can actually pull all those metrics and data out of your Kubernetes cluster and have a holistic view around monitoring and security as well. Next slide. And so, for end users: end users have been a little bit of a challenge for us, getting people to go on the record, and I think mainly that's just because people don't want to expose their security tools. But of these companies,

C

the interesting thing, the similar theme, is that all of them have compliance challenges, and for those compliance challenges they're using Falco to meet the compliance requirement of having a host intrusion detection system installed in their Kubernetes cluster. A couple of these are healthcare use cases, one of these is government, industrial control is Sight Machine, and then Shopify, of course, it's PCI compliance. And then Frame.io, which we'll talk about here on the next slide, is movie studio

C

compliance, which, I didn't realize movie studios have their own compliance, but apparently they do. So Frame.io is a SaaS-based video review company, and they use Falco as an intrusion detection system, and they have a really interesting use case. I won't walk through the slide because everyone can read it themselves, but what's actually interesting is on the next slide: their architecture.

C

Amy, can you go to the next slide? Thank you. So what they do is they take Falco events and publish them through Amazon CloudWatch Logs, and then that pushes them off into AWS Lambda. And then what they do with AWS Lambda is that the function will actually go and query their environment, their AWS environment, and enhance the Falco event with things like the VPC that the instance was running in and other information as well. And then that Lambda actually forwards the event to several different locations. They put it into Amazon's event-processing pipeline for long-term storage of the raw event and other processing as well, and then eventually it ends up in Elasticsearch, where they can actually see the event in Kibana.

C

They gave a presentation at USENIX about this event stream. In that presentation they don't mention Falco, but it does give you a good idea of the architecture behind that Falco event, how it gets processed and how it gets enriched with better data. Next slide.
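As a rough illustration of that enrichment step, a sketch like the following shows the idea; the `instance_id` field name and the forwarding step are assumptions for illustration, not Frame.io's actual code.

```go
// Hypothetical sketch: a Lambda that enriches a Falco event with the
// VPC of the EC2 instance it came from before forwarding it on.
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func enrich(ctx context.Context, raw json.RawMessage) (map[string]interface{}, error) {
	var event map[string]interface{}
	if err := json.Unmarshal(raw, &event); err != nil {
		return nil, err
	}
	// Assumed field carrying the originating EC2 instance ID.
	instanceID, _ := event["instance_id"].(string)
	if instanceID != "" {
		svc := ec2.New(session.Must(session.NewSession()))
		out, err := svc.DescribeInstancesWithContext(ctx, &ec2.DescribeInstancesInput{
			InstanceIds: []*string{aws.String(instanceID)},
		})
		if err == nil && len(out.Reservations) > 0 && len(out.Reservations[0].Instances) > 0 {
			// Attach the VPC the instance was running in.
			event["vpc_id"] = aws.StringValue(out.Reservations[0].Instances[0].VpcId)
		}
	}
	log.Printf("enriched event: %v", event)
	// ...forward to long-term storage / Elasticsearch here...
	return event, nil
}

func main() { lambda.Start(enrich) }
```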

C

Booz Allen Hamilton is another one of our end users. They basically offer a platform to developers, what they call pipelines as a service, where any developer can go and get a new pipeline, and as part of that they incorporate security best practices into that pipeline. So it's kind of a repeatable process for developers, making sure that they can embed security from the development start into their processes. And then what they do is, as the container is actually running in production,

C

they have Falco rules that are actually watching to make sure that the container is not violating any policy that they put in place earlier in their development cycles. So they check, and then once they actually deploy, they check again by using Falco. They're giving a talk at KubeCon North America this year as well, and Frame.io is also giving a talk with us. And then Shopify: Shopify, of course, is a major retailer, and they use Falco as part of their host and network intrusion detection system.

C

Once again, they forward the events off to something, in this case Splunk, and then they use Splunk to actually go and slice the data and look at what's actually happening in their Kubernetes clusters as well. All of these are in our ADOPTERS.md file.

C

So if you're looking for those references, they're in the ADOPTERS.md file. One thing that I'll point out about the ADOPTERS.md file is that if you ever thought it was hard to get end users to go on the record: probably the easiest thing you can do, and it's such a simple thing, is put that ADOPTERS.md file out there and ask people to commit to it, and funnily enough, they will commit to it. So it's good to see open source community practices working.

C

Next slide, thank you. And I think this is Kris.

B

Okay, yeah, sorry, I was texting someone. Okay, so, talking a little bit about our future roadmap here: this is kind of what we're planning for the next quarter, all the way up until this time next year. We want to look at re-evaluating how we're handling our events coming up from the kernel via the ring buffer.

B

We have our resident C++ expert and PhD, Loris, who's going to be helping spearhead that effort, working on performance improvements and looking at how we're solving dropped events that are coming out of the kernel. We also want to improve the Prometheus exporter. This is written in Go, and it has been monumental in how we're driving contributions to Falco and getting folks involved who aren't necessarily the best C++ engineers; so again, just pushing Falco metrics to Prometheus.

B

Right now we have mutually TLS-encrypted gRPC support for Falco outputs. We want to look at broadening that to building an entire API out for Falco, so that other folks, including folks in the Kubernetes ecosystem, can start vendoring Falco and using it in different ways. Which segues into our next goal: starting to look at playing with ideas about how we can start to secure Kubernetes by default. We're still in the process of sort of coming up with ideas for how we want to start proposing this to the community, but we've been looking at ideas of integrating with kubeadm, kubicorn, and other infrastructure management tools, and so far the folks that we've talked to have been very supportive of this ambition of figuring out a good, sane way to secure Kubernetes by default.
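For illustration, the mutual-TLS side of those gRPC outputs looks roughly like this from a Go client. The certificate paths and address are assumptions, and the actual Falco outputs subscription stub is omitted (it lives in generated code such as the falcosecurity/client-go project); this is a sketch of the transport setup, not Falco's published API.

```go
// Minimal sketch of a mutually-TLS-authenticated gRPC client,
// the transport Falco's gRPC outputs use.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

func main() {
	// Client certificate/key pair presented to Falco (the "mutual" half).
	cert, err := tls.LoadX509KeyPair("client.crt", "client.key")
	if err != nil {
		log.Fatal(err)
	}
	// CA used to verify Falco's server certificate.
	caPEM, err := os.ReadFile("ca.crt")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	creds := credentials.NewTLS(&tls.Config{
		Certificates: []tls.Certificate{cert},
		RootCAs:      pool,
	})
	// 5060 is Falco's default gRPC port; the hostname is an assumption.
	conn, err := grpc.Dial("falco:5060", grpc.WithTransportCredentials(creds))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// ...use the generated Falco outputs stub over conn to subscribe
	// to the alert stream (omitted)...
}
```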

B

Falcoctl, "falco-cuddle", whatever you want to call it: basically the administrative and operational-style management tool for Falco, again written in Go so we can drive more contributions there. We're also looking at building out what we're calling a cloud native security hub, so sort of imagine this as Helm charts, but for Falco rules and policy. How do we start defining what rules and what policy we care about as a security ecosystem? And how do we start sharing and versioning these rules over time?

B

And last but not least, we've been working with folks over on the Aqua side of things on developing what we call RPI, or runtime policy interface. I encourage everyone here to go take a look at that; we would love your feedback. This is effectively a CRD that's going to solve the problem of how we start interfacing with runtime security policy and configuration in Kubernetes at runtime, not at deployment time, which is substantially different from how OPA has approached the problem. Next slide.

C

Hold on just one second, can you go back one slide? I just wanted to call out a couple of things. So, the performance improvements: as part of that, we participated in Google Summer of Code through the CNCF, and the student who participated in that actually went through and wrote some tooling

C

that allows us to actually measure the performance of the Falco engine itself, so that work is going to be very instrumental in helping us drive these performance improvements; we're actually using that tooling. And then also, around the cloud native security hub: we're also starting to imagine this as a generic location for things like OPA security policies, Rego files, and other things like that as well. We talked a little bit about it with the broader community, and I

C

think there are just some things that we need to clean up on our side before we can open this up, but the code is actually posted on GitHub, it's out there, and we want to start trying to develop it in the open.

C

You know, I think that's slide 33, yeah.

B

Thanks, Ducey. So, slide 33: why incubation? Why do we think we deserve incubation, and why do we think we're ready to take it to the next level? I think primarily there's something to be said about keeping up with the momentum and growth of the project. We have a lot of folks interested in adopting Falco, there are a lot of folks currently reviewing Falco, and one of the bits of feedback we've gotten is a reluctance to run it in production until we've graduated to the next stage.

B

So in order to push ourselves and make the software as strong and as battle-tested as we can, we would like to move it to the next stage to keep up the momentum that we've already been developing over the last 12 months. Furthermore, we have real end users who have real compliance requirements and again we want to just continue to focus on promoting the software and make it as secure as possible and as tested as possible, and in order for us to do that, we would like to move to the incubation stage.

B

We have a CNCF case study that we've been working on with Frame.io that we would absolutely love to get published; we've been looking at offering some literature around it as well, and in order for us to do this, we need to be in incubation. Furthermore, we want to start pulling our builds out of Sysdig-managed infrastructure into the open source ecosystem, so that we can manage our builds and our releases as an open source community, and we would love to leverage the CNCF here; again, moving

B

to incubation would help out with this effort dramatically. And last but not least, we want folks to be able to collaborate on the RPI with us as we start to figure out what exactly this means and how we're going to start proposing it to the Kubernetes upstream ecosystem. I think it's going to be helpful to have us in the incubation stage as we look at implementing runtime solutions for folks running it. And finally, there is a link to the proposal that Ducey put together for us,

B

if folks have any questions or would like to see the official TOC proposal that we put together. So yeah, I think that about wraps it up, unless folks have any questions.

F

I had one brief one. You mentioned that your RPI approach was substantially different from OPA's. Could you just very briefly give us an idea of what the key difference is?

B

You kind of broke up at the end, but I think what you were trying to ask is: concretely, what is the difference between RPI and OPA, correct?

F

Yeah, in summary, yes.

B

Yes. So basically, if you look at how OPA is implemented right now, and this goes for Gatekeeper as well, it's every time you mutate an object in the Kubernetes database that action is taken. This is different from what we're calling runtime, which is continual monitoring and auditing throughout the course of an object's life, not just on create, update, delete.

F

Okay. As far as I'm aware, OPA can be used that way as well. I assume there might be a performance difference between the two approaches, but there are people using OPA for runtime enforcement.

B

Interesting, okay.

A

A question, more about RPI than it is about the actual incubation.

A

Ah, Chris has answered the question, which was where we are on the integration; so it's actually built in, it's built into that.

B

Yeah.

C

And I think that's the question: is it sufficient to have a vote called,

A

Pending.

C

that there are no questions.

A

Okay, so we'll need a TOC member to kind of take the lead on reviewing that.

D

I'll, uh, I'll go through it, and you know, we can talk about it amongst ourselves.

E

So did this go through the SIG as well? Are we trying to formulate the, I know it was sandbox already, should those sandbox projects that want to go to incubation first go through the Security SIG?

E

That is a great question.

A

I think we should ask the SIG to take a look and give us their recommendation.

B

Okay, I can take an action item to follow up with the SIG here.

A

Chris has just posted a link which makes me think it's already being done.

A

Okay, yeah: there is an assessment underway.

A

Great. Unless we have any other questions...

F

I would just be curious who did the due diligence, if not the SIG.

E

Is it in that link, Chris?

A

I think that's why we need a TOC member to review what's been put in there, because I think that has been written by folks from Falco, right?

C

Correct, correct.

A

So I think that's what Joe has volunteered himself for. Thank you, Joe.

A

All right, so shall we move on to Vitess?

A

Thank you very much.

H

This is Sugu, can you hear me?

A

Yes. Hi, Sugu.

H

All right. So I'm supposed to be joined by two other people, but you know, I think one of them may not be able to make it. I am Sugu, the co-creator of Vitess. I am joined by Michael Demmer, who is a principal engineer at Slack, and Jon, who is from Square, can't make it, so I will speak to his slides. So, going forward: what is Vitess? Vitess actually has many descriptions. I mean, I

H

think the broadest one is that it's in the NewSQL category. Some people call it a sharding middleware, some people call it an orchestration system. It solves a few problems. The big ones are: one, it solves the scalability problem; it is massively scalable while still giving you a relational interface.

H

It solves the high availability problem, which means that you can generally comfortably run Vitess with five nines of availability. And last but not least is that it is cloud native. The phrase "cloud native" does get used loosely, so I will cover specifically some points about what makes Vitess cloud native. Next slide. So these are some of the stats about Vitess.

H

I think the most significant one is who the adopters are. But before going into that: the thing about something like Vitess, which is a storage system, is that it's actually really difficult software to gain adoption for, mainly because companies that decide to adopt a technology like this are making a really long commitment, like a 5-10 year, or even for-the-rest-of-the-company's-life, kind of commitment, as compared to other software systems that are more easily interchangeable. If you're using an analytics system, you can easily swap one for the other; same with tracing or anything like that.

H

If you are using Datadog, you can say, oh, I want to use SignalFx; those kinds of changes are relatively easy. But changing your core storage system is a much bigger commitment, which means that companies take longer to make the decision to adopt software like this, but once they make the decision, they also stick with it for much longer. Next slide.

H

So in that kind of environment, it's exciting to see some really impressive names of adopters on the Vitess list. And here another point is, the way storage adoption goes is that everybody wants to know if there is somebody else who has used this, so it kind of becomes a chicken-and-egg problem to gain adoption in this area.

Now Vitess has a pretty impressive list of adopters, and across a wide range of deployments. There are people who run on bare metal; there are people who run on public clouds, both AWS and GCP and Azure; there are Kubernetes deployments; and there is actually somebody who is working on a Nomad deployment as well. So Vitess does show that it can run on a large number of platforms, and I'm going to cover a couple of use cases. Yeah, next slide. Let's see if Jon has joined us... I don't see Jon, so I'll speak on his behalf. He gave me permission to say anything on his behalf, so I think I'm allowed to amplify. So, Square has been one of the early adopters of Vitess, and they've been participating in the project for two or three years now.

H

Their Cash App now fully runs on Vitess. They started with one instance, but they've now grown into a large number of shards and a pretty large data set and query volume. While being involved with Vitess, they also have an engineering team that contributes, of which three are actually official Vitess maintainers, which means that they can approve and merge pull requests, and they are also growing their usage within Square.

H

Their existing systems are on bare metal, but all their newer clusters are being deployed on Kubernetes. Next slide. And on the next slide, actually, I'm joined by Michael Demmer, who is going to talk about how Slack is involved with Vitess. Yeah.

G

Thanks, everyone. Like Sugu said, my name is Mike Demmer, I'm one of the engineers here at Slack, and I was the lead on the projects that brought Vitess in as the choice for Slack's database solution. This is kind of a standard slide that we show just to illustrate the growth that Slack has experienced over the last several years.

G

It's been kind of a great experience, but of course growth like this brings a bunch of stress on the infrastructure, and in particular the problem that I was looking at was how to make sure that we had a primary database storage platform that would sustain Slack's current growth and plans for future growth. We were very heavily invested in MySQL as our data storage choice for the entire application.

G

We had a bunch of code written that was expecting MySQL-level semantics, and we were running a kind of homegrown, scale-out, sharded MySQL solution. We wanted to keep a bunch of those primitives in place, a bunch of our operational knowledge of how to run MySQL at scale, but bring in something to help us both manage the instances and implement more flexible, fine-grained sharding, to handle some of our emergent and evolutionary use cases beyond the original model that we set

G

the application out on. Really, 2016 is when I started working at Slack, and then around the middle of 2017 is when we started rolling Vitess out into production. So if you go to the next slide, this is kind of the adoption curve comparing our legacy MySQL solution with Vitess. The axis is deliberately obscured, but this is QPS, roughly aggregated, so really just a measure of the amount of query volume that is going to the two systems.

G

So, as you can see, the aggregate query load goes up over time; that earlier slide indicates why, we're getting more usage from more users. The share on Vitess has been steadily climbing as we've ported more and more application use cases over to it. We're averaging about 35% right now; it's a little choppy, you know, we go through phases of bulk-copying and backfilling jobs

G

that kind of skew some of these metrics, but overall we've been adopting it more and more, and Vitess is really a tier-one service in our reliability and service posture: we are dependent on it, and we've been incredibly happy with its performance, its reliability, and its overall operability.

G

The next slide just has a couple of other key stats. Like I mentioned, we're about 35 percent migrated when it comes to our overall application usage. Peak QPS on Vitess is around 500,000 queries per second; the total is about 10 billion queries per day. And adding the Vitess middleware had a noticeable but non-material impact on overall performance, because we are going through an extra hop between the application servers and the database; there's about an extra millisecond of latency on average.

G

In many cases that's amortized by the finer-grained sharding giving us more predictable performance at the MySQL layer itself. But in any event, those are the key metrics for our deployment. And then the final slide here, just click one more: we've been pretty heavy adopters of the project, both as users but also as contributors.

G

So these are call-outs of PR titles that have been primarily written by people from Slack. I'm not going to go through all of these, but from the very beginning we saw this as a project that would serve a lot of our needs out of the box, but where we had a need to, and then an opportunity to, build upon the platform to suit Slack's needs, and kind of extend the applicability of Vitess beyond some of its original days at YouTube to fit more and more use cases.

G

These have to do with some reliability-related things; there are query planner features that we needed to add; there's a query execution simulator engine that we built; and a bunch of work on the workflows for managing resharding at scale that we've been able to build here internally and then contribute back to the community. So overall, we've found Vitess to be a great platform both to build upon and also to deploy out of the box for our use cases at Slack.

G

So with that, I'll turn it back over to Sugu.

H

Cool, thanks, Demmer. So there's actually a case study that's about to be published by Slack to the CNCF website; that's coming out soon. And there is also an interesting talk that they are going to give at the next KubeCon, where they talk about how they treat their databases as cattle. So that's pretty exciting to hear. Cool. So, there are a couple of Kubernetes workloads that I wanted to highlight because of their significance. The Stitch Labs one is actually the most exciting one.

H

As you know, Kubernetes was released in 2015, and at that time people were barely trying to figure out how to run even stateless workloads. But because of Vitess's background, the fact that it could survive in Borg, Google's cloud; not only did it survive in Borg, it actually was deployed as if it were a stateless application, which means that it knew how to deal with ephemeral movement of instances and loss of the underlying data, and could survive that kind of environment.

H

So we could confidently tell people that you can run Vitess on Kubernetes as if it's a stateless application, and Stitch Labs actually was the first one to try this out; they've been running on Vitess since 2016. Later, HubSpot came, and they said, oh, we are not really interested in the sharding capabilities of Vitess;

H

we just like the fact that you can orchestrate well with it, and they have hundreds of keyspaces. In the meantime, JD.com quietly just deployed thousands of keyspaces and tens of thousands of tablets in their Kubernetes environment, and then they told us that they had done so, which is pretty exciting to hear. And Nozzle is actually, I would say, kind of the poster child of why you should use Kubernetes.

H

They actually deployed Vitess on Azure because they had free credits, and then at some point in time they got a better deal from Google, and they migrated from Azure into GKE completely in one hour.

H

So there is a talk by Derek Perkins called "Gone in 60 Minutes" where he's going to talk about how they did this. Next slide.

H

So while all this is happening, Kelsey has been tweeting about being very, very careful about moving storage to Kubernetes. These are his tweets from actually last week, and he talks about using extreme caution if you want to run stateful workloads or databases in Kubernetes; but at the same time he says you can use orchestration systems, and in that case it is safer to do so. Next slide.

H

So to highlight why, I will talk about the Vitess architecture a little bit. I'll cover this quickly, since we may be running out of time, but the three main ideologies of Vitess are simplicity, loose coupling, and survivability. For simplicity, we said that we should not have too many layers in the system, so this is essentially a two-layer system where the app server connects to the stateless servers, which are VTGates, and then the stateless servers orchestrate and send queries down to the different databases.

H

The loose coupling comes from the fact that all these pieces operate independently of each other, which is the reason why Vitess can scale massively; as far as I know, there are no known limits to Vitess's scalability. And the third one is survivability, which is: if one of the parts goes down, Vitess quickly promotes a new master and continues to operate without interruption. These are the areas where it is difficult to run a storage system inside Kubernetes, because if a pod goes down, the local storage is wiped out;

H

you cannot get access to it, so you need to have a good re-parenting story, which Vitess gives you. And the other one is the ability to inform the application of the repair, which is really hard in a Kubernetes system where everything is treated as one category: a StatefulSet is all one, or a ReplicaSet is all one; it's difficult to single out a single pod in a system like that. So that is the Vitess architecture. Next slide.
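To make the two-layer flow concrete: VTGate speaks the MySQL protocol, so an application connects to it like any MySQL server and VTGate routes each query to the right shard. A minimal sketch in Go, where the address, credentials, and the "commerce" keyspace and `customer` table are assumptions for illustration:

```go
// Minimal sketch: an app talking to Vitess through a VTGate using a
// stock MySQL driver; the app never addresses shards directly.
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql" // standard MySQL driver works against VTGate
)

func main() {
	// vtgate:3306 is wherever the VTGate service is exposed;
	// "commerce" is a hypothetical keyspace.
	db, err := sql.Open("mysql", "user:password@tcp(vtgate:3306)/commerce")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// VTGate consults its routing rules (the VSchema) to send this query
	// to the shard that owns the given customer_id.
	var email string
	if err := db.QueryRow(
		"SELECT email FROM customer WHERE customer_id = ?", 1,
	).Scan(&email); err != nil {
		log.Fatal(err)
	}
	log.Println(email)
}
```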

H

There are some alternatives that people have used who have not chosen Vitess. One is application-managed sharding; at this point it is pretty much recommended that one doesn't do that unless you've already done it before, but that is one option. Other people have just been growing their databases by buying more and more expensive hardware. And there are some newer NewSQL systems, like CockroachDB and TiDB, which are also gaining adoption. Next slide.

H

These are the other CNCF projects that Vitess uses. That's one scary-looking Jaeger out there; next slide, so I put a ribbon on it to make it look less scary. And another name that keeps coming up is Envoy. Typically, if Vitess really scales out into thousands of shards, we may need to bring in Envoy to actually consolidate some connections and spread them out a little. So that's one project that we are looking at possibly adding support for. Next slide. And finally, this is the last slide.

H

The maintainer team is now actually quite diverse. Slack and Square are major contributors, but also Pinterest and HubSpot and Nozzle. Nozzle actually contributed the Helm charts, Pinterest has made many query contributions, and HubSpot has added orchestration-related contributions. And that's it. Any questions?

A

I think my main question would be, seeing that maintainer team is, you know, very encouraging. I would be interested to know, you know, if PlanetScale were to vanish, do you have confidence that Vitess would still have the maintainers and the expertise to keep the project going?

H

That's a good question. Let me let Demmer maybe talk about it, and then maybe I'll add what I think about this.

G

So, addressing the question: I think there is a bunch of institutional knowledge in a handful of people. Like many complex projects, Sugu has a bunch of knowledge about areas of this that, I think, regardless of PlanetScale as a company, sit with just a couple of key individuals. We've learned some over time, but with that kind of tenure and involvement there's a bunch of backstory and history around why things are the way that they are, and that is sometimes not necessarily captured.

G

That said, there are areas of the code that I feel like I know the best, that Rafael, who's on our team, knows the best, so I don't know that that's anything per se around PlanetScale. But it is not an enormous community of developers, and it's also not tiny either. So it's kind of a hard question to answer, because it's a little bit hypothetical, where we're coupling the existence of a company with the continued involvement of a corpus of key individuals.

G

So, you know, that's a sufficiently dodgy answer for being put on the spot, but that's the sentiment. Yeah.

H

I think, to qualify that statement: at this point, I have definitely made a lot of effort to disseminate what I know of Vitess to various people. I think every area of Vitess, at least almost every area of Vitess, has at least two people that can jump in and take care of it. The only one, the last one that is left, would be the query parsing, and actually an engineer from Square is now starting to ramp up on that area.

H

So we are basically striving for a bus factor of greater than one. I don't know if that answers your question, but basically we are focusing on more than one person knowing each area of the software; we haven't really thought about distributing that across companies.

A

I think bus factor is a very good way of putting it. Maybe we should be thinking about process, like defining what we mean by bus factor, but yeah, I think that's an important aspect of making sure the project is mature: making sure that the bus factor is at least greater than one. Do we have other questions out there?

A

That seems like a no, in which case we've managed to get to the end of the presentations with, you know, four minutes to spare. Thank you very much, everyone, for doing that. In the meantime, Sheng has volunteered to help with the due diligence, so that will be the next step.

A

Okay, I think that's it for this week. Thank you very much, everyone.

C

Thank you.

D

Goodbye.

G

Thank you.
From YouTube: CNCF TOC Meeting 2019-10-15
