From YouTube: CNCF Storage SIG Meeting 2021-02-24
A: Good morning, good afternoon, good evening, depending on where you are. We will start the call shortly; we'll just give it a couple more minutes to allow a few more people to join.
A: All right, I think we have a quorum. So the first item on today's agenda is the presentation from the ChubaoFS project (is that the right pronunciation?), which is currently a member of the CNCF as a sandbox project. They've made some amazing progress over the last year, have built up the community, and are now looking to move into the incubation stage. So we look forward to the presentation.
D: Okay, hi everyone, good morning and good evening. This is Shuoran speaking; I'm a co-maintainer of ChubaoFS. Today I'm going to present on behalf of ChubaoFS, hoping to move it from sandbox to incubation.
D: Can you see my desktop? Yep? Okay.
D: As you may recall, ChubaoFS was described as a distributed file system when it was first open sourced and presented to the Storage SIG, when it was trying to get in as a sandbox project.
D: Since then we have added a feature to the project, which is an S3-compatible interface. The usage scenarios expanded tremendously after that, and that is the case in real production.
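As an aside, here is a minimal Go sketch of what using such an S3-compatible interface typically looks like from an application, written against the AWS SDK for Go; the endpoint, credentials, bucket, and key below are placeholders, not values from the talk or from the project's documentation.

```go
// Minimal sketch: write and read an object through an S3-compatible endpoint
// such as the one an object node exposes. Endpoint, credentials, bucket, and
// key are placeholders.
package main

import (
	"bytes"
	"fmt"
	"io"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess, err := session.NewSession(&aws.Config{
		Endpoint:         aws.String("http://objectnode.example:8080"), // placeholder object node address
		Region:           aws.String("us-east-1"),                      // arbitrary; the SDK requires a region
		Credentials:      credentials.NewStaticCredentials("ACCESS_KEY", "SECRET_KEY", ""),
		S3ForcePathStyle: aws.Bool(true), // path-style addressing is typical for self-hosted S3 endpoints
	})
	if err != nil {
		panic(err)
	}
	svc := s3.New(sess)

	// Put a small object into a volume exposed as a bucket.
	_, err = svc.PutObject(&s3.PutObjectInput{
		Bucket: aws.String("my-volume"),
		Key:    aws.String("demo/hello.txt"),
		Body:   bytes.NewReader([]byte("hello from the S3-compatible interface")),
	})
	if err != nil {
		panic(err)
	}

	// Read it back.
	out, err := svc.GetObject(&s3.GetObjectInput{
		Bucket: aws.String("my-volume"),
		Key:    aws.String("demo/hello.txt"),
	})
	if err != nil {
		panic(err)
	}
	defer out.Body.Close()
	data, _ := io.ReadAll(out.Body)
	fmt.Println(string(data))
}
```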
D: It is already beyond the scope of a file system, so we prefer to call it a distributed storage platform now. And what makes it so special for cloud native applications? Here are some challenges we summarized based on our experience serving applications running in Kubernetes clusters.
D: It is impossible for us to deploy a separate ChubaoFS cluster for each customer, so multi-tenancy is a necessary feature for ChubaoFS. Also, storage usage and throughput are hard to predict for a single customer.
D: So this is a requirement for a cloud native storage platform as well. And since we have a lot of customers, the file sizes are diverse, ranging from kilobytes to terabytes, which means we have to support both large and small files.
D: These are the challenges we took into account from the very beginning when designing the ChubaoFS project. To solve those challenges, there are some key features of ChubaoFS listed here, for example optimized resource utilization, multi-tenancy, and scalability for both metadata and data.
D
This
means
that
there
is
no
theoretical
limit
for
the
number
of
containerized
applications
using
the
same
truevivos
cluster.
D: It was first used in production at JD.com in 2018, and it was open sourced in March 2019.
D: The industrial paper based on the ChubaoFS project was published at SIGMOD '19, in July.
D: In order to integrate with more of the cloud native ecosystem, we developed the ChubaoFS CSI plugin and the ChubaoFS Helm chart, and they are released and used in production right now. In December 2019 ChubaoFS joined as a sandbox project, and then in April 2020 we released version 2.0, which supports the S3-compatible interface.
D: Also, there are several external users listed on GitHub. The most recent external user is Meizu, which is a consumer electronics company in China, and I'm going to cover the usage scenarios of the different companies in the later slides. In August 2020 OPPO joined as a key contributing company; as we know, such storage projects require constant investment for someone to be able to make key contributions or key improvements to the project.
D: Then we have a plan for the next release to improve stability, and we are trying to support more big data applications in the future.
D: As you can see from the diagram, there are several components forming the whole system: the resource manager, which manages the resources of the whole cluster and the volumes; the data subsystem, which is where file contents are actually stored; and the metadata subsystem, which is where file metadata is stored. Comparing it to a local file system, the resource manager is where the file system level metadata is stored, and the metadata subsystem is where the per-file metadata is stored.
D: Okay. And to take the requests from applications and users and resolve them into metadata and data requests, we have the FUSE client and the object node, providing the file system interface and the S3-compatible interface respectively.
D: Here's a detailed architecture of ChubaoFS, or what a ChubaoFS cluster looks like.
A: Hey, just a quick question: so the primary way of accessing a volume is through a FUSE client, therefore...
D: Okay, here's what a ChubaoFS cluster looks like. I'm not going to dive into the technical details of this diagram, but I think there are some key points worth mentioning. First of all, the metadata is highly scalable.
D: As I mentioned before, it is very common for the data subsystem to be scalable, but for metadata it requires a careful design for it to be that scalable, and that design is documented in our industrial paper at SIGMOD.
D: Secondly, we use a manager proxy; it just deals with the read requests, because that is where the bottleneck would be: those are high-frequency requests to the resource manager.
D: And thirdly, as you can see, the control plane and the data plane are separated, which means that the normal read or write path does not go through the resource manager or the manager proxy; it only goes through the meta node and the data node. And last but not least, this architecture can withstand high traffic peaks, which is very useful during commercial promotion festivals. Okay.
A: So, just to clarify, and maybe also to make it clear to others: effectively the data part of any file or object goes to the data nodes, and it's partitioned and sharded?
D: Actually, yeah, yes... no, no, not the data partition.
D: I mean, the client gets the data partition information, and in order to access a single file, first it gets the metadata from a meta node; what it gets is a data partition ID, and through this data partition ID the request can be routed to the actual data node. So...
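To make that flow concrete, here is a hypothetical Go sketch of the lookup-then-read path described above; the types and method names are invented for illustration and are not ChubaoFS's actual client API. The point is the split: one metadata round trip to resolve the layout, then data reads addressed by partition ID.

```go
// Hypothetical sketch of the read flow: ask the meta node for the inode's
// extents, then route each read to the data partition/node that holds it.
package readpath

type ExtentKey struct {
	PartitionID uint64 // which data partition holds this extent
	ExtentID    uint64 // extent within that partition
	Offset      uint64 // offset of the extent within the file
	Size        uint32
}

type MetaClient interface {
	// Resolve a path to an inode and the extent keys describing its data layout.
	Lookup(path string) (inode uint64, extents []ExtentKey, err error)
}

type DataClient interface {
	// Read one extent from whichever data node currently serves the partition.
	ReadExtent(partitionID, extentID uint64, buf []byte) (int, error)
}

// readFile shows the control-plane/data-plane split: only the meta node and
// the data nodes are involved; the resource manager is not on this path.
func readFile(mc MetaClient, dc DataClient, path string) ([]byte, error) {
	_, extents, err := mc.Lookup(path) // metadata request: inode + data partition IDs
	if err != nil {
		return nil, err
	}
	var out []byte
	for _, ek := range extents {
		buf := make([]byte, ek.Size)
		n, err := dc.ReadExtent(ek.PartitionID, ek.ExtentID, buf) // data request, routed by partition ID
		if err != nil {
			return nil, err
		}
		out = append(out, buf[:n]...)
	}
	return out, nil
}
```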
F: Yeah, in fact, I have a question regarding that. You say that it's basically the metadata you use to find the data, based on the inode, which is one per file. Does that mean that a data node has to be big enough to hold one whole file, so a file cannot be sharded across different data nodes? Or can it be sharded?
D: It can be sharded across different data nodes, because the minimum data node storage unit is the extent. A large file, for example, can consist of several extents, and those extents can be sharded and distributed across data partitions. And for small files, a single extent may contain several files; that's why I said ChubaoFS is optimized for small files, because small files are aggregated into a single extent. So for small files, several files can be aggregated into a single extent; for large files, one large file can consist of several extents, and those extents can be distributed across data partitions.
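Here is a hypothetical sketch of the extent model just described, assuming an arbitrary extent size; the names are illustrative only, not the project's real data structures.

```go
// Illustration only: a large file maps to several extents spread over data
// partitions, while many small files can share a single extent.
package extents

const extentSize = 128 * 1024 * 1024 // assumed extent size for the sketch; the real value may differ

// Extent is the minimum storage unit on a data node.
type Extent struct {
	PartitionID uint64
	ExtentID    uint64
}

// Piece records where a slice of a file's bytes lives inside an extent.
type Piece struct {
	Ext         Extent
	OffsetInExt uint64 // where this file's bytes start inside the extent
	Length      uint64
}

// FileLayout records which extents hold a file's bytes.
type FileLayout struct {
	Inode  uint64
	Pieces []Piece
}

// layoutLargeFile splits a large file across several extents, each of which
// can live on a different data partition.
func layoutLargeFile(inode, size uint64, pick func() Extent) FileLayout {
	fl := FileLayout{Inode: inode}
	for off := uint64(0); off < size; off += extentSize {
		l := size - off
		if l > extentSize {
			l = extentSize
		}
		fl.Pieces = append(fl.Pieces, Piece{Ext: pick(), OffsetInExt: 0, Length: l})
	}
	return fl
}

// appendSmallFile aggregates a small file into an already-open shared extent,
// which is why many tiny files do not each consume a whole extent.
func appendSmallFile(inode, size uint64, shared Extent, used *uint64) FileLayout {
	p := Piece{Ext: shared, OffsetInExt: *used, Length: size}
	*used += size
	return FileLayout{Inode: inode, Pieces: []Piece{p}}
}
```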
F: Okay, so what's the algorithm or strategy you use to distribute the data, or to decide where to store those files? Because, as you know, Ceph has consistent hashing and so on; so what do we have here?
D: Actually, a data partition can be marked as writable or read-only, so the client will pick a writable data partition, write the actual data to that specific data partition, and then update the metadata on the meta node.
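A hedged Go sketch of that write flow follows: pick a partition currently marked writable, write the extent to the data node, then record the resulting extent key on the meta node. The interfaces are invented for illustration and are not the project's actual API.

```go
// Hypothetical sketch of the client write path just described.
package writepath

import "errors"

type PartitionState int

const (
	ReadOnly PartitionState = iota
	Writable
)

type DataPartition struct {
	ID    uint64
	State PartitionState
}

type DataClient interface {
	WriteExtent(partitionID uint64, data []byte) (extentID uint64, err error)
}

type MetaClient interface {
	AppendExtentKey(inode, partitionID, extentID, length uint64) error
}

// pickWritable returns any partition currently accepting writes.
func pickWritable(parts []DataPartition) (DataPartition, error) {
	for _, p := range parts {
		if p.State == Writable {
			return p, nil
		}
	}
	return DataPartition{}, errors.New("no writable data partition available")
}

func write(inode uint64, data []byte, parts []DataPartition, dc DataClient, mc MetaClient) error {
	dp, err := pickWritable(parts)
	if err != nil {
		return err
	}
	extentID, err := dc.WriteExtent(dp.ID, data) // data plane: goes straight to the data node
	if err != nil {
		return err
	}
	// Control-plane update: record where the bytes landed so later reads can find them.
	return mc.AppendExtentKey(inode, dp.ID, extentID, uint64(len(data)))
}
```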
D: These statistics are extracted from devstats.cncf.io. The data compares the period before joining the sandbox with the period since joining: before joining the sandbox the duration is nearly nine months, and since joining the sandbox it's about 14 or 15 months.
D: These statistics are from a development perspective. As you can see, our commits increased by about 300 percent, code committers doubled, and we have had a boost in pull requests. I think the most important thing here is that we now have two companies constantly investing in this project: JD.com and OPPO.
D: Okay, I'm going to briefly cover some user adoptions at different companies. First of all, of course, JD.com, which is the top e-commerce company in China and still has the largest ChubaoFS cluster in production.
D: And then OPPO: there are several usage scenarios at OPPO right now; some of them are already in production and some are under development. For example, ChubaoFS serves as the back-end storage in the AI platform.
D: For example, we are trying to use ChubaoFS as the back-end storage in a data lake architecture; this is under development. We are also trying to use it as remote Spark shuffle storage, via a plug-in, which is also under development. Once these usages are in production, we plan to open source them on our GitHub. And then Meizu, which is a consumer electronics company in China.
D: They don't have a development team, but ChubaoFS is used in production at Meizu. Several business customers use ChubaoFS as back-end storage, such as the ad algorithm platform, databases, the push service, risk control, and cloud backup.
A: Can I ask: are these therefore geared towards read-intensive workloads, where the client can do lots of caching, or are some of the workloads also geared towards lots of write activity? The reason I'm asking is that there are some obvious gotchas when you're doing intensive writes with FUSE file systems, for example, and I wondered whether there was anything specific in the project to deal with that.
D: Well, actually, we have to find a balance point between POSIX file system semantics and performance, because, as we all know, strict POSIX file system semantics are not very suitable for distributed storage. So we have to make some compromises on the POSIX semantics to balance performance, but the principle is that we have to fulfill the applications' requirements. Such semantic compromises include things like when to cache the data and when not to cache the data.
D
We
have
to
do
some
balance
and
the
principle
is
that
we
have
to
fulfill
applications
requirements.
So
as
long
as
we
can
support
the
customer
applications,
we
are,
we
can
release
the
project's
semantics.
A: Okay, understood. Rob wrote a question in the chat asking if you would be able to maybe cover what motivated OPPO and JD.com to undertake the development of ChubaoFS.
D: I'm sorry, the development... I'm sorry?
C: If you don't mind me just attempting to recast the question: how should we think about ChubaoFS, and apologies if I've screwed up the pronunciation, versus other file system options that may predate it? Are there some fundamental limitations in those that could not be adapted for these use cases?
D: Well, actually, the main advantage of ChubaoFS compared to other distributed storage is that it supports small files very well, both capacity-wise and performance-wise. I think it's very hard for distributed storage to support small files, so this is a big advantage. As for large files, I think everyone has similar performance; but for small files ChubaoFS is highly scalable and is optimized specifically for them. Of course there are some scenarios that ChubaoFS cannot support, for example running MySQL directly on top of ChubaoFS; that scenario cannot be supported right now. But a MySQL history table, which involves a lot of read requests but not many write requests, is a usage scenario ChubaoFS can support.
D: Okay, so what I mean is that ChubaoFS can support most use cases, but there are certainly some database scenarios it cannot, and that's what we are going to cover in the next development phase. Does that answer your question?
C: I think it gives me some direction and a better understanding of it; I'll have to pore through your docs a little bit more, but I appreciate it. I think there was a sort of follow-up question, maybe in the same vein.
D: Actually, we have performance statistics in the paper comparing against CephFS. I didn't bring the graphs here, but they are in the paper, and you can find it on GitHub.
F: Sorry, just one more question. You mentioned that databases cannot be supported, or that MySQL cannot be supported; what exactly is the reason for that?
D: Actually, MySQL's write pattern is direct IO, especially for the InnoDB storage engine; InnoDB uses direct IO in its write path, and for direct IO there is really nothing a file system can do, because the semantics require the IO to be sent to the server. So there is really no room for the file system to do optimizations for direct IO write requests. What we want to do to support direct IO is to reduce the IO latency, which requires...
F: Yeah, okay, there's no room for... yeah, yeah. It basically becomes a performance issue, right? Because they're using direct IO, they're asking the writes to be persistent, so crash-consistent. So based on your current design, can you just simply send the content to the back end and persist it there, since you already have a distributed file system? The performance just isn't that good, but in theory, if you bypass your cache layer, you should still be able to do it; or is there some reason it's...
D: Yeah, we are trying to cover this usage scenario. I mean, direct IO semantics require the IO to be sent to the server, but if we can reduce the network latency between the client and the server, then we may be able to support this scenario.
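A small sketch of the trade-off being discussed, assuming a hypothetical client with a write-back cache: buffered writes can be acknowledged locally, but a direct IO write has to reach the server before returning, so the only remaining lever is lowering the client-to-server latency. The names are illustrative only.

```go
// Hypothetical illustration of why direct IO defeats client-side write caching.
package directio

type DataClient interface {
	WriteExtent(partitionID uint64, data []byte) (extentID uint64, err error)
}

type writeCache struct{ pending [][]byte }

func (c *writeCache) buffer(data []byte) { c.pending = append(c.pending, data) }

// write acknowledges buffered writes immediately, but for direct IO it must
// wait for the round trip to the data node; every bit of network latency is
// paid on the application's critical path.
func write(dc DataClient, cache *writeCache, partitionID uint64, data []byte, direct bool) error {
	if !direct {
		cache.buffer(data) // flushed to the data node later, off the critical path
		return nil
	}
	_, err := dc.WriteExtent(partitionID, data) // must hit the server before returning
	return err
}
```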
D: Actually, we do have a strategy for selecting data nodes: if a client knows that an available data node is on the same compute node as the client, then it will choose that data node with high priority. We do have such a strategy.
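Here is a minimal sketch of such a locality-aware selection strategy, with invented types; it simply prefers an available data node on the client's own host and otherwise falls back to any available node.

```go
// Illustration only: prefer a data node co-located with the client.
package placement

type DataNode struct {
	Host      string
	Available bool
}

// chooseNode prefers a local, available data node and falls back to the first
// available remote one. The second return value reports whether any node fit.
func chooseNode(clientHost string, candidates []DataNode) (DataNode, bool) {
	var fallback *DataNode
	for i := range candidates {
		n := candidates[i]
		if !n.Available {
			continue
		}
		if n.Host == clientHost {
			return n, true // same compute node as the client: highest priority
		}
		if fallback == nil {
			fallback = &candidates[i]
		}
	}
	if fallback != nil {
		return *fallback, true
	}
	return DataNode{}, false
}
```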
E: And then you don't shard, then you...
E: No, no, no, I'm not saying don't replicate, I'm saying don't shard; then your reads can really benefit from that.
D: Okay, you're all welcome to discuss the technical details offline and to open issues on GitHub. You're all very, very welcome.
D: Okay, so here's the future plan. I'm going to cover these plans from a community perspective and a technical perspective. From the community perspective, the objectives are to attract more companies to contribute to the project, and we are going to make it easy and stable to use.
D: So what we are going to do is open a series of technical lectures that do some source code analysis, so that people can become familiar with ChubaoFS more quickly. We can also provide internships for college students, which is what we are going to do this summer, and we can develop tools to simplify deployment and cluster operations.
D: And for the technical plan, we are planning to integrate with the CNCF ecosystem.
D: It already relies heavily on Prometheus, and we are going to integrate with Rook; this is in progress. Actually, we have proposed a pull request to the Rook community and are waiting for feedback. As you can see, we are also trying to integrate with the big data ecosystem, for example using ChubaoFS as back-end storage for a data lake and developing a remote Spark shuffle service. And the next biggest feature we are planning is cross-zone optimization, to improve robustness.
A: I would just like to ask a couple of questions around how somebody would run this in production today. Until the Rook changes are committed, how would somebody deploy and manage this across a number of nodes or a number of servers? What's the current best practice?
D: Well, actually, there is no hardware limit to deploying ChubaoFS. For best practice, I can give some suggestions: the resource manager can be deployed separately, and the meta node and the data node can be deployed together on the same machines, because a meta node mostly consumes the memory of a single node and a data node mostly consumes its storage. So they can be deployed in that hybrid fashion.
A: You know, I understood, but I guess what I'm trying to get at is: is there some sort of process to automate that deployment, perhaps across a cluster, for example in a Kubernetes cluster, or orchestrated in some way? Or is this something that you install on a server-by-server basis?
D: Well, actually, ChubaoFS can serve as on-premise storage for a Kubernetes platform, but if you are planning to deploy ChubaoFS inside Kubernetes, orchestrated by Kubernetes, I think that, first of all, the data node and the meta node cannot be migrated between different nodes.
A: No, I think I understand the concept of the data node and the meta partition. I guess what I'm trying to understand is: in the typical use cases, where JD or OPPO or some of the other companies are using it in production, are these typically deployed on some sort of bare metal nodes, or VMs, or something? Is it...
D: Well, most of the clusters are deployed on bare metal nodes, and the cluster serves as an on-premise storage platform for the applications running in the Kubernetes cluster.
D: Yeah, but since we have the ChubaoFS Helm chart, it can also be deployed and orchestrated by a Kubernetes cluster.
A: Were there any other questions for Shuoran?
A: Thank you, thank you, and thanks to the whole project team; this was a great presentation, and I think we all learned a lot about ChubaoFS. We also had another agenda item, to go through the cloud native DR discussion with Rafaela, but given that we only have a few minutes left in the hour, I'm going to propose that we move that to the next call. Hopefully that's okay with you, Rafaela.
A: Indeed, thanks everyone. Well, in that case, we'll give everybody a few minutes back and we'll close the call, unless there is anything else that anybody wants to raise.
D: Thanks, have a nice day.