
A

Oh.

B

Great.

A

um

C

Sorry, I was on mute. Good morning, or good...

B

Afternoon.

C

Or good evening, depending on where you are.

C

We'll wait for a few more minutes. um There should be a few more people joining.

C

I see the Alibaba team is on. Are you going to be presenting the Vineyard

C

...project?

A

Okay, thanks Alex. Yes, we are. So shall we wait a few minutes, or just get started?

C

Let's maybe wait a couple of minutes; a few more people should be joining us soon.

B

Thank you.

C

All right, it's nearly five past. Why don't we begin? Hi Aaron.

A

Hi Alex. Okay, let me get started and share the screen first. Can everyone see my screen? Yep?

A

Okay, hello everyone. My name is Wenyuan Yu and I am from the Alibaba DAMO Academy, and a few of my colleagues are also online with me today.

A

Today I'm going to present our recent work called Vineyard. It is a distributed in-memory immutable data manager, and we plan to donate it to CNCF as a sandbox project. We would like to hear feedback from the community, especially from the SIG Storage community, on our project. Feel free to interrupt if you've got any questions.

A

Okay, the first question is why: why do we need yet another data store? The problem is as follows. PyData is the de facto standard for data analysis: people build all kinds of data applications in Python, and these normally involve multiple libraries or projects from the PyData ecosystem for different kinds of work. For example, if we want to do visualization, we use Matplotlib; to analyze data frames,

A

we use pandas; if we want to do numeric calculations, we use NumPy; and for machine learning we use PyTorch or TensorFlow. All those libraries work together very nicely, because sharing data, especially intermediate results, is very efficient between the systems or libraries.

A

Here is an example: we want to pass an array from NumPy to PyTorch. PyTorch understands the data structure of NumPy and all of its metadata, and for the payload part, the actual array, it just shares the same buffer by passing the pointer and its length between the two libraries.

A

It's very easy, and in this case, if you change the tensor at position zero to minus one, the array on the NumPy side also changes, because they share the same memory.
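For illustration, here is a minimal sketch of that zero-copy handoff using torch.from_numpy (the values are made up; the point is that both objects view the same buffer):

```python
import numpy as np
import torch

arr = np.arange(5, dtype=np.float32)

# torch.from_numpy() wraps the existing buffer instead of copying it,
# so the tensor and the array share the same memory.
t = torch.from_numpy(arr)

t[0] = -1.0        # mutate through the tensor...
print(arr[0])      # ...and the NumPy array sees -1.0, because memory is shared
```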

A

So in this way, sharing data between those two libraries involves zero copy and essentially zero cost. But what if, for some reason, we cannot do that?

A

We want to access the same piece of data, but we cannot do it in the same process, or we may need to do multi-process processing on the same data. It is not as easy as the first example, but it's still possible with Plasma from Apache Arrow. Plasma is basically a local object store using shared memory, and it comes with a Python client where people can get an object.

A

Basically, that object's memory is just mapped into the process, and you can access the data through that memory. But the metadata part is not as straightforward as in the first example, because the object store only manages a contiguous section of memory for each object: everything must be stored in one contiguous region.

A

For the metadata part, you either handle it yourself (metadata typically does not take much space, so you can carry it along separately), or you serialize the metadata together with the payload and put them into Plasma as a single object, just like what Apache Arrow does.
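For reference, a rough sketch of that Plasma workflow, assuming a Plasma store is already running on the socket path shown (Plasma shipped with older pyarrow releases and has since been deprecated upstream):

```python
# Start a store first, e.g.:  plasma_store -m 1000000000 -s /tmp/plasma
import numpy as np
import pyarrow.plasma as plasma

client = plasma.connect("/tmp/plasma")       # maps the store's shared memory

# put() serializes the object's metadata together with its payload into a
# single Plasma object (the second approach described above) and returns an ID.
object_id = client.put(np.arange(1_000_000))

# A second process connected to the same socket can fetch it; the payload is
# memory-mapped into that process rather than copied.
shared = client.get(object_id)
print(shared[:5])
```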

A

Those are the two ways you can solve the problem of sharing metadata, and you can still share the payload part with zero copy using Plasma between different processes or runtimes on a single machine. But what if we want to handle an even bigger application, where the data itself cannot fit on a single machine, and we want to run many different tasks or workloads on the same piece of data, or on the results of other workloads?

A

What can we do in this case? We want to leverage Kubernetes, and also the project we have been building, Vineyard.

A

Okay, let's look at real-life big data applications, like what we just did in Python. Real-life big data applications are actually very complex; they involve many different tasks. For example, starting from the raw data, say logs, you need to do some ETL, the joins and the transformations, and then we may want to feed that data into a graph system to do community detection.

A

For example, we run label propagation, and then we want to feed that data to a deep learning system like TensorFlow or PyTorch to fine-tune, to learn some patterns or models. Finally, we want to do some visualization of, say, the classification of the graph, some inspection of whether the classification makes sense.

A

If that's the pipeline, you see that for each workload there is a dedicated system, and between the systems we need to shuffle the data through a distributed file system. There is no zero-copy sharing of the data anymore.

A

Here's the observation: a big data application involves many systems, and the systems share intermediate data via an external file system. And this kind of workload is often organized as a chain or DAG, where each individual task requires the results produced by the previous tasks.

A

Here are the problems; there are three of them. First, building production-ready systems like Hive, TensorFlow, Spark, or PyTorch is very hard. Why? Because we need to consider different kinds of distributed file systems, and we need to consider which file format to use: do we use CSV for tables, or ORC, or Parquet?

A

There are so many file formats, not to mention that for graphs there's no standard way to store a graph at all, so we need to dump them as tables, but those tables may lose a lot of information and be inefficient. And the systems end up coupled with this kind of input and output handling.

A

Second, sharing this data through an external system of course involves huge IO costs, and sometimes those costs are unnecessary. Finally, if we want to optimize those tasks as a whole, for example pipelining these jobs, that is very challenging. So that's the motivation: we want to build a system to solve those problems. We want to make big data systems easy to build, and we want to reduce these kinds of IO costs in the workflows.

A

And finally, we want to open the opportunity for cross-task optimizations. That's the quest for Vineyard. So what is Vineyard? Vineyard is a distributed in-memory object store for immutable data, and it supports zero-copy in-memory data sharing between different systems. First, it comes with out-of-the-box high-level data abstractions for developing big data applications.

A

For example, we have tensors, data frames, distributed graphs, scalars, and also common data structures like arrays, hash tables, etc.

A

Those kinds of data structures come out of the box, and they can be mapped into memory just like native objects, like C++ objects.

A

You can use that data and do local data access just like with native objects. And finally, we provide drivers for data partitioning, IO, checkpointing, migration, etc.
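As a rough sketch of how this looks from Python (the socket path here is just a common default and depends on how vineyardd was started; API names may vary slightly between vineyard versions):

```python
import pandas as pd
import vineyard

# Connect to the local vineyardd over its UNIX domain socket.
client = vineyard.connect('/var/run/vineyard.sock')

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

# put() builds a vineyard object (metadata plus a shared-memory payload) and
# returns its object ID; get() in another process on the same node maps the
# payload back without copying it.
object_id = client.put(df)
shared_df = client.get(object_id)
print(shared_df)
```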

A

That means big data applications do not need to take care of IO: when the computing engine is started, the data is already there in Vineyard for its consumption. The computing engine itself does not need to care whether the data has to be loaded from an external file system, or is coming from another stream, or whatsoever.

A

So Vineyard comes with drivers that can do this kind of task for the applications built on top. Here is the architecture of Vineyard. A Vineyard object consists of the data payload, which consumes most of the memory, and the metadata; the data payload is stored in shared memory.

A

Just like Plasma does, we open a big chunk of shared memory as a pool for storing the payloads, and the metadata in Vineyard is synced through the cluster using etcd. We currently support data frames, graphs, tensors, and many other kinds of objects. The Vineyard daemon instances are accessed via IPC and RPC connections; the data payload can only be accessed over IPC, so over RPC you can only access the metadata.

A

With IPC you can just map the shared memory into your process. Vineyard also comes with many pluggable drivers, which provide certain functionalities for certain data types, for example migration, or IO: loading data from and saving data to an external file system, etc.

A

Here.

C

Could I just ask a couple of clarifying questions there?

C

Is the data replicated across the different instances, or is it sharded, or is it just separate data sets?

A

Currently we mostly shard or partition the data, which is the common case for big data applications, but we also support replication. I'll come to that later, but, for example, if the data is replicated, maybe we have two processes working on the same piece of data to speed things up, or for some

A

reason we want to have a backup of the data in memory, or for some reason we just dump the data to an external file system and then free that data from memory. All of that can be controlled by something called a driver: we can build drivers to do that, and the drivers can work with Kubernetes and with the applications on top to decide that. So, at the very low level,

A

Vineyard itself does not care about the data; it does not really understand what the data means. But because we have the metadata manager and drivers, we can easily plug new kinds of data structures or types into Vineyard. Basically, making sense of the data is more of a client or application concern, not Vineyard's.

A

It's an adjustment in the metadata.

D

Do you have any concerns about leveraging etcd at a certain scale, due to its brittleness, using it to sync the data through the cluster? Have you run into...

A

The performance, you mean? Like performance considerations?

D

Well, not only performance considerations, but just the additional traffic that then flows through etcd that isn't part of its normal functionality at scale. It seems like you could possibly run into issues there.

A

Yeah, we have done tests. Currently we deploy etcd as a standalone cluster alongside the workload, in addition to the etcd required by Kubernetes. It's not a big problem, because we only use etcd for metadata, and we only put metadata into etcd if it is consumed across different kinds of applications. Otherwise we keep that data as local as possible: if that metadata or that object is not required by a remote worker,

A

we try not to expose it. And our data is mostly immutable, so the metadata is mostly static unless you're creating something new, which does not happen too frequently. So it's not a problem in terms of too much traffic or anything like that.

B

Okay, thanks.

A

Here I will show you an example of accessing a global data frame on Vineyard. That means we have a data frame, a big table, and we partition the table into many chunks, and each chunk may be located on one of the Vineyard instances. We have a client running on top of one worker: first we connect to Vineyard through a domain socket, and then we try to get an object with a specific ID.

A

We can get all the chunks of that data frame, and we can check whether a chunk is local or not. We can get a local chunk and then just use it like a normal pandas chunk, and we can inspect its metadata. Each step involves different components, and basically, if the chunk is local, you can use it just like a normal native object.
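For illustration, a rough sketch of that chunk-access flow with vineyard's Python client; the object ID and the metadata keys used to enumerate partitions are assumptions about the global dataframe's metadata layout, so check the vineyard documentation for the exact schema:

```python
import vineyard

# IPC connection via the local vineyardd domain socket (path is illustrative).
client = vineyard.connect('/var/run/vineyard.sock')

# Metadata of the global dataframe, looked up by its object ID (made-up value).
meta = client.get_meta(vineyard.ObjectID('o000d1b62a0a7a4d8'))

# NOTE: 'partitions_-size' and 'partitions_%d' are assumed member names for
# the per-chunk entries; the real schema may differ between versions.
for i in range(int(meta['partitions_-size'])):
    chunk_meta = meta['partitions_%d' % i]
    if chunk_meta.islocal:                    # only local chunks can be mapped
        chunk = client.get(chunk_meta.id)     # behaves like a pandas DataFrame
        print(chunk.shape)
```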

A

Currently we have some integration with Kubernetes, and we have a vision that with Vineyard support and the abilities of Kubernetes, maybe we can build a new cloud native paradigm for big data tasks. First I'll cover how to deploy Vineyard in Kubernetes and how we leverage the power of Kubernetes, and further ahead, how we use Kubernetes' ability to co-schedule data and the workloads on top of it.

A

Flashing back to the previous pipeline we want to solve: first, we replace the distributed file system with Vineyard, and then

A

we want the different kinds of workloads to share the data by means of CRDs: they find the CRDs they want to access and use Vineyard to map the data into their workers. For example, we change the workload like this: we use a data frame engine called Mars, which already has integration with Vineyard and is built by Alibaba as well, and the graph system called GraphScope is also built on top of Vineyard, so those things can share

A

the data directly through the CRDs. And then we use Kubeflow for the training part.

A

We can still use Vineyard to share, because it talks in the language of tensors, chunks of ndarrays, and we have a Python SDK to easily integrate any Python-based libraries. In this way, the end-to-end big data task is deployed on Kubernetes, and the intermediate results are abstracted as CRDs and live in Vineyard, in memory. And maybe we can have a scheduler to optimize the locality for the next job, and if there

A

is some mismatch, the scheduler, or another job we initiate, can migrate the data for alignment, or repartition the data for alignment.

A

um I.

C

I was just thinking that would be an ideal candidate for a custom controller or a mutating controller, or something, potentially.

A

Oh yes, indeed. So first, how to deploy Vineyard. Actually it's a little bit not straightforward, because Vineyard requires IPC communication between the Vineyard server pods and the application pods for shared memory. Currently we deploy Vineyard as a DaemonSet, which means we need to either use a hostPath for the IPC socket, a UNIX domain socket, or create

A

a separate PersistentVolumeClaim just to put the socket there. We have done the experiments and it works: as long as we can have the domain socket mapped into different containers, we can share the memory in Vineyard. Users can also bundle Vineyard and the workload in the same pod, and then the domain socket can be shared using an emptyDir.

A

As for the deployment of Vineyard, as I just said, it can be deployed as a DaemonSet, and we leverage Helm to install and deploy Vineyard quickly on Kubernetes. Secondly, we expose Vineyard objects as custom resources: basically, if some job requires some kind of data in the form of Vineyard objects, it can look it up through the Kubernetes APIs.

A

Further, we have plans in progress to build a Vineyard operator that will be responsible for the DevOps of Vineyard on a Kubernetes cluster. We want it to be responsible for managing the status of the Vineyard cluster and the CRDs, to provide scale-in and scale-out capability for Vineyard on Kubernetes, and to be responsible for things like data checkpointing and recovery for fault tolerance, etc. We can use the Vineyard operator for all of that. Further ahead,

A

we plan to leverage the scheduler plugin functionality of Kubernetes, and we want to use Kubernetes to orchestrate the data: how the data is partitioned, migrated, or replicated, how the data finds the workload and how the workload finds the data. First, the worker pods describe the required Vineyard objects in their specs, and the scheduler tries to align the worker pods with the required Vineyard objects by retrieving the locations from the CRDs. If the data is misaligned, we can trigger a data migration or replication to ensure that the pods can access the data they require.

A

Here is an example. First we have a job that generates data partitions P1 and P2 on Vineyard instances V1 and V2, and together, maybe,

A

they are parts of a global graph. For the next job we have the opportunity to colocate: maybe we want to place its workers together with P1 and P2, but if that's not possible, we can trigger a migration to satisfy the requirements, so that when the job is launched, the data is there and it can be directly mapped into its processes.

A

Here is the.

C

Sorry, just a couple of questions there. The data migration, that would be happening sort of shared memory to shared memory?

A

Yes, it's handled by a driver. Actually, a driver is a special client, a special client that lives within the Vineyard container.

A

So basically, it's a separate process from the Vineyard daemon, but in the same container. For each node there's one container, and there will also be a driver. We send it a command: you need to migrate this object to this instance, so create a new object here, and maybe remove it there, or just keep it there. It depends on what command it was given.

A

So basically the drivers do very primitive jobs; a driver is a special client or special application, but it can provide certain functionality. For example, for checkpointing, where we want to save a copy of the data to disk, we just dump the data to disk and then we can free it. That's basically the meaning of a driver.

C

Just one other small comment: we had a project present a couple of months back called the Dataset Lifecycle Manager. It was an IBM project which, at least, didn't have the shared memory aspect, but it had put together a process where you could have CRDs that would identify data sets and load them onto particular nodes, specifically for sort of research and potentially big data type use cases. So I'm kind of wondering if maybe there is...

A

Yeah, I think that's a good idea. Actually, we don't currently have the bandwidth to go very deep on the scheduling part; there's too much work to do. I'm just thinking that the model of Vineyard fits very well with these kinds of abilities provided by other CNCF projects, and we are currently looking at another project called Fluid to achieve similar things. But I will definitely also look at the dataset project; I didn't pay attention to the previous meetings.

A

I will definitely check it out, thank you very much. And for the roadmap: currently we have everything built and tested through GitHub Actions, we have various data types supported already, for example arrays, graphs, data frames, etc., and we support several computational engines: PyTorch, Mars, and GraphScope. Except for PyTorch,

A

those two are from Alibaba. We release Docker images on Docker Hub, and we have integrated with Helm for the deployment. Further ahead, we have planned for the Vineyard operator, and we aim to further improve performance, both for those data types and for the basic primitive operations in Vineyard, such as creating or removing objects. We also plan to add more language support, such as Java, Go, Rust, etc.

A

We may also want to look into how to build a storage hierarchy, such as objects in device memory like GPUs, and also slower storage like local SSDs, something like that.

A

We want to see whether we can leverage that, and we want to either build a scheduler plugin or let people integrate with other scheduler frameworks that can handle the data locality problem.

A

As for the status of the project: it is currently hosted on GitHub under the alibaba org, it has 343 stars as of yesterday, we have 33 issues and 113 PRs, and we have six maintainers, currently all from Alibaba. But we welcome any contribution from the community, and we have a clear path for newcomers to become maintainers.

A

Currently it's under the Apache 2.0 license, we have issues, discussions, and PRs coming in, and we have a website to host the documentation. For the community governance, we have a clear path for newcomers to become maintainers. We have opened many good first issues for newcomers, and before becoming a maintainer,

A

a newcomer can submit at least five PRs to Vineyard and then be nominated as a maintainer; we are happy to hold a vote, and a majority is required. It's just the routine stuff, and we very much welcome external maintainers and contributors.

A

Actually, we know from building Vineyard that

A

the expertise of one team or just one vendor is not enough, and the community is the key, so we really welcome external maintainers and contributors. A maintainer should be able to spend at least one fifth of their time on the project. Enhancement decisions are proposed as issues and voted on by the maintainers, and developers are free to self-assign issues. As for the release cycle,

A

we follow semver to cut release packages: a major release every year, a minor release every two months, and a patch version every one or two weeks. We plan to distribute our first major release in April, and we have already released packages as PyPI wheels, Docker images, and Helm charts.

A

Oh, sorry, the question mark is in the wrong place. So: why CNCF?

A

Actually, Vineyard is a very natural fit for cloud native computing: it provides efficient distributed data sharing in cloud native environments, and it can all be orchestrated by Kubernetes. We find that really exciting, and we're already leveraging existing abilities provided by many CNCF projects.

A

Currently it's cloud native: it scales out, it comes with a scheduler plugin, we use Helm to deploy it as a cluster, and we use etcd and CRDs from Kubernetes for the metadata management. And we really want to make this IP vendor neutral, to encourage collaboration and innovation.

A

Actually, it's kind of a foundation for building new big data systems, or for making existing systems better. So I think that's the possibility we see. We can get feedback and contributions from the user community that CNCF engagement brings, and we want to work with the CNCF community. We believe that together we can build the next-generation cloud native paradigm for big data applications.

A

uh That's all for my presentation.

A

Any questions?

C

Thank you, thank you so much. This is a really good presentation. Are there any questions from anybody on the call, perhaps Luis or Jane?

B

No, I think this is great. I learned a lot from it, and one of the things I look forward to is maybe a demo of it. That would be great. I can see the architecture; I would like to see how it kind of...

A

Actually gets used.

B

So.

A

Yeah, maybe we can do it in the next meeting, if you are interested. For a lot of the Kubernetes work, especially the scheduler part, we only have very early proof-of-concept versions, and we haven't got everything linked together yet, so maybe in the next meeting we can do a demo and show how the scheduling works.

B

Yeah. Other than that, I guess we just have to go through the due diligence we normally do for a project and check it out, but I am quite impressed with what you showed.

A

Especially all the memory things.

B

Yeah so.

A

Okay, thank you for liking it.

C

Just to double check: are you planning to apply for sandbox or for incubation level?

A

Actually, I wanted to hear from you on that. We are applying for sandbox; I think it's easier, and our project was only open sourced a few months ago, around October or November, I can't remember. It's pretty new, and we only have maintainers from Alibaba, and we really want to get more external maintainers before we move to the next level. I think that's our plan, but if incubation is within reach now, maybe we...

E

Alex, I believe... I think they... it looks like...

A

They don't...

E

...matter yet.

A

From Alibaba, of course, but yeah, we need, well, other...

E

And.

A

Not just ourselves, other end users, yeah, I understand. So basically, we have two other open source projects built on top of Vineyard that Alibaba has already open sourced, and they have many users. We have a few end users; I'm not sure whether those count as our users as well.

A

I think so, yeah, but we just want to improve it. That's not enough yet.

E

Yeah, it looks pretty decent for a sandbox project, yes.

B

I agree; I'm just concerned about the amount of time a sandbox project has as a lifetime. Is there a limit? This project seems good, but I'm just concerned that it may not collect enough end users or other community members. Is there a lifetime for sandbox, or do we wait until maybe the team has more contributors?

C

Well, I think the project is just at the right stage, because you're close to the 1.0 release, and using sandbox to increase the number of maintainers and to build out the community is perfect. Once you're in sandbox there are reviews every six months or so, but I don't anticipate that that's going to be a problem.

C

You can then make the decision to move to incubation once you're ready.

A

Okay, perfect.

B

Good to know.

A

I think sandbox.

B

Is correct.

C

Brilliant, okay. One thing I will do is share the recording of the presentation and the deck with the TOC at the next call, so that they'll have some background on the project before they go into the next sandbox review and voting, because the TOC have a regular schedule now where they review all of the sandbox applications in one go every month or every two months.

C

I need to double check when the next review is, but I'll find out and let you know as well.

A

Okay, thank you very much, Alex. If there's anything more we can provide to make it more solid, just let us know. Sounds good.

C

Any other questions or comments for the team?

C

All right, I think we're good. We also had another item on the agenda: to continue to review the DR document that Raffaele had shared last time, but unfortunately Raffaele had to drop off, something came up, so he dropped off at about half past. So I would strongly encourage, if you have comments or any other content to feed back on that DR document, to apply them to the document itself, and we'll go through and review the comments in the next SIG meeting.

B

Okay, I'll make sure I review it.

C

Cool, thanks Luis. It's a really good document; it's coming along nicely. Does anybody else have any other items they want to raise or discuss?

C

No? Okay, so we get 12 minutes back. Thanks everyone, have a good rest of your day.

E

Thanks bye.

A

Thanks.

A

You.
From YouTube: CNCF Storage SIG Meeting 2021-01-27