Ceph CephFS Code Walkthrough, 23 Aug 2021

From YouTube: CephFS Code Walkthrough: MDS Locker, Part 1

A

I think we should get started. Okay, well, thanks all for joining this code walkthrough talk. I'll be walking through and discussing the MDS locker.

A

The way I want to do this is to break it up into a series, because the MDS locker is a pretty vast topic with a good amount of complexity, and it's hard to fit everything into a one-hour code walkthrough. So I'll probably split it up into a series of three code walkthrough videos, covering, in incremental fashion, the details of the MDS locker and the code itself.

A

Part one of the series, which is this one, will be about understanding why and how we need locking in the MDS. We'll go through the different data structures involved and the way the MDS uses these structures and the locking structures. We'll do a walkthrough of some of the file operations and see how locking is used in those, and probably discuss and walk through a bit of the locking code itself.

A

The next part of the series will be the actual implementation of the locker, which is pretty complex because, in a sense, the MDS locker is a distributed lock. You can imagine all kinds of complexities that can happen there. So part two will be the implementation of the core locker itself, and part three will be the different types of lock classes. The locking in the MDS is divided into different classes.

A

We have pretty simple ones and really complex ones, and all of these are tied together using state machines. So it's pretty complex, and we'll cover that in the third part.

A

So now, why do we need locking in the MDS, and what is locking in the MDS? We all know why we use locks: we use them to protect state or metadata, to get mutually exclusive access to a part of storage or a part of data. That's what you use a lock for, and the concept is obviously the same here.

A

We protect the state of metadata in the MDS in different data structures such as inodes and dentries, and things like that. So why does the MDS need locking in the first place? We all know that the data managed by the MDS is pretty large. The metadata can be huge, so it's impractical to put everything in memory in one MDS and overload it; that would cause all kinds of scale issues.

A

So what CephFS has is that you can have multiple active MDSs, and the metadata load is shared across these MDSs. There's a concept of dynamic subtree partitioning, where a directory tree is divided into smaller subtrees.

A

This is done by recording the heat of each node in the directory tree. A subtree is divided when the heat for a particular node goes above a threshold value. When the subtree is divided, the node is changed from a single dirfrag to multiple dirfrags.

A

Each fragment is responsible for a part of the original directory, but there will be only one authoritative node for these fragments. That's called the auth MDS, the authoritative MDS for a particular directory node. Now, each MDS can bear the corresponding read and write requests for these nodes after fragmentation.

A

So if a file is very hot, that is, accessed by multiple clients very frequently, the MDS will generate multiple copies of that fragment and distribute them across different MDSs.

A

So, in a sense, you can have a directory tree split because it has a huge number of directory entries. A single fragment will break into multiple fragments, which are called dirfrags, and then each of these fragments can potentially, depending upon the intensity of access and the heat, be replicated, so that other active MDSs now have a copy of this particular directory fragment and can service read requests.

A

Now, since there are multiple clients concurrently reading and writing to files, the MDS defines different locking rules for different file metadata. Just to give an example, the uid of a file is rarely modified concurrently, so a shared read and an exclusive write can be guaranteed for this particular kind of metadata. Things like the stats of a large directory, on the other hand, may need to be updated by multiple clients at the same time.

A

That's because this particular large directory is divided into multiple shards, and different clients read and write to different shards. Shards, in this sense, are the different frags. So these shards must ensure that they can share reads and also achieve simultaneous writes. It's possible that one client is writing to one shard and another client is writing to another one.

A

So the MDS needs to ensure that these shards can share reads and also achieve simultaneous writes, and that the shards' data is aggregated to the authoritative node only when needed. Managing all these client requests for accessing different parts of the metadata requires different locking strategies and locking rules. Think of it this way.

A

We have all written some form of code that involves locking. Say you have a particular data structure that needs to be protected from concurrent access. We typically have one lock that protects its entries; say you have ten fields in that data structure, and the lock protects these ten fields. Now, you could probably optimize this.

A

When you update any of these ten fields, you need to grab the lock, update, and then unlock. That's typically how it's done. To optimize this, you can have different locks covering different fields: lock one covers the first two fields, lock two covers the next two fields, and so on. That's because maybe some of these fields are infrequently updated.

A

Maybe some of the fields are written rarely but read very frequently, so you might have a read-write kind of lock for those fields so that you can have shared reads. For those that are updated frequently, you would have a normal mutex.

A

We've all done these kinds of optimizations, lock breaking and things like that. With the MDS, the concept is more or less the same: for an inode or a dentry, there's not one lock that protects all of the metadata held by these structures.
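The lock-splitting idea described above can be sketched in a few lines. This is only an illustration of the general technique, not MDS code: the `Record` struct, its fields, and its lock names are hypothetical. Rarely written fields get a reader-writer lock so reads can be shared, and the frequently updated counter gets its own plain mutex.

```cpp
#include <cstdint>
#include <mutex>
#include <shared_mutex>

// Hypothetical record; field and lock names are illustrative only.
struct Record {
  // Rarely written, frequently read: guarded by a reader-writer lock.
  mutable std::shared_mutex owner_lock;
  std::uint32_t uid = 0;
  std::uint32_t gid = 0;

  // Frequently updated: guarded by its own plain mutex.
  mutable std::mutex stats_lock;
  std::uint64_t nfiles = 0;

  std::uint32_t get_uid() const {
    std::shared_lock<std::shared_mutex> l(owner_lock);  // many readers at once
    return uid;
  }
  void set_owner(std::uint32_t u, std::uint32_t g) {
    std::unique_lock<std::shared_mutex> l(owner_lock);  // exclusive writer
    uid = u;
    gid = g;
  }
  void add_file() {
    std::lock_guard<std::mutex> l(stats_lock);  // never contends with get_uid()
    ++nfiles;
  }
  std::uint64_t file_count() const {
    std::lock_guard<std::mutex> l(stats_lock);
    return nfiles;
  }
};
```

With this layout, a thread calling `add_file()` in a tight loop does not delay readers calling `get_uid()`, which is the benefit the per-field locks are meant to buy.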

A

So what the MDS does is have different lock types; that is one thing. Then, as I said, the MDS has lock classes, and the different lock classes have different rules for locking. For example, we have something called a simple lock, which is like a base class implementation, and it defines the base rules for the distributed locking.

A

Then we have the local lock. A local lock is only used within an MDS, because it's just used for, say, updating a version or things like that. And then we have something called a scatter lock, which is the complex one, for things where the MDS can delegate some authority to another MDS, say a replica for a frag.

A

That is so the other MDSs can generate capabilities for clients, and that requires updating the state protected by them.

A

So the scatter lock is the most complex one, and we'll see, probably not in this part of the series but in the next part, how the different lock classes are used.

A

Okay, so we discussed lock types and lock classes, and then we have lock states. Lock classes and lock states are very much related, because the lock classes implement state machines and the lock states are just the different states in those state machines. We have a lot of states, around 38, which increases the complexity, but we'll come to that later.

A

So don't worry about that now. We'll start from the beginning, which is the lock types. So, let's go to...

A

Just give me a second.

A

Just give me a second; I'll just rearrange things so that I can see them properly.

A

Is this okay? Is everyone able to see it?

B

Yeah, looks good.

A

Let's see include.

A

Okay, so we have these different lock types.

A

These different lock types are used to lock different state in the metadata. The MDS has different locks covering different portions of the fields in the inodes and the dentries.

A

So we have these defined here. What's interesting is how you find out which metadata fields these lock types cover: you actually have to go into the code, and I have done that myself.

A

Some of these are very evident; some of these are not. There are 13 or 14 of these, I think 13, and all of them protect the version; that is the base one. So let's take CEPH_LOCK_ISNAP: isnap is used to protect the snaps and the ctime.

A

Then, let's see, iauth is for ctime, mode, uid and gid. ilink is for the link count. ixattr is for the xattrs.

A

These are for the versions, and as we will see, these are local locks. ipolicy is for ctime, layout, quota, export pin, and the recently added ephemeral pin types. The interesting ones are ifile and inest. These are the complex ones, because these are the ones that use the scatter lock mechanism.

A

ifile is for ctime, mtime, atime and the dirstat, so the directory statistics information is protected by ifile. inest is for the recursive stats. The MDS has these because, for a particular node, there are a bunch of rstats that are accumulated upward through the tree, and those are protected by inest.

A

idft is for the dirfragtree, that is, it is about the frags themselves.

A

Most of these locks are simple locks or local locks. The version ones are just local locks, and a bunch of the others are just simple locks; that's the lock class I mentioned. ifile and inest are the complex scatter locks that we'll probably discuss later.
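The mapping just described, from lock type to the inode fields it covers, can be written down as a small lookup table. This is a summary sketch of what the talk lists, not code from the Ceph tree: the enum, the field labels, and `locks_covering` are hypothetical names, and the version field (which the talk says every lock also protects) is left out for brevity.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Lock types as named in the talk; this enum is a hypothetical stand-in.
enum class LockType { isnap, iauth, ilink, ixattr, ipolicy, ifile, inest, idft };

// Fields each lock covers, as listed in the talk (labels are illustrative).
inline const std::map<LockType, std::vector<std::string>>& lock_table() {
  static const std::map<LockType, std::vector<std::string>> table = {
      {LockType::isnap, {"snaps", "ctime"}},
      {LockType::iauth, {"ctime", "mode", "uid", "gid"}},
      {LockType::ilink, {"nlink"}},
      {LockType::ixattr, {"ctime", "xattrs"}},
      {LockType::ipolicy, {"ctime", "layout", "quota", "export_pin"}},
      {LockType::ifile, {"ctime", "mtime", "atime", "dirstat"}},
      {LockType::inest, {"rstat"}},
      {LockType::idft, {"dirfragtree"}},
  };
  return table;
}

// Answer "which locks would an operation touching this field need?"
inline std::vector<LockType> locks_covering(const std::string& field) {
  std::vector<LockType> out;
  for (const auto& entry : lock_table()) {
    const auto& fields = entry.second;
    if (std::find(fields.begin(), fields.end(), field) != fields.end())
      out.push_back(entry.first);
  }
  return out;
}
```

For example, `locks_covering("rstat")` yields only `inest`, while a field such as `ctime` shows up under several locks, which matches the way it is repeated in the list above.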

A

So these are the lock types, and when we look at the inode structure in the MDS, we'll see how these locks are defined. Let's do that: CInode.

A

What I'm not showing is the lock classes: simple lock, scatter lock, and so on. If we go into those, we'd have to understand the state machines and the different lock states, and that would unnecessarily complicate things right now and probably be confusing too.

A

So we'll just think of these as lock implementations. We have a Locker class, in the Locker.cc source, and that takes care of how to lock based on which lock class this particular lock is defined with.

A

As I discussed, this is the CInode structure, and there's not one lock that locks the entire metadata or protects all of the state. We have authlock, which protects uid and gid; linklock for the link count; and so forth. filelock is for mtime, ctime, atime and the dirstat; then there's policylock; and nestlock, the complex one, is for the recursive stats. So these are the locks.

A

So we discussed the lock types and then the actual CInode structure, which uses these locks. Similarly, we have the CDentry.

A

Here we only need two: the actual lock and the version lock. The version lock is for incrementing the CDentry versions. The interesting ones are the CInode ones, because those are what is mostly used. The dentry ones are pretty easy; there's no scatter lock or things like that. Basically, either we need an exclusive lock to increment the version, or we just use the simple lock for that particular dentry.

A

Okay, so we covered the lock types and the inode structures. There are other data structures that we should look at to understand the whole flow of the locks better, but we'll do that on demand. There are things like the LockOpVec.

A

That is a vector of locks that you fill in and then ask the MDS to acquire in a certain fashion. We'll do it on demand. That's probably the only other thing. Let me quickly check my notes.

B

Yeah, okay.

A

One thing to note is that this is really old code. Some of these sources were written in 2009 and have practically never been updated or changed. This is one of the most undocumented and most complex parts of the MDS. So, just to quickly cover what I'm doing with all this:

A

As I do this code walkthrough, there's some effort that I want to put into actually documenting a bunch of these things.

A

You probably know by now how I figured out which metadata fields these locks protect: by going into the code.

A

Basically, go to the inode, then go to the encode for, say, iauth, and see which fields this lock protects. So, improve some documentation. And some of this code, especially the Locker class, is probably some of the oldest code, and it doesn't use the newer facilities that C++ provides; it uses all that array indexing and things like that.

A

So there's scope for improvement there, to make it more readable. The other thing is that, once we come to the lock states, they are very poorly named. I won't show you now, but there are names like LOCK_LOCK and LOCK_SYNC, which do not make much sense.

A

They are not locks themselves; they are states that define a rule about who can actually lock: whether the auth can lock or a replica is allowed to lock.

A

Those are pretty confusing, so as part of this we'll probably refactor some of it and make the code more approachable. So, back to the locking.

A

Let's cover some file operations and see how the locker is actually used, and how these different lock types, the auth locks, link locks and policy locks, are used.

A

That will give you a feeling so that, if you're implementing a new file operation for whatever reason, you would know which types of locks to take to implement it. Suppose that file operation only touches xattrs; then you just need the ixattr lock type.

A

So let's do some basic ones, mkdir probably. All of this is tied to path traversal, which is how these operations are implemented. We'll see that you fill in a bunch of locks and then ask the MDS, or the Locker class, to actually acquire them. If the MDS can't acquire a lock, for whatever reason, it puts the request in a queue, and later, when it can actually take the lock, it wakes the request up, acquires the lock, and the request is granted.

A

The other thing about locking is that it's tied in with the whole capabilities thing.

A

For a lock request to be successful, if that particular lock request needs a particular capability, again you're put on a wait queue and the capabilities are revoked. Probably one of the other clients has that cap, so a cap revoke is sent. Once the revoke is done, you're woken up, then you try to acquire that lock again, and then the request is granted. We'll see this whole thing in the state machine.
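The queue-and-retry pattern described here can be modelled with a toy lock that parks a callback when it is busy and runs the parked callbacks on release. This is a minimal sketch under stated assumptions: `ToyXLock` and `acquire_or_wait` are hypothetical names, there is no capability revocation and no distribution across MDSs, and the real locker drives all of this through its state machines.

```cpp
#include <deque>
#include <functional>
#include <utility>

// Toy exclusive lock with a wait queue (hypothetical, not the MDS Locker).
struct ToyXLock {
  bool held = false;
  std::deque<std::function<void()>> waiters;

  // Try to take the lock. If it is busy, park `retry` so the request can be
  // re-driven later, and report failure to the caller.
  bool acquire_or_wait(std::function<void()> retry) {
    if (!held) {
      held = true;
      return true;
    }
    waiters.push_back(std::move(retry));
    return false;
  }

  // Drop the lock and wake every parked request; each one is expected to
  // call acquire_or_wait() again, and may be parked again if it loses.
  void release() {
    held = false;
    std::deque<std::function<void()>> woken;
    woken.swap(waiters);
    for (auto& retry : woken) retry();
  }
};
```

A request handler would pass itself as the `retry` callback, so that being woken simply re-runs the handler from the point where it asks for its locks.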

A

The state machines also tell you what kind of capabilities need to be there for the lock to be granted. Okay, so let's quickly do a very basic one: handle_client_mkdir.

A

This is the Server source, and this particular function is invoked when the client is trying to create a directory. We have these helper functions, like rdlock_path_xlock_dentry. So let's take an example of a client doing a mkdir: it passes a file path, say /a/b/c/d, and then say file0.

A

What the MDS does is take a read lock on each of these path components. So a read lock is taken on the directories from a through d, and then there's an exclusive lock taken on the entry to be created.

A

The reason is that the read lock allows parallel, concurrent reads, so if another client is looking up this particular path, that can go through because it's a read lock. But if another client is trying to modify one of these directory path components, that won't be granted, because somebody already has a read lock on it. So we have this rdlock_path_xlock_dentry.

A

This is the one that actually goes into the locker. In rdlock_path_xlock_dentry, these are all some checks. The interesting part comes here.

A

What we do is define a bunch of flags that instruct the Locker class what to lock and how to lock it, for a particular path, with the client doing a mkdir on it.

A

We want to rdlock each of these path components, and the actual dentry to be created needs to be exclusively locked. We also lock the snapshots for each path component except the last one. We'll see that. If we look at something like a lookup, that doesn't take exclusive locks, because a lookup-style request is mostly just reading a bunch of stuff.

A

There's no write or update involved, so there's no need for a wrlock or an exclusive lock, or things like that.

A

That is very simplistic, but here we need these read locks on the path components and then an exclusive lock on the dentry. This is all tied to the path traversal in the MDS. Path traversal is implemented in the cache, where it breaks the path up into different path components and there's a resolution.

A

So let's see that. Once we have these flags, we call the MDCache path_traverse.

A

Depending on the flags, we assign these boolean fields for what we need to lock. With rdlock_path_xlock_dentry, we want the auth, we need to rdlock the snap, rdlock the path, and xlock the dentry. The locker implementation breaks locking down into three things: rdlock, wrlock and xlock. rdlock is shared read, and wrlock is shared write.

A

wrlock is special, and it's mostly used for the filelock and the nestlock. The filelock is responsible for protecting the statistical information in inode_t, which is the dirstat, and the nestlock is responsible for protecting the recursive statistics, the rstat in inode_t.

A

All this is required since a directory can be divided into multiple shards, as we discussed, and each shard can have multiple copies. So, in order to allow the statistical information on the shards to be modified at the same time, wrlock was introduced.

A

These are scatter locks, the complex ones, which we're not really discussing right now, but that's where wrlock comes into the picture. So we have rdlock, wrlock and xlock: rdlock is shared read, wrlock is shared write, and xlock is exclusive access, so there's no sharing there. It's like a mutex.
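The three modes can be pictured with a deliberately simplified toy model. Everything below is a hypothetical sketch: in the MDS, whether a given rdlock, wrlock or xlock is grantable is decided by the lock's state machine, and it also depends on whether the MDS is auth or a replica. In particular, this sketch assumes that rdlock and wrlock holders never mix, which is a simplification and not a statement about the real rules.

```cpp
// Toy model of the three modes named in the talk (hypothetical).
enum class Mode { rdlock, wrlock, xlock };

struct ToyModeLock {
  int readers = 0;         // current rdlock holders (shared read)
  int writers = 0;         // current wrlock holders (shared write)
  bool exclusive = false;  // true while an xlock is held

  bool try_acquire(Mode m) {
    if (exclusive) return false;  // xlock excludes everyone, like a mutex
    switch (m) {
      case Mode::rdlock:
        if (writers > 0) return false;  // assumption: no rd/wr mixing
        ++readers;
        return true;
      case Mode::wrlock:
        if (readers > 0) return false;  // assumption: no rd/wr mixing
        ++writers;
        return true;
      case Mode::xlock:
        if (readers > 0 || writers > 0) return false;
        exclusive = true;
        return true;
    }
    return false;
  }

  void release(Mode m) {
    switch (m) {
      case Mode::rdlock: --readers; break;
      case Mode::wrlock: --writers; break;
      case Mode::xlock: exclusive = false; break;
    }
  }
};
```

The point to take from it is only the sharing behaviour: several wrlock holders can coexist, for example one per shard updating stats, whereas an xlock holder is always alone.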

A

Okay, come here. The interesting part starts a bit below, but before that, if you have asked to lock the snapshots, you go here. There are a bunch of checks, such as whether the directory is frozen or deleted, and things like that. If you want to lock the snap, we call... okay, I should probably cover this a bit later.

A

The try_rdlock_snap_layout thing might confuse things, so let's do the interesting part, where we actually walk the path components and try to take locks. This is the loop, a big one, that walks each of these components and then tries to take a lock depending on which path component it's currently accessing.

A

So, a bunch of checks; this one is for a snap, not interesting. We get the current directory. Say we are doing /a/b/c and then file0. We are walking each of the path components: a, b, c.

A

So we come to a first, and we get the current directory, the CDir. We have some checks for whether we are auth or not. The locking part starts here. We try to look up that particular path component.

A

So if it's /a, the current directory is the CDir of root, and we try to look up a, that particular file name or directory name, in it. Obviously you need a snap ID if you are traversing a snapshot, but just ignore that for now. If the lookup succeeds, which means the entry exists, we try to lock it.

A

Remember that rdlock_path_xlock_dentry says: I want to rdlock each of these path components and xlock the last directory entry. So once we are here, we check whether we want to xlock the dentry and whether it's the last path component. It is not the last path component; we are just in the first one.

A

So we go to this part and add an rdlock for that particular lock. This is the directory entry's lock, which we saw in CDentry. One lock is used for the version; this is the local lock. The other is used to lock everything else except the version. CDentry is simple because it doesn't have a bunch of locks protecting different fields in the structure.

A

It just has the one lock, so you need the version lock if you are incrementing the version, and for everything else you just use the other lock. So we walk through the entire path, component by component, and add these as rdlocks. Before that, I think I moved a bit ahead: the MDS has this LockOpVec, which is defined, I think, in Locker.h...

A

Let's see. Maybe not; it's in Mutation.

A

Yeah, so LockOpVec is just a vector of locks, of different types, that you fill up and then ask the MDS to acquire in a particular fashion. We have different helpers: add_rdlock if you want a read lock, add_xlock, and add_wrlock.

A

When we do something like an add_rdlock, it basically just puts that particular lock into this vector, marking it according to whether it's a read lock, a write lock or an exclusive lock.

A

So everything that needs to grab a lock defines a LockOpVec. You can see here: define a LockOpVec and then fill it in. While traversing the path components, we fill this LockOpVec with the different types of locks that we want.
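The shape of such a lock vector can be sketched as follows. This is a simplified stand-in written for illustration, not the Ceph definition: `NamedLock`, `ToyLockOp` and `ToyLockOpVec` are hypothetical types, and only the three helper names (`add_rdlock`, `add_wrlock`, `add_xlock`) are taken from the talk.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a lock object embedded in an inode or dentry.
struct NamedLock {
  std::string name;
};

// One requested operation: which lock, and in which mode.
struct ToyLockOp {
  enum Flag : unsigned { RDLOCK = 1, WRLOCK = 2, XLOCK = 4 };
  const NamedLock* lock;
  unsigned flags;
};

// The vector a request fills in before asking for everything at once.
struct ToyLockOpVec {
  std::vector<ToyLockOp> ops;

  void add_rdlock(const NamedLock* l) { ops.push_back({l, ToyLockOp::RDLOCK}); }
  void add_wrlock(const NamedLock* l) { ops.push_back({l, ToyLockOp::WRLOCK}); }
  void add_xlock(const NamedLock* l) { ops.push_back({l, ToyLockOp::XLOCK}); }
};
```

Collecting the requests first and acquiring them in one call lets whoever acquires them decide the order and handle the "cannot get one of them yet" case in a single place.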

A

So there will be... let me look quickly, I think.

A

Hold on. Yeah, so for all the path components, we added an rdlock for the dentries. Once we are in the last path component and we want to xlock the dentry, which is basically done for operations where we want to update something, like creating a file, creating a directory or symlinking, then on the last path component we add a bunch of locks, like the filelock.

A

So why is this filelock required?

A

The filelock and the nestlock we add as shared write locks, wrlocks, and these are on the parent directory. So if you have a, b, c and then file0, the filelock and nestlock wrlocks are for the directory c.

A

Okay. This is done since the dirstat and the rstat of this particular inode, which is the parent inode, will be modified. The dirstat is, if you quickly look, there's that.

B

Oops, hold on. Yeah.

A

Which is like frag_info.

A

Hold on. Hmm, there is this.

A

I think, okay, this is defined somewhere. I can't find it.

A

nest_info_t, basically. I think it's in the inode_t structure. I can't really find it right now, but the rstat is protected by the nestlock and the dirstat by the filelock. You need that because the dirstat is nothing but the number of files and the number of directories under that particular inode, and the nestlock covers the recursive one.

A

So we add these locks since, once we create this particular directory, we are changing the number of files or the number of directories under the parent, so those will be modified, and we add a write lock on them. Since the dirstat is protected by the filelock and the rstat by the nestlock, we add those. Then we add a read lock on the authlock for the parent of the dentry to be created.

A

That's because, while creating this dentry, we need to access the parent's permissions, like the uid, gid and other things, for normal permission checks. So we add a read lock on those. Then we add an xlock on the final path component, the dentry itself. This is required since we'll be creating a new dentry and then filling in its different parts.

A

Those are things like the linkage, which is the actual persistent information for that particular dentry. We can see that too: linkage. Since we need to fill in this linkage, we take the exclusive lock on the dentry to be created, which is here.

A

These are the durable bits, so we take an exclusive lock on that. Then, once we fill all these locks into the vector, we call into acquire_locks, which is the thing that actually goes through the state machine and sees whether you can lock those particular metadata fields or, if they're already locked, puts you in a queue.

A

What happens in most cases is that you'll find this lookup to be successful, except for the last path component, because you're mostly creating that. In that case you'll jump to this part, where the lookup is a miss and you instantiate a null dentry for it. Then you do the same thing again: if you want to exclusively lock the dentry, add the filelock and the nestlock for the parent inode.

A

Then add the authlock for the parent inode for permission checks, and the actual exclusive lock on the dentry itself, and then you call into acquire_locks. I can go into acquire_locks; I'll touch on it very briefly in this talk. But this is how a normal operation works: while implementing a particular file operation, you need to know which fields in the metadata of an inode are going to be touched.
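The lock set that the walkthrough builds up for a create-style operation (mkdir, mknod and similar) can be summarised as a small planning function. This is a sketch of the description above, not the traversal code itself: `PlannedLock` and `plan_create_locks` are hypothetical names, lock and target names are plain strings, and snapshot locks, auth checks and the null-dentry case are all ignored.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One planned lock: mode, which lock, and the path component it belongs to.
struct PlannedLock {
  std::string mode;
  std::string lock;
  std::string target;
};

// Given path components such as {"a", "b", "c", "file0"}, list the locks the
// talk describes for creating the last component.
std::vector<PlannedLock> plan_create_locks(
    const std::vector<std::string>& components) {
  std::vector<PlannedLock> plan;
  if (components.empty()) return plan;

  // Every component before the last: read-lock its dentry.
  for (std::size_t i = 0; i + 1 < components.size(); ++i)
    plan.push_back({"rdlock", "dentry lock", components[i]});

  // The parent directory inode of the entry being created.
  const std::string parent =
      components.size() > 1 ? components[components.size() - 2] : "/";

  plan.push_back({"wrlock", "filelock", parent});  // dirstat will change
  plan.push_back({"wrlock", "nestlock", parent});  // rstat will change
  plan.push_back({"rdlock", "authlock", parent});  // permission checks
  plan.push_back({"xlock", "dentry lock", components.back()});  // new dentry
  return plan;
}
```

For the path a/b/c/file0 this yields three dentry rdlocks (a, b, c), then the filelock and nestlock wrlocks and the authlock rdlock on c, and finally the xlock on the file0 dentry, in that order.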

A

If we look at something like a setxattr, that will just take a lock on the ixattr and call acquire_locks. A lookup, or any other getattr-style call, would basically just be reading information, so that doesn't take any exclusive locks or write locks, just read locks, and then once the locks are taken it reads the different fields and returns.

A

We can quickly see.

A

I think, yeah, so we do an rdlock of the path. This is one of the other helpers, similar to rdlock_path_xlock_dentry. This one just does a basic rdlock on everything. It just rdlocks the path, and then the caller does an xlock on the xattrlock itself. So this will actually just walk the path components. Let's quickly see this.

A

Helper, I think it just does an rdlock of the path, so there's no exclusive lock of the dentry. It just rdlocks the path, then rdlocks the snaps, and then does an MDCache path_traverse. So, depending on the flags you pass, the MDCache path_traverse function behaves differently.

A

You're not passing any flag for an exclusive lock on the dentry, so it just does a simple read lock on each of the path components. And back to here: once we have rdlocked everything, we just add an xlock, because we are going to modify the xattr value.

A

So we add this; there's just one lock here, an exclusive lock on the xattrlock. The xattrlock protects the version, ctime and xattrs. Then we call into acquire_locks.
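
As a rough illustration of the setxattr case just described, here is a small Python sketch that builds the lock vector as plain data. The string naming scheme (`dentry:...`, `file.xattrlock`) and the `lock_vector` helper are hypothetical; only the lock names and the fields they protect come from the talk.

```python
# Fields each inode lock guards, as stated in the talk.
PROTECTS = {
    "xattrlock": {"version", "ctime", "xattrs"},
    "linklock": {"version", "ctime", "nlink"},
}


def lock_vector(path, xlock_name):
    """rdlock every path component, then xlock one lock on the last component."""
    components = [c for c in path.split("/") if c]
    lov = [("rdlock", "dentry:" + c) for c in components]
    lov.append(("xlock", components[-1] + "." + xlock_name))
    return lov


# setxattr: read lock the path, exclusive lock the xattrlock of the target.
for entry in lock_vector("/dir/file", "xattrlock"):
    print(entry)
```

The same helper with `"linklock"` would give the shape of the lock vector for the hard link case, where the link count is the field being modified.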

A

So, let's see what else we can look at. I think we could also do client mknod. Yeah, so since we are creating a new directory entry, we xlock the directory entry. And let's see if we can see if.

A

The linklock won't be used here. I guess it would probably be in the client link handler. Let's see. Yeah, the same.

A

Now there are these flags, snap and snap2, since we are handling two different paths here; even rename has that. So, just like path and path2, we have snap and snap2: the snap lock is for one particular path, and the one with the 2 is for the other. So we have that notation. Let's see where we are doing it.

A

This is a bit tricky too, because now we have two paths to operate on and we want to lock both of them. But let's see if we can spot it; I think it's somewhere here, yep. I'm not going into detail on this two-paths, xlock-destination-dentry helper. It should be fairly straightforward once you understand the normal rdlock_path_xlock_dentry thing.

A

So when we are doing a link, we are bumping up the link count; it's the hard link thing. So we add an exclusive lock on the linklock, which protects the version and ctime. The version is in everything: if you look at the encode functions in the CInode source, you will see that every one of them encodes the inode version, so that's there by default. And then the linklock guards

A

the nlink, the number of links for a particular regular file. So we add that as an xlock and then call acquire_locks. Let's see if we can quickly do getattr.

A

Yeah, path_traverse with basically nothing extra, so everything is a read lock. We saw the setxattr sort of thing, right? This one mostly does read locks.

A

So this probably needs some change. It is practically unreadable if you don't have all these cap notions in your head. But you can see that most of it is just doing read locks: rdlock, rdlock, rdlock and rdlock, because you are just looking up or doing a getattr. There is no write or update involved, so you just need rdlocks, read locks, and then once that's done, the same acquire_locks.

A

Let me see if I can take you guys through Locker.cc without confusing myself and everyone else. Yeah, I'll probably just defer it to the next talk. But there are a bunch of things that happen here. Once you give the MDS a vector of locks to grab, it rearranges it, first of all for correctness. You can add locks in any order, and the Locker API, which is the acquire_locks interface, will rearrange them

A

in a particular order and optimize it, I guess. So it does some lock merging and things like that, basically to ensure that the locks are grabbed in the correct order and as efficiently as possible.
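
The idea of rearranging and merging a lock vector can be illustrated with a short Python sketch. The talk does not go into the real ordering or merge rules, so everything here is an assumption: the sort key (the lock's name), the `STRENGTH` ranking, and the rule that duplicate requests collapse to the strongest mode.

```python
# Assumed ranking for the toy merge; not taken from the Ceph source.
STRENGTH = {"rdlock": 0, "wrlock": 1, "xlock": 2}


def normalize(lock_vector):
    """Merge duplicate requests on the same lock and return a canonical order.

    Taking locks in one global order means two requests that want the same
    locks cannot each hold one and wait for the other.
    """
    merged = {}
    for mode, lock_id in lock_vector:
        if lock_id not in merged or STRENGTH[mode] > STRENGTH[merged[lock_id]]:
            merged[lock_id] = mode
    return [(merged[lock_id], lock_id) for lock_id in sorted(merged)]


req_a = [("rdlock", "dir.authlock"), ("xlock", "dentry"), ("rdlock", "dentry")]
req_b = [("xlock", "dentry"), ("rdlock", "dir.authlock")]

print(normalize(req_a))
print(normalize(req_a) == normalize(req_b))
```

Whatever order the caller used when filling the vector, both requests end up asking for the same locks in the same sequence.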

A

So yeah, I think we'll just cover this in the subsequent talks. But, not to frighten anyone, these are the different lock states. So once we go into things like ScatterLock... SimpleLock is probably the easiest one; we'll go through that in the next series.

A

The scatterlocks are the most complex ones, and they use a bunch of these states. So the main difference between the SimpleLock and the ScatterLock, in terms of definition, is that SimpleLock says

A

SimpleLock says anyone can actually read lock. Sorry.

A

Yeah, right, so SimpleLock is a base class that handles all these distributed locks, and ScatterLock handles locking for more complex situations.

A

So let me see if I can find it. Yeah, I think we'll just do this next time, in the next series. I probably need to go through a bunch of these to explain the actual differences between the SimpleLock and the ScatterLock. So, just for the sake of avoiding confusion and not dragging this on too long, we'll do it next time. And this source was probably last touched somewhere in 2009 and never touched again.

A

So it's really ancient, and the locks.c source is a C source, not even a C++ source. So you need to duplicate a bunch of entries from ceph_fs.h and the header we just saw. We will probably make this much cleaner, so that it's more approachable for people trying to make sense of the whole locking thing. So these are the state machines that are used by the Locker class.

A

So you have state machines for SimpleLock. Then you have state machines for the ScatterLock, which is the actual distributed lock where we do shared reads and shared writes, and then the state machine for the file locks, which is extremely complex. The LocalLock is probably the simplest one, because it doesn't involve anything. It's just basic: anyone can read lock, but

A

The auth MDS can take the write locks, because only the auth MDS can move the states around and increment the versions for that particular inode. So we'll probably do this in the next series. I hope that, going through the different file operations,

A

things have started to make sense: at least why all these locks are required and why different lock types are used. And if you get a chance to implement a new file operation, you know which locks to acquire. If you're modifying the xattrs you need to take one lock; if you're modifying the uid or the gid, you take a different lock.

A

So I hope that is clear, and I think that's mostly what I have, so I'm happy to take any questions.

C

Thank you. So is SimpleLock kind of a distributed lock?

A

Yeah, so the SimpleLock is the base class for the lock types and their implementation, so everything is kind of a distributed lock. I'll quickly show that. If you look at the SimpleLock thing, it says anyone can read lock, which means even the replicas can read lock.

A

But nobody can do a write lock or an exclusive lock. Does that make sense? So everything is a distributed lock; the thing is, the semantics are different. SimpleLock has one set of semantics for read, write and exclusive, while ScatterLock has different semantics for read, write and exclusive, and the file lock again has different semantics for read, write and exclusive.

C

Yeah, okay. I mean, LocalLock isn't a distributed lock, right, from the name itself?

A

Right, right, that's because it's only used for bumping up the CInode (the inode) and the dentry versions, so that isn't distributed. It's only used locally by that particular MDS.

C

The other question I had was regarding the vector of locks. So you construct a vector of locks for each fop; is that the concept?

A

Right, so everything, if you see handle_client_getattr.

C

Anything that needs to happen.

C

We didn't go through the code of acquire_locks. I see that we construct a vector of locks for each fop and pass it to that acquire_locks function, right?

A

Right, right, that's what we do. So these locks are grabbed, or taken, or held until the end of the scope of the request, and once the request is done, we drop all these locks. A minor point here is that the MDS has something called an early reply, where it can reply back even before it starts journaling.

A

In that case, what happens is, I think, the read locks are dropped. So after the early reply the read locks are dropped, and then once the journal hits the disk, or the operation is journaled, we drop all the write and exclusive locks. So that's a small point. But the scope of the lock is the scope of the request itself.
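
The lock lifetime described here can be sketched as a tiny Python model. It follows the speaker's own hedged description (read locks dropped at early reply, write and exclusive locks dropped once the journal entry is durable); the `ToyRequest` class and its method names are hypothetical and do not correspond to functions in the MDS.

```python
class ToyRequest:
    """Tracks which locks a request still holds (hypothetical toy model)."""

    def __init__(self, lock_vector):
        self.held = list(lock_vector)

    def early_reply(self):
        # The client gets its answer early; read locks are no longer needed.
        self.held = [(mode, lock) for mode, lock in self.held if mode != "rdlock"]

    def journal_committed(self):
        # The change is durable, so write and exclusive locks can go too.
        self.held = []


req = ToyRequest([
    ("rdlock", "parent.authlock"),
    ("wrlock", "parent.filelock"),
    ("xlock", "dentry"),
])
req.early_reply()
print(req.held)
req.journal_committed()
print(req.held)
```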

C

Okay. I was just trying to understand this concept of having a vector of locks, and why it is maintained as a vector.

A

Yeah, so that's because we have different fields that need to be either read or modified for an operation. Like we saw, when we are creating a new dentry, like a new directory, we need to read the permissions of the parent directory for permission checks, so that requires a read lock on the parent authlock.

A

Then, once we are creating a directory, you need to update the number of directories and the stats, that is, information like the dirstats and the rstats. So that requires a couple of other locks to be wrlocked, write locked, because that information needs to be updated. And then you need the snaplocks to protect the snaps, so we rdlock the snaplocks. So we have different locks because there is no one lock that guards everything, right?
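
Putting the mkdir example together, the lock vector might look like the Python sketch below. The lock names come from the talk; the string format, the `mkdir_lock_vector` helper and the choice to show only the parent's snaplock are simplifying assumptions.

```python
def mkdir_lock_vector(parent, name):
    """Locks a mkdir of `name` under `parent` would ask for (toy model)."""
    return [
        ("rdlock", parent + ".authlock"),      # read permissions for the access check
        ("wrlock", parent + ".filelock"),      # directory stats get updated
        ("wrlock", parent + ".nestlock"),      # recursive stats get updated
        ("rdlock", parent + ".snaplock"),      # protect the snaps
        ("xlock", parent + "/" + name + ".dentry"),  # the new entry itself
    ]


for mode, lock in mkdir_lock_vector("/home", "docs"):
    print(mode, lock)
```

No single lock covers all of these fields, which is why the operation carries a vector of (mode, lock) pairs.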

A

There are a bunch of locks that cover different fields in the metadata in that particular data structure, and some need to be read locked and some need to be write locked. And to make it more complex, it could be that some of these frags are handed over to replica MDSs and they are generating read capabilities.

A

So, to update the atime once a particular read operation is done on a particular frag, all of it needs to be integrated, and then the auth MDS needs to update the atime. So you can imagine the complexity of that. That's why you need all these distributed locks and all these state machines, so that the locking is done correctly and as efficiently as possible.

C

Okay, yeah thanks.

B

Yeah, sort of the fundamental design of CephFS is that the metadata is sharded out under different locks, right? And it's really quite different from most other file systems.

A

And that introduces a lot of complexity, a whole lot. All this locking, and especially when we come to scatterlocks, which are like the nestlock and the filelock.

A

And that becomes really complex, because you are not only sharding out all this metadata to replica MDSs, you are actually splitting up a particular directory inode into different frags, and then you can make different copies of these frags and assign them to different MDSs. So that makes things interesting and complex.

A

Any other questions? It seems not. So thanks, guys, for attending this talk. We'll do a part two series; I don't know when, as the next slot is probably already taken, and probably the other one too. So we'll do a part two series which covers the actual acquire_locks implementation; we'll go through the state machines and see how SimpleLock works. So yeah, see you in August, guys. Thank you.