From YouTube: Project Operations 2019-08-19
A: Stephen has enough credits; he's done. Welcome to the IPFS Bifrost, gateway-as-a-service product. Ethos: let's run IPFS really well for lots of users. Regular weekly call. On the agenda today we have the OKR check-in; we don't need to spend too much time on that. Let me just share my screen.
B: Terraform has a different view of the world and the config than nginx, or than Ansible, and so when building nginx it writes an invalid config file. That config file fails when you try and restart the daemon; that causes user data to fail; that causes Terraform to fail, and, like, the box is just in a sad state. So you go massage it and get things sorted. Can we... let's.
B: So it's... it is not able to deploy the certs that the nginx config file mentions, and so you can't restart the service. So we partly migrated off of it; we just left Terraform in its sad state. Should not have let it be there. It would be nice if we could wire up a nightly to catch that a little earlier, but I'm midway through fixing it and should have the PR for that up within the hour. It's just deleting stuff from Terraform, basically.

A: Nice.
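One way that nightly could look, as a sketch (the job wiring and paths are hypothetical, not what ends up in the PR): `terraform plan -detailed-exitcode` exits 2 when live infrastructure has drifted from the config, which is exactly the sad state described above.

```sh
#!/usr/bin/env sh
# Hypothetical nightly drift check; run from the terraform directory.
terraform init -input=false >/dev/null
terraform plan -detailed-exitcode -input=false >/dev/null
rc=$?
# Exit codes: 0 = no changes, 1 = error, 2 = config/infrastructure drift.
if [ "$rc" -eq 2 ]; then
  echo "terraform drift detected" >&2
  exit 1
elif [ "$rc" -ne 0 ]; then
  echo "terraform plan failed" >&2
  exit 1
fi
echo "no drift"
```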
A: I've raised the issue for it, but I was wondering if you guys have had any thoughts about it, to make sure that my assessment is correct: we're relying on BIRD BGP and DNS to do a kind of passive failover when a machine completely dies. Like, the BIRD daemon has to fail for the machine to be taken out, and then there's still a kind of latency of a few minutes for the DNS to stop routing traffic. Is that correct?

C: Yep.
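As a concrete picture of that failover path (a sketch using the standard BIRD and systemd commands; the service name on our boxes may differ): the route is only withdrawn when bird itself stops, so taking a machine out deliberately looks like this, followed by the few-minute DNS drain.

```sh
# Inspect the BGP session state BIRD is maintaining:
birdc show protocols
# Withdraw this machine's route by stopping the daemon; DNS-routed
# traffic still takes a few minutes to drain afterwards.
systemctl stop bird
```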
C: Yeah, yes. It would potentially be a priority to deliver the redesign, as in have the nginx and BIRD daemons coupled on one machine, and then I think some people have suggested having nodes upstream of those, so that we can at least have, like, some sort of backend checks and failovers. So I think right now, yeah, the idea that BIRD is our only health check is a big flaw and probably contributes to a lot of our issues.

A: Cool.
A: At this point, what would a proper health check look like? My suggestion was, like, an IPFS get request for the empty directory, which seems pretty light. Stephen, is that a reasonable health check at this point? Do we have anything more satisfactory, or is that a good one?
Sorry, what was that? Okay: we're talking about useful health checks for whether an IPFS daemon is in a functional state, so that nginx can stop routing traffic to a stuck IPFS process, and...
D: Due to the layers of caching everywhere, I think you'd actually just, like, force it to go through Bitswap occasionally (yep), and, like, time an add on one machine and the fetch on the other machine. Though if you just want to test "is this thing responding", that's the lighter check.
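A minimal sketch of that lighter probe, assuming the daemon's local gateway is on the default 127.0.0.1:8080; the CID is the well-known empty unixfs directory, so it exercises the daemon without touching Bitswap. An occasional fetch of a CID that is not locally cached would be the Bitswap-forcing variant mentioned above.

```sh
# Is the local ipfs daemon responding at all? The empty directory is
# nearly free to serve, so this only proves basic liveness.
curl -sf --max-time 10 \
  "http://127.0.0.1:8080/ipfs/QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn/" \
  >/dev/null || echo "daemon unresponsive" >&2
```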
A: At the level of: we want it to serve get requests for data. So this is on a gateway node; it's fronted by nginx; it's high traffic; and we keep seeing IPFS daemons getting kind of intermittently frozen up or unresponsive. Or maybe they are responsive, I think... yeah, we don't have the answer on the failure mode.
D: Okay, isn't, like, the first question: have they frozen up and stopped serving data that they have, or are they just not serving things [they'd need to fetch]? Exactly; do we have a way to get that level of granularity? I'm assuming "frozen up" means not serving even things they have locally cached, which means, yeah, the test is that you find the machine, you add a file [on another node], and you try to fetch it on the first machine...
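A sketch of that two-machine test (hostnames and paths hypothetical): add fresh random data on a healthy node, then fetch it through the suspect one, which forces a Bitswap round trip rather than a local cache hit.

```sh
# Add ~1 KiB of fresh data on a healthy gateway; -q prints only the hash.
CID=$(ssh gateway-b 'head -c 1024 /dev/urandom > /tmp/probe && ipfs add -q /tmp/probe')
# Fetch it on the suspect machine with a hard timeout.
ssh gateway-a "ipfs --timeout=30s cat $CID >/dev/null" \
  && echo "gateway-a can serve uncached data" \
  || echo "gateway-a is stuck" >&2
```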
D: ...piping data through; otherwise there's, like, not that much to it. So, sorry, we don't really even have that many locks. Okay, so there is one potential problem here: maybe you run out of file descriptors somewhere, or nginx refuses to open more than a certain number of connections, something like that. Then it's, like, literally not responding to connections, and that's a big issue.
A: That was something that I wondered about, for sure: whether we can just use some garbage collection tuning. It seems like it doesn't run that frequently, and then, when it does run, it just totally freezes up the box. We got an out-of-memory error on the AMS box, so it could be due to that kind of thing, right? If the GC queue gets too large or everything... yeah, yeah, well...
D: It's not that GC is a piece of crap; the way GC works is it locks the entire datastore, which means everything else has to wait while it walks. Although in this case, this is a funky case... is there anything actually pinned on these gateway machines or not?

A: There is now.

D: Okay, so that would make GC slightly slower. GC first has to walk the pinned data, and then walk the datastore and figure out what's not in that pinned set. All right, just...
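A quick way to gauge what that walk costs on a live box (a sketch; numbers will vary per gateway):

```sh
# How many recursive pins does GC have to walk before sweeping?
ipfs pin ls --type=recursive | wc -l
# Time a manual collection; the datastore is locked while this runs.
time ipfs repo gc >/dev/null
```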
A: The thing that's changed on the gateways recently: because we're seeing slow response times for content that is on cluster, even though we're trying to maintain connections to cluster (we regularly see just slow response times), we have deliberately pinned all of the PL websites and the Filecoin proof parameters to the gateways. In total that's about 20 gigabytes of content, including dist.ipfs.io, which is the lion's share of it.
D: It reads the pinned set into a "do not delete" cache; then, once it's done that, it walks through everything in the datastore, but it just reads the hashes ahead and deletes everything that's not within the set of things it's already read. So if there's nothing to keep, it's a clean sweep; if you have a few things to keep, it's still much faster than actually reading all the data. But the second users pin a lot of things, GC will slow down.
D: We haven't really followed the package-managers use case, because there you actually just need to be able to add and retrieve files quickly; they don't really need to GC that much. It's somewhat important, but, like, actually a lot of actors want to keep old packages around. It's people running IPFS infrastructure who do care about GC and all these, like, bigger system issues.

A: Yep.
A: Anywho, okay, let's put a pin in that. So the outcome of that is: because we are now pinning things on the gateways, GC is now costing us more in terms of RAM and CPU time when it does a GC. So we may trigger GC more proactively, so that it's not operating on a full repo. But wouldn't we just shrink down the high-water mark, like the max storage value? Rather than triggering it ourselves, just give it a smaller max; that'll...
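The knobs in question, assuming a stock go-ipfs config (automatic GC also requires the daemon to run with --enable-gc); the sizes here are just illustrative:

```sh
# Shrink the high-water mark instead of triggering GC by hand:
ipfs config Datastore.StorageMax "40GB"
# GC kicks in when usage reaches this percentage of StorageMax:
ipfs config --json Datastore.StorageGCWatermark 90
# How often the periodic GC check runs:
ipfs config Datastore.GCPeriod "1h"
```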
C: They were seeing the download start, but then it would taper off to, like, you know, zero bytes being transferred; people would cancel, so they're getting frustrated. So they decided to try using ipfs get, so they would be able to download the proof parameters without involving the gateway. And what was interesting about that: those that got it working (there were some problems, but those that got it working) still had major problems with downloading the proof parameters. So some people reported that a 1.5 gigabyte file
would take two and a half hours or more, and they would see the same behavior of the download tapering off and then starting again. So this is a little bit of anecdotal evidence that some of the issues we've been facing on the gateway are potentially not unique to the gateway and are more global to the IPFS network. So just some context: some of the things that we've been seeing aren't necessarily remedied by using a direct connection or by bypassing the gateway.

A: Yep.
A: My understanding of this is that we've got patches onto the gateway nodes and we have the DHT boosters, but correct me if I'm wrong, Stephen: we sort of know that there are still performance issues with the DHT, really.

C: Yes, yeah.

A: So we know for sure that, like, we are not out of the woods; we're not even...
C: Just wanted to say that my main area of interest from Petra's planning document seems to be around service levels and indicators, and I just want to point out that I'm still quite blocked on making any progress on that, because we still don't have a lot of great metrics coming in. So, like, one blocker is definitely the log exporter thing. I just want to make sure that that's still a prioritized thing, because I'd like to be able to have a success-rate indicator for the gateway, but currently that's impossible with the metadata we have.
C: Because the only metadata, or, like, labels, that are associated with those status codes... there's barely any. So you can see that we have, say, for example, a 499 code, but there's no other information that is relevant to it. So we don't have, like, what the request path is, the referrer, any of that sort of stuff.
C: So the 499s are a big potential source of, like, "people are having problems", but because it could be anything, we have no way of trying to make sense of what those 499s mean without some more relevant labels attached to them. So the log exporter would export all the fields that we currently see in our logs in Kibana, which would give us a lot more options, so we'd be able to, like, dig into and make sense of the 499s, which is critical for our success-rate indicator. Thank you.
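Until that exporter exists, a rough interim slice of the 499s straight off the access log, assuming nginx's default combined log format (field 7 is the request path, field 9 the status):

```sh
# Count 499s (client closed the connection) by request path:
awk '$9 == 499 { print $7 }' /var/log/nginx/access.log \
  | sort | uniq -c | sort -rn | head
```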
C: Exactly, and, as I say, that also blocks incident response, being able to declare an incident. I made a comment on that issue, and I think, yeah, that's true: we only have black-box metrics still, and that's, like, extremely scary. You want to have some indicators that are actually pointing to some, you know, precise data that we have. Okay.
A: ...a no-op, but it does pass a weight through: when you do a call to connect, it passes through this kind of score, this, like, fixed, hard-coded 100 score, that it passes through to the connection manager. And what I'm not clear about is what happens inside the connection manager with that score: does it then reset the score back to 100 if you...
D: It just gets set, so it doesn't matter.

A: Gotcha. So even if I call it continually: the tag is "user-connect" and the value is 100, and if I keep calling that it's always 100; it can never be anything other than 100.

D: Yes.

A: Okay, good to know. All right, we're out of time; we're four minutes over. If anyone has any questions you can ask now, but time's up. Alright, there's so much more to talk about; these meetings are too short at 30 minutes; it keeps us from... See you later.