From YouTube: Public - Infrastructure Group Conversation
A: Okay, I hope you can all see that, if I did this right; I've been known not to do this right. So I've split this group conversation in two. I'll talk about reliability and some of the foundational aspects we're focusing on, because I believe those need a little bit more attention, and wider attention from the company. They're important, but because they're foundational they're kind of hidden. Then Marian will talk about delivery, because they're doing a lot of very significant work that is exciting and is moving how we're doing continuous delivery and continuous integration.

So first, welcome to our new hires. We have two new SREs and one DBRE, coming from all over the globe, and we just recently heard that we actually have another hire on the way from the UK. So welcome, everybody.
A: We're very excited to have you here. I wanted to take a little bit of time to talk about the storage nodes, which today are running the ext4 filesystem. ext4 is actually a great filesystem, but I think we can do better, and in discussions with the team we've decided to move to ZFS, which is essentially a volume manager and filesystem in a single tool. It has some advanced capabilities that we would really like, it has a long history, it's well tested, and it supports multiple platforms.
A: A number of people on the team have had long experience with this filesystem. And, taking a page from the security team's approach of security in depth, we want to implement data protection in depth: we want to add as many protection layers to the data that we host as we can. There are four aspects that we're worried about: filesystem corruption; deletions, meaning things we didn't mean to delete and quickly need to recover; and disaster recovery itself.
A: Geo has been working well for us, and the Geo team has been amazing in addressing issues that we've run into, but there are some things where we can add additional layers of protection by using ZFS. And then there's testing, which is something we've been trying to implement more extensively for a while, but it's expensive because you end up having to replicate data, and what we want to do is do so in a smart way. So ZFS gives us a very simple tool that implements a volume manager and a file system.
A: Snapshots...

B: I'm really sorry, but this is terrible. What is going on here? I've run database systems that are ten times the size of GitLab, and we didn't need ZFS. This is ridiculous. These kinds of data protections should happen at the database layer, not the filesystem layer. This is a solution looking for a problem, and I really don't understand it.
B: No, you don't ever run fsck on a database server; you shouldn't. You should fail over to a clean database server. Literally, in the years and years and years that I've been running database servers, I never did an online fsck of a database; it was always done offline, in triage, after we had failed over to a replica. There's no way we should ever be worrying about fsck on a database.
B: Why are we wasting time on this?

A: For instance, we've had data that was deleted and took us hours to recover, because we didn't have local snapshots. That's a seconds-long operation on something that has local snapshots. We also want to replicate terabytes and terabytes of data in a smart way.
A: I don't want to run three environments at three times the size of the production environment; I want to use clones. Once I do a clone, if I expose that clone, people can use actual data at the scale of GitLab.com, and when they're done using it, it's just a very tiny delta from the production data set. These are things that I cannot do with only ext4; these are things that the application cannot do for me.
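(To make the snapshot-and-clone workflow being described concrete, here is a minimal sketch using the standard `zfs snapshot`, `zfs clone`, and `zfs destroy` commands invoked from Python; the pool and dataset names are invented for illustration and are not the real production layout.)

```python
import subprocess

def run(cmd):
    """Run a ZFS command and fail loudly if it errors."""
    subprocess.run(cmd, check=True)

# Hypothetical dataset name; not the real production layout.
DATASET = "tank/gitlab-data"

# 1. A point-in-time snapshot: near-instant, and lets us recover
#    accidentally deleted files without a full restore.
run(["zfs", "snapshot", f"{DATASET}@before-change"])

# 2. A writable clone of that snapshot: it shares blocks with the
#    original, so a test environment only pays for the delta it writes.
run(["zfs", "clone", f"{DATASET}@before-change", "tank/gitlab-data-test"])

# 3. When the test environment is done, destroy the clone; the
#    production dataset is untouched.
run(["zfs", "destroy", "tank/gitlab-data-test"])
```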
C: I'm going to interrupt here a bit. I think there's something in what you say that makes a really good point, but I think we should be kinder to each other when presenting the point. We should definitely talk about this, but saying "this is terrible" without any qualifiers is a way to get into a very heated debate, and we need dispassionate debate.

B: I agree; I apologize.
A: No worries. I know we can be passionate about these things. I suppose I also come from environments where ZFS has actually done wonderful things for us. We wrote the blueprints, we wrote the design, and they've been available for a long time; I'm happy to sit down and talk about it in detail. Again, I'm not saying that this solves everything, or that this is the only solution. I'm saying that we can build multiple layers that help us better protect the data.
B: My concern is opportunity cost: is this work blocking other work? Actually, let me look at the slides and see how many slides mention the Kubernetes migration.
A: No, we still have to do the migration and all that, but again, there are a number of things here. Kubernetes is an important thing and no one's denying that, but if I speak to people who want to do testing against production datasets, I cannot offer that today at even a remotely reasonable cost or performance, because I need to keep copying this data over and over again. So this isn't solving all the problems perfectly; it's solving some problems that I can't solve any other way.
A: Yes, it's taking some resources, because we're making some investments in things like testing environments, and the number one thing that folks have asked for over time is being able to test with production-scale data. And, you know, Geo doesn't do that. Geo gives me a copy, but I can't touch that copy, and if I want to copy that copy, then now I have three copies.
A: With ZFS I only need two, plus whatever little deltas, and I can create a gazillion clones for people to test their apps against in ephemeral environments. You know, my dream is for an engineer to say, "I've built this new feature and I want to test it; here is my big data set", and for that to be super cheap to do. Again, I'm more than happy to have a deeper conversation about why we went down this path and why we decided to invest some resources in this.
A: I'll set up a coffee with you, Ben. All right, on to the next one. Another very foundational thing we've been working on, with Amar and other members of the team, is our services information: it's very imprecise, and there's a lot of tribal knowledge about what things are and how they work. So we've invested time in essentially standardizing how we represent services in terms of metadata. We use this for automation, we use this for monitoring.
A: We use this for figuring out incidents and dependencies, and for some other metadata that we need to carry out our business. So we've developed a service inventory with an API in front of it, and it offers structured data about services. This will allow us not just to do better automation, but also to perform auditing: when services have to meet certain characteristics, we can ask this service for information and decide whether a service is actually in an operational state that we can work with, versus...
A: ...you know, runbooks and handbooks, where we keep a lot of wiki pages and have to go dig through them. So this is up and running, run by Amar; it's called the Service Catalog, I believe, and we helped with that. They put a UI in front of it, which is super nice, because now we can go and ask questions and get very structured answers.
A: So this is a way to do away with tribal knowledge and have it in a form that is consumable by tools. If we eventually find that we need to extend it to external services, the model is simple: it's, you know, just a bunch of attributes. So if we needed to do that, I think it's just a matter of adding the data, and there may be some other attributes we'd need, and that would be fine as well.
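(As an illustration of "just a bunch of attributes", a single catalog entry might look something like the sketch below; every field name here is hypothetical, not the actual Service Catalog schema.)

```python
# Hypothetical service entry; field names are illustrative only.
service_entry = {
    "name": "gitaly",              # canonical, lower-case service name
    "owner": "gitaly-team",        # team that owns the service
    "tier": "storage",             # rough functional grouping
    "depends_on": ["postgres"],    # upstream dependencies
    "runbook": "https://example.com/runbooks/gitaly",  # placeholder URL
}
```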
A: I mean, again, this started as a YAML file that Andrew created, and then Amar expanded it, including the definitions, and then built a thin layer that allows us to ask questions about services, and then they went and gave it a really nifty UI on top of the API. I mean, it's simple, but now I can go and find things out. So when I'm cranking through our budgets, this is a lifesaver. Great, thanks.
A: I'm looking at the doc: on the service information, are we considering tracing? I know we have instrumentation that gives us that. Yes; I don't know whether Andrew is actually wiring these two together, but if experience has shown me anything, it's that once we have this more authoritative catalog of services, more things will consume this data. Andrew and others on the team may be able to answer that way more thoroughly than I can.
A: I don't know that the tracing system would use the service directory directly, but I would imagine it does use, for instance, how we represent a service name, so that we have a single representation of a service name, however you capitalize it, or maybe by ID, and then you can look it up. I know that we're looking at using this with our budgets, for instance. So instead of me saying "this is Gitaly" and capitalizing the G or not, we say: that is the thing that had an issue, and whatever its name is, I can find it later. And if we ever change a name, or in the case where, for instance, we decide to break up services for some reason, by using the centralized service directory we can do those things and the tools will consume the data. We're not hard-coding service data everywhere.
A: And there's the UI that the team built for it. Essentially, that was the first thing they built, so that we could actually play and interact with it, and then they built an API so that we can consume the JSON that the API calls return.
A: So if I'm writing a tool that requires the name of a service, or its attributes, for instance, then it makes those API calls against the service to get the information it needs. Because maybe, okay, I know from an incident that it was caused by a particular service, and then I need to do attribution to the team: instead of hard-coding that in a spreadsheet or everywhere, we just know that that team owns that service. The great thing about that, too, is that let's say a service moves teams, so some other team takes over a specific service: that team needs to update the directory, and so when I ask that same question three months later, I get the right answer.
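(A rough sketch of the kind of tool being described: asking the catalog which team owns a service, so that incident attribution isn't hard-coded anywhere. The catalog URL, endpoint path, and JSON field names are assumptions, not the real API.)

```python
import requests

# Hypothetical catalog location; the real API may differ.
CATALOG_URL = "https://catalog.example.com/api/services"

def owning_team(service_name: str) -> str:
    """Ask the service catalog which team owns a service."""
    resp = requests.get(f"{CATALOG_URL}/{service_name}")
    resp.raise_for_status()
    # Assumed field name; the actual JSON schema may differ.
    return resp.json()["owner"]

# Attribute an incident to whoever owns the service today,
# not whoever owned it when a spreadsheet was last updated.
print(owning_team("gitaly"))
```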
A: This is less concerned with tracing in terms of following a request. Think of it more as: if the trace says I hit service X and then service Y, and I need more metadata about service X and service Y, then I get it from the service directory. So this is very contextual, right? If I need to set up, let's say, some dependency monitoring, and I say service A depends on service B, then if service B is having an issue, don't page me about service A; just let me know that service B is broken. So we can build some of these functions with the directory. Right now the directory is very simple: we really want it mostly to capture the breakdown of the services to begin with, and who the owners are, and then there are a bunch of other attributes that the team decided we needed for service automation. But we'll continue to iterate on this.
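(A minimal sketch of that dependency logic, assuming the dependency data comes from the catalog; the service names and data shape are made up.)

```python
# Hypothetical dependency data as it might come from the catalog.
depends_on = {
    "service-a": ["service-b"],
    "service-b": [],
}

def should_page(service, firing):
    """Page for a service only if none of its upstream
    dependencies are already known to be broken."""
    return not any(dep in firing for dep in depends_on.get(service, []))

firing = {"service-a", "service-b"}
for svc in sorted(firing):
    if should_page(svc, firing):
        print(f"page on-call for {svc}")
    else:
        print(f"suppress {svc}: an upstream dependency is already firing")
```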
A: Robert asks: how are we minimizing the risk of ZFS not having direct Linux kernel support, and has that changed in the last year? I believe it has, because I believe they now actually ship ZFS with the kernel. I know we had some conversations about the legal aspects of this, and we decided that yes, we could do it. I don't know the details; I believe they're in the group, and if not, I can definitely get the details.
A: Then: what is our Kubernetes plan, what is the timeline, what is the holdup? Kubernetes is being worked on, and I think we're shipping two services on Kubernetes this quarter, so it is in progress. It's one of our key results, so it's being worked on, but if we want more details about it, Dave would be the person to talk to, because right now I don't have all the details. But we're working on it this quarter.
B: For example, I spent the better part of two weeks simply trying to scale up the number of web workers, because the process required a significant amount of overhead just to say "I'd like some more web workers and some more web nodes", and getting that spun up and available was not something I could do easily with the current infrastructure. On Kubernetes it would be a significantly simpler change to say "make me some more pods", and they would appear, right?
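(For contrast, a sketch of what that scale-up could look like on Kubernetes: a single `kubectl scale` call against a Deployment. The deployment name, namespace, and replica count are placeholders.)

```python
import subprocess

# Hypothetical deployment name and namespace.
subprocess.run(
    [
        "kubectl", "scale", "deployment/web-workers",
        "--namespace", "gitlab",
        "--replicas=12",  # ask for more web pods; the scheduler does the rest
    ],
    check=True,
)
```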
B: No, not yet. We've been working on getting the data storage transferred off the local disks and into Thanos storage. I actually just completed one step of stepping down the production data retention, so that the Prometheus servers only need to store a week of state in order to operate, and we'll be stepping that down further to 24 hours of state locally within each Prometheus server, and that will allow us to migrate those to Kubernetes pretty easily.
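(The retention step-down described here corresponds to Prometheus's local TSDB retention flag; below is a sketch of launching a server with 24 hours of local state, with the paths as placeholders and long-term history assumed to live in remote storage.)

```python
import subprocess

# Placeholder paths; real config lives elsewhere.
subprocess.run(
    [
        "prometheus",
        "--config.file=/etc/prometheus/prometheus.yml",
        "--storage.tsdb.path=/var/lib/prometheus",
        # Keep only 24h of state on the node itself so it can be
        # rescheduled (e.g. onto Kubernetes) without a huge local disk.
        "--storage.tsdb.retention.time=24h",
    ],
    check=True,
)
```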
F: Cool; that, obviously, is impacting our costs as well. I'll just quickly run through the single codebase effort, where we're making quite a bit of progress. The graph of the diff between CE and EE is trending down, and we're somewhere around where we were in 2015: the diff between CE and EE is the same now as it was then.
F: We introduced new environments where we're going to be testing some of our new tooling (I'm just checking the time), and we are on our way with automated weekly deploys. Whether that's done with Kubernetes or with something else doesn't really matter at this point; it will very much matter as we start speeding up this deployment process. So we need to change some processes around development, as well as how things end up in production. And finally, just a plug there...