From YouTube: 2016-07-07 Kubernetes SIG Scaling - Weekly Meeting
Description
Public meeting recording of the Kubernetes Scalability SIG.
Check comments for meeting chat log.
B
And we will be posting these videos publicly, so don't say anything that you wouldn't want to show up on the front page of the newspaper. That's a pretty high standard, all right. So, let's start on agenda items. Is there anything that folks want to bring up and talk about now? I know, Bob, there was some talk of Samsung showing off a way of sharing scale data. Are you guys ready to demo that?
C
Yes, in a limited form. Basically, what I wanted to show is two ways that we can save and share data. It's not a complete demo at this point; I'd like to actually give this at a meetup or somewhere else where I'd actually have slides. But I wanted to show everybody the two options we're thinking about, and then get some feedback on what other people are doing and what they think.
B
Would
you
say
about
oh
I
said
I
said
the
floor.
I
think
the
floor
is
David's.
Isn't
it
yeah
I
just
want
to
make
sure
that
there's
other
stuff
that
people
really
want
to
get
to
today
that
we
can
we
can
get
to
it,
but
going
once
going
twice
all
right,
David,
please
take
it
away.
C
Okay. So, given that Prometheus doesn't have a built-in way to do this, we looked at two possible ways to do it. I'm going to show you the first way. I'm doing this locally; this is not going to be on a Kubernetes machine, but the commands are analogous, just some basic Docker commands. So to start out, I have a default Docker Machine, and... odd, I can't quite see. How do I hide myself? There we go. OK, so.
C
And it's just scraping itself, so nothing fancy. The other things to note are that I've set the storage directory to be /prometheus off the root partition, and I've set the local retention time. This is really important: because Prometheus does not intend to be a long-term storage facility, it periodically will cull data. The default is approximately two weeks, 15 days.
C
So
clearly,
if
we,
if
we
back
up
Prometheus
data-
and
we
want
to
view
it
a
month
from
now
or
a
year
from
now-
which
is
not
unreasonable
as
soon
as
you
start
prometheus,
it
will
call
all
of
the
old
data.
So
we
purposely
start
Prometheus
with
a
long
retention
time
and
in
this
case,
I
chosen
a
hundred
years.
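(A minimal sketch of the kind of invocation being described, assuming a local Docker setup; the mount paths and image tag are illustrative, while -storage.local.path and -storage.local.retention are the Prometheus 1.x-era flags in question.)

```bash
# Run Prometheus with a very long retention so old samples are not
# culled at startup. Paths and ports are illustrative.
docker run -d --name prometheus \
  -p 9090:9090 \
  -v "$HOME/prometheus-data":/prometheus \
  -v "$HOME/prometheus.yml":/etc/prometheus/prometheus.yml \
  prom/prometheus \
  -config.file=/etc/prometheus/prometheus.yml \
  -storage.local.path=/prometheus \
  -storage.local.retention=876000h  # roughly 100 years
```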
B
So is there any talk in the Prometheus community about having Prometheus run in sort of a view-only mode versus a collect mode? Because that's kind of what you want to do: you want to be able to point it at some data and have it be read-only on that data.
C
So what we want to do is go ahead and collect this data. Now, I'm skipping one small step, which doesn't technically matter. What I'm doing now is backing up all of the data: I've executed a tar command inside of the Docker container, and I've just tarred up everything in the database directory into a backup directory. Now, you can see from the way I mounted this that the backup directory is actually another directory on my host machine, right.
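(A sketch of that backup step, assuming the container was started with a host directory bind-mounted at /backup; the names are illustrative.)

```bash
# Tar up the Prometheus database directory from inside the running
# container into the host-mounted backup directory.
docker exec prometheus \
  tar -czf /backup/prometheus-data.tar.gz -C /prometheus .
```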
C
The purpose of this local memory-chunks parameter here: by setting it lower, you can cause it to flush your data to disk more quickly, but there's currently no synchronous way to ensure that all of the data is on disk. So in the real world, what you would normally do is, you know, pause for five seconds, just like in the old ext3 days, waiting for the five-second sync, right. This is one of the limitations of this method of collecting, and we'll see later a different method of collecting which doesn't have that limitation but then has other limitations. So we backed up this data, and let's...
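(A sketch of the flag and the workaround being described; the chunk count is illustrative, and since Prometheus 1.x had no synchronous flush, the pause is best-effort.)

```bash
# Start Prometheus with a smaller in-memory chunk budget so samples
# reach disk sooner.
docker run -d --name prometheus prom/prometheus \
  -storage.local.memory-chunks=262144

# There is no synchronous flush, so in practice: wait a little,
# then take the backup.
sleep 5
docker exec prometheus \
  tar -czf /backup/prometheus-data.tar.gz -C /prometheus .
```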
C
However, there aren't many tools that know how to natively manipulate that database, and in particular you wouldn't want to, because if you had Prometheus modifying that database, which would be the normal case, you would end up with corruption, yeah, with multiple things potentially reading and writing. I just wanted to point out that there is a really good article here that talks about the way the disk...
C
Now we have a Prometheus instance again. Again, I apologize: I've been modifying this since five this morning, so just a few hiccups. So we have a Prometheus server running again. The Docker Machine IP has not changed, so let's go ahead and see if we can access it. I reloaded the metrics page and, sure enough, there is data there. So let's go ahead and do a query. So, I did a query now.
C
So, let's see. How can we find out? Well, so let me finish off with what I've demonstrated so far: I've backed up Prometheus, I restarted it using a database from yesterday, and I now have the difficulty that I can't actually easily query it, because I don't remember what yesterday's date was as a Unix timestamp. Let me see if I have anything in my notes. Maybe this.
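(For what it's worth, a shell one-liner gives that timestamp; GNU date syntax.)

```bash
# Unix timestamps for now, and for this time yesterday (GNU coreutils).
date +%s
date -d 'yesterday' +%s
```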
C
One advantage of this method is that everything is in Prometheus, and it's compact, which I do not think is a particular advantage. The reason it's not an advantage is that when you actually do analysis with this data, you're going to be dealing with large data sets, and in all likelihood you're going to end up using MapReduce jobs or some other tool. I mean, you could depend on Grafana if all you want to show is a dashboard, but I think part of the value of sharing is to be able to do the sort of comparative analysis that we saw with the CoreOS scalability paper. One of the things you'll notice about the paper is that none of the graphs were dashboards. Their graphs were all generated using something like matplotlib or Bokeh, or it could even have been R. The point is, they used graphing libraries that are amenable to publication. So the current method that I've shown is that you can stand up
a Prometheus server pointed at an old database. It will start, you will have that data, and then you can point your Grafana front end at it, have all of your pretty dashboards, and, you know, do whatever you want with it. However, not everyone may run Prometheus, so there's a second way that we could collect data which doesn't have some of the limitations of the previous one. The previous one saves the raw database itself, which, again, is Prometheus-specific.
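(A sketch of that restore path: untar the backup into a fresh data directory and start a new Prometheus against it, again with a long retention; the paths are illustrative.)

```bash
# Restore the backed-up database into a fresh directory on the host.
mkdir -p "$HOME/restored-data"
tar -xzf prometheus-data.tar.gz -C "$HOME/restored-data"

# Point a new Prometheus instance at the restored data.
docker run -d --name prometheus-restore -p 9090:9090 \
  -v "$HOME/restored-data":/prometheus \
  prom/prometheus \
  -storage.local.path=/prometheus \
  -storage.local.retention=876000h
```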
C
Going to see what I've done wrong here; maybe I'm too far into the future. Well, I'm not sure. At any rate, what you saw earlier, and what I'm not able to show right now, is that you can query the Prometheus database while it's running. So instead of actually capturing the database, you can query it and get a JSON result.
C
So an alternative way of sharing data would be to just save the JSON. What you would do is, for each of the metrics on this metrics page, or any of the ones that you wanted to share, you would do a query which covers the duration of the test, and you would just curl that query into a file and then share those flat text files with others.
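(A sketch of that capture step against the HTTP API, assuming a local server on :9090; the metric name, window, and step are illustrative.)

```bash
# Capture one metric over the duration of a test run as JSON, via
# the query_range endpoint.
START=1467900000   # test start, Unix seconds
END=1467903600     # test end
curl -s "http://localhost:9090/api/v1/query_range?query=up&start=${START}&end=${END}&step=15" \
  > up.json
```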
C
Now, one of the advantages of doing this is that you then have the data in a format that you can import into any of the tools I mentioned earlier, things like R, Bokeh, matplotlib, etc., and so you can do more advanced analysis that you can't do using simply Grafana. The disadvantage is that you lose Prometheus's querying capabilities, because you only have the ability to use the query language at the time that you capture the results for sharing.
C
So it's capturing the database itself and then sharing that, which has the complexity of people having to start Prometheus to see the data but gives you the advantage of having a query language, versus exporting just simple JSON for sharing, which has the advantage that more tools can understand JSON. But, well, the size, again, I think doesn't really matter; you lose the ability to use the Prometheus query language, which I think is a pretty significant loss.
B
Then
it
would
have
any
opinions,
I
mean,
so
the
ideal
here
would
be
something
that
mixes
these,
where
we
have
something
that
is
usable
by
both
promesas
Prometheus
itself,
so
that
we
can
sort
of
retain
the
Korean
language.
But
then
it's
also
so
you
know
well
documented
with
good
tool
set
so
that
you
can
be
used
in
other
contexts.
Would
you
say
that's
correct,
I
agree
I'm,
just
I
mean
like
that
would
be.
That
would
be
the
sort
of
you
know
if
we
could
wish
for
a
pony
type
of
thing.
Well,.
C
So
I
think
the
amount
of
Jason
we're
talking
about
saving
is
on
the
order
of
you
know.
Tens
of
megabytes,
which
is
even
if
we
were
talking
about
hundreds
of
megabytes,
I,
wouldn't
I,
wouldn't
think
that's
too
much
storage
as
well
as
will
be
cheap
again.
One
of
the
reasons
that
I
think
that
actually
storing
the
JSON
is
better.
Is
that
I
think
that
at
some
point,
we'd
like
to
be
able
to
do
more
than
simply
create
dashboards,
I?
Think
nor
der
to
do
more
fine-grained
analysis.
E
So I may have missed this during your walkthrough, David, but does the second method allow for easier normalization of timestamps? It certainly felt like you were having to fumble between today and yesterday and tomorrow for the timestamp queries, and back in the if-I-had-a-pony world, I'd love to see a time series that was normalized so that, you know, zero is the first timestamp onwards. That way, conceivably, we could at least visually compare different test runs pretty easily, as if they started from the same checkpoint or something, right?
C
So that's a good point. I'm not aware of a way. I know that Prometheus does store the timestamp, and the timestamp it stores is "now," so it doesn't ever start at zero. You can have it seek to a certain time and consider that zero for the purposes of graphing, but the data itself, as far as I'm aware, is always stored as the current timestamp as the node sees it.
C
That's
something
that
if
you
were
using
like
a
Python
program
or
something
to
you,
read
the
JSON
you
could
you
could
account
for
that
and
modify
it
fairly
easily
and
I
think
Ravana.
You
could
also
seek
to
a
particular
time,
but
you're
right,
you
would
need
to
know
the
time.
I
think
that
part
of
the
difficulty
with
this
presentation
is
that,
because
I'm
not
showing
ravana,
I
did
have
to
have
a
little
bit
more
knowledge
of
the
time.
Stamping.
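(A sketch of that normalization over the captured query_range JSON, using jq; the file name and start time are illustrative. The API returns values as [timestamp, value] pairs.)

```bash
# Shift every sample so the first timestamp of the test window is 0.
T0=1467900000
jq --argjson t0 "$T0" \
   '.data.result[].values |= map([(.[0] - $t0), .[1]])' \
   up.json > up_normalized.json
```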
C
So
that's
something
that
if
you
look
at
the-
and
I
know
this
doesn't
fix
it,
so
your
points
well
taken,
I
think,
is
something
to
consider.
But
if
you
look
at
my
original
backups,
I
wasn't
including
the
timestamp
and
I
only
started,
including
the
timestamp
this
morning,
because
I
realized
I
couldn't
do
the
math
in
my
head
fast
enough.
Okay,.
E
There was certainly a lot of chatter in the Prometheus back-channel. They continue to maintain a hard line that Prometheus is not meant for long-term storage, it's not meant for archival, and it's not meant for backfilling. So their proposal for things of this nature is that you need to have Prometheus dump off to external databases as quickly as possible, something like InfluxDB or something like OpenTSDB, and it could be that that's sort of the longer-term version of option number two, I think.
E
Maybe
the
hope
here
it
was
that
option
number
one
if
he
could
best
fake
prometheus
out
with
the
right
amount
of
theta
was
going
to
be
lowest
cost
best
effort
get
us
eighty
percent
of
the
way
there
but
seems
like
we
need
a
little
more
polish
at
the
moment.
C
So
I
specifically
actually
didn't
so
you're
right.
That's
as
I
mentioned.
Maybe
not
prometheus
is
not
going
to
be
a
long
term
storage
platform
ever.
It
was
definitely
just
to
be
clear
with
soup.
E
Just over and over and over again. That was, you know, many months ago, so things may have changed, but certainly for the collection aspect, I think Prometheus dumping into its own files as quickly as possible seems like it still scales well. But the export story seems a little painful, so...
C
I purposely didn't bring this stuff up, because I think it's a more contentious and potentially long-running discussion. There is talk of using InfluxDB as a long-term storage solution for Prometheus; there's also talk of potentially using some other database. My vote personally leans towards InfluxDB, but I think there are potential problems with that route, given that the clustered version of InfluxDB became closed source as of a few months ago, etcetera, etc. So I completely agree with you that having a defined long-term storage format is something we need to follow up on, but the purpose here is really identifying the lowest common denominator that allows us to move forward now, without having to resolve the issue of whether or not we want to use InfluxDB or something else. I intend to continue looking at InfluxDB, but I think that a JSON representation is clearly the lowest common denominator.
E
To a point alluded to earlier: back in September, Samsung SDS was doing Prometheus collection on a thousand hosts, and we found exactly what you did, that InfluxDB just doesn't scale for writes at that volume. InfluxDB 0.8 or InfluxDB 0.9, take your pick: the way they handled the wide variety of columns that Prometheus threw at it, it just fell apart. So the reason I think we're talking about InfluxDB here isn't to write data into it as quickly as possible; we'd use Prometheus for that.
C
That
we
would
only
my
interest
would
largely
be
to
use
on
flex.
Db
is
a
format
for
sharing.
I
think
that
the
scalability
problems
that
people
see
within
blocks
DB
many
of
them
are
due
the
fact
that
influx
TB
has
a
format
which
takes
in
order
of
magnitude
more
space.
Timotheus,
it's
not
nearly
as
compact,
and
the
reason
is,
is
because
they
did
you
analyze
everything
so
that
every
data
point
has
all
of
its
metadata
with
it
some
steep.
This
is
a
disadvantage
because
they
think
that
they
want
to
save
space.
C
In
my
experience
and
I
had
this
at
amazon,
we
spent
a
lot
of
time
trying
to
compress
all
of
our
metrics
and
what
we
found
is
it
just.
It
was
effectively
like
our
metrics,
because
we
couldn't
get
any
of
our
tools
to
work
with
the
compressed
versions,
so
we
ended
up
deciding
that
we
could
take
the
three
or
four
orders
of
magnitude
storage
increase.
In
order
to
be
able
to
have
easy
access
to
the
data,
the
source
was
cheap.
C
The
programmer
time
was
not
so
I
think
that
prometheus
historically,
there
has
been
complaints
about
its
scalability
I.
Think
many
of
them
are
probably
due
to
the
fact
that
it
requires
more
hardware
to
do
the
same
thing
with
you.
It
does
because
it's
doing
something
it's
making
the
trade
off.
So
that's
something
we
could
revisit
later,
but
take
your
pointer
and
I
I
think
there
is
when
you're
talking
about
archiving
I,
don't
think
if
you
have
a
yeast
feel
a
little
problems
because
you're
not
again,
it's
not
real
time
ingestion.
A
I think, related to the InfluxDB comment and our earlier experiments, it started to become obvious that the InfluxDB cluster you would need to scale up in order to measure a cluster was getting to be a significant fraction of the size of the cluster being measured. And could you scale it? Maybe. But would you ever want to, in a production environment, spend that much on just an InfluxDB cluster? For this case, almost certainly not.
B
All the options have some real downsides, whether it be InfluxDB, where, you know, you guys hit the scalability issues, but there's also the closed-source clustering issue, which it looks like the Prometheus folks view as making InfluxDB pretty much a non-starter as the default option. They're looking at doing sort of generic write back ends, so that you can, you know, put anything in there. There's a lot of talk about long-term storage backends being write-only, so you can't actually read and query them.
B
So
it's
up
to
you
to
use
the
query
language
of
the
database
you're
writing
into
to
be
able
to
do
any
aggregations.
So
that's
a
little
unfortunate
there.
Also.
So,
even
if
we
did
have
a
long-term
storage
system
like
influx,
TB
or
or
open
to
ftp,
we
still
wouldn't
be
able
to
use
the
prometheus
front
end
to
query
that
stuff.
At
least
it
looks
like
that's
the
case.
A
Joe,
what's
your
what's
your
gut
reaction
on
if
we
started
publishing
JSON
from
you,
think
this
with
that
has
any
value
or
here's.
B
What
I
think
might
be
interesting
if
we
have
a
way
to
export
data
out
of
Prometheus
right
so
because,
because
we're
in
a
situation
where
the
Prometheus
storage
back
end
is
good
enough
for
scalability
runs
okay,
let's
ignore
sort
of
long-term
storage
for
four
clusters
in
general,
so
it's
good
enough
for
scalability
runs
we're
not
going
to
be
collecting
so
much
data
that
we
have
to
get
it
off
post
directly,
and
so
now
it's
a
matter
of
okay.
We
have
this
this
data
sitting
around.
B
If
you
want
to
for
sort
of
live
querying,
and
so
this
you
know-
and
it
looks
like
there's,
there's
talk
of
having
an
import,
API
and-
and
so
this
doesn't
seem
like
in
and
all
the
exporting
there's
an
exporting
PR
for
doing,
live
export
to
an
external
long-term
data
storage,
and
that's
hung
up
on
questions
of
retries
and
reliability
in
keeping
up
and
synchronous
versus
a
synchronous
versus
buffering
that
type
of
thing
I'm.
B
If
we
say
you
know
what
this
stuff
is
all
offline
and
it
you
know-
and
it's
all
you
know,
and
none
of
it's
happening
real-time,
it
may
be
easier
to
actually
move
Prometheus
in
some
of
these
directions.
If
we
want
to
try
and
move
Prometheus
I
mean
you
know,
it's
not
like.
We
have.
You
know
engineers
falling
out
of
trees
here.
So
you
know,
in
that
case
I
mean
like
exporting.
The
JSON
gets
us
there.
Another
interesting
thing
to
think
about
is
that
you
know
prometheus
is
written
in
go.
B
Is
it
possible
to
extract
the
prometheus
sort
of
query
language,
query
processor
and
run
that
in
a
batch
mode
across
json
data
right?
So
now
you
know,
I
could
imagine
a
command-line
tool
where
you
point
it
at
a
some
sort
of
you
know,
store
on
disk,
and
this
thing
just
runs
and
turns
through
that
loads
it
from
disk
as
necessary
and
uses
the
same
query:
languages
Prometheus.
So
it's
essentially
offline.
You
know,
post-hoc
analysis,
I,
think
you.
C
I think you don't even need to go that far, because Prometheus has something called exporters, and they're normally used for things like applications which aren't instrumented for Prometheus, things like tailing log files, right, that they didn't write. So you could have an exporter, and they may already have one, which would just read the JSON and ingest it into Prometheus. Okay.
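(For what it's worth, the Prometheus text exposition format does allow an optional per-sample timestamp in milliseconds since the epoch, which is what such a replay exporter would have to emit; whether the server accepts samples far in the past is exactly the question raised next. The metric names and values below are illustrative.)

```bash
# Lines a hypothetical replay exporter would serve at /metrics;
# the trailing field is an explicit timestamp in ms since epoch.
cat <<'EOF' > metrics.txt
# TYPE apiserver_request_count counter
apiserver_request_count{verb="GET"} 1027 1467900000000
apiserver_request_count{verb="POST"} 312 1467900000000
EOF
```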
B
Does that actually work with more historical data? Will it insert it, or... you know, because a lot of these things are only built to ingest stuff that's happening now, for some definition of now, right? And so, if you use one of those exporters to feed it data that came from last week, will it freak out on me? I don't know. Yeah, so I don't know. So it sounds like we're trying to use Prometheus in sort of a weird way. I think the JSON stuff, you know, if we can get this stuff exported as JSON, then that's at least, you know, something that is well understood and can be widely used. But, you know, if we're not careful, we're going to end up rewriting all the, sort of, you know, aggregation calculations on top of that JSON output, which seems silly, right.
B
One thing that I did want to mention, and we are totally out of time: if Prometheus works like I think it might work, it might be possible to actually do a little bit of a binary search across the time scale, to take a data set, take Prometheus, and say, at what times do we have interesting data for this metric? And so the question there is essentially: when you're hitting that endpoint and you do that step, does that essentially downsample it? So if you say step equals 2, does that say, I want to actually return every second step? Could you go to step equals 100, do a wider query, find out where you might have hit a data point, and then zoom in on that? That might be a way to go, also in terms of building tooling, because, again, it's like, you know, Prometheus doesn't have a way to say, you know, when is there interesting data for this time series; you have to essentially know the time you're looking for. So, you know, building some tooling around that might make this whole thing flow a heck of a lot better, yeah.
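(A sketch of that coarse-to-fine probing against the query_range endpoint; the metric, windows, and step values are illustrative.)

```bash
# Probe a wide window with a coarse step to find where data exists,
# then re-query the interesting region with a fine step.
BASE='http://localhost:9090/api/v1/query_range'

# Coarse pass: one sample per hour over a couple of days; print the
# first and last [timestamp, value] pairs that came back.
curl -s "${BASE}?query=up&start=1467800000&end=1467990000&step=3600" \
  | jq '.data.result[0].values | [first, last]'

# Fine pass: 15-second resolution over the window that had data.
curl -s "${BASE}?query=up&start=1467900000&end=1467907200&step=15" \
  > up_window.json
```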
But very interesting stuff; thanks for demoing. Cool, and with that, I think we are out of time. Thank you, everybody, and since we have this recorded, we'll try to post it and see how all that works. It's a brave new world for us. See you later today, thanks.