Cloud Native Computing Foundation PromCon Online EU 2021, 14 May 2021

From YouTube: Lightning Talk: SNMP done quick - tuning JunOS for metrics extraction - Ben Kochie, GitLab

Description

Lightning Talk: SNMP done quick - tuning JunOS for metrics extraction - Ben Kochie, GitLab

A

Welcome. Hi, my name is Ben Kochie, and I'm one of the maintainers of the Prometheus SNMP exporter. SNMP is a networking protocol that's used to manage and gather data from network devices, typically routers, switches, that kind of thing. It's very old, but fortunately the data model that it uses maps very well into Prometheus metrics.

A

The metric trees can be mapped into metrics. They're indexed in tables, and the indexes can be mapped to labels. This works out really well. I've got a couple of old Juniper switches. They're in a switch stack, and there are quite a lot of ports and a lot of data to gather. So let me start up a quick scrape.
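
The table-index-to-label mapping described above can be sketched roughly as follows. This is a minimal illustration, not the exporter's actual code: the two OIDs are the standard IF-MIB columns ifDescr and ifHCInOctets, while the walk results, interface names, and the helper names `column` and `to_samples` are invented for the example.

```python
# Hypothetical walk results (OID -> value); the values are invented.
IF_DESCR = "1.3.6.1.2.1.2.2.1.2"            # ifDescr column of ifTable
IF_HC_IN_OCTETS = "1.3.6.1.2.1.31.1.1.1.6"  # ifHCInOctets column of ifXTable

walk = {
    IF_DESCR + ".501": "ge-0/0/0",
    IF_DESCR + ".502": "ge-0/0/1",
    IF_HC_IN_OCTETS + ".501": 1234,
    IF_HC_IN_OCTETS + ".502": 5678,
}


def column(walk, prefix):
    """Return {table_index: value} for every row under one table column."""
    dot = prefix + "."
    return {oid[len(dot):]: val for oid, val in walk.items() if oid.startswith(dot)}


def to_samples(walk):
    """Render one sample per row: the table index becomes a label,
    and ifDescr is looked up by the same index as a second label."""
    descr = column(walk, IF_DESCR)
    lines = []
    for index, value in sorted(column(walk, IF_HC_IN_OCTETS).items()):
        name = descr.get(index, "")
        lines.append(f'ifHCInOctets{{ifIndex="{index}",ifDescr="{name}"}} {value}')
    return lines


for line in to_samples(walk):
    print(line)
```

Each row of the table ends up as one time series, and the row index is what keeps the series for different ports apart.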

A

Well, that's going. It's taking a while. I've turned on SNMP exporter debug logging, so let's see how long it takes to gather this data. Well, it's still going, so while we're waiting,

A

Let's take a look at the SNMP configuration that I've added to my Juniper switch. There's some stuff that I've left out, but this is the interesting bit that helps improve performance. The first thing I did to improve performance was add an interface filter, which drops some of the data that I don't actually need to gather from the device. There are a number of sub-interfaces, and it's a little bit cryptic.

A

But basically this drops the sub-interface data from the output of the switch. The second thing I've done is add the stats cache, which caches the data for 29 seconds. This is designed to match the scrape interval: if I hit the device twice from two different Prometheus instances, it'll serve cached data, which should be much faster than pulling the raw data from the switch. But I wanted to make sure that I didn't cache for longer than one scrape interval.
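
The reasoning behind keeping the cache lifetime under one scrape interval can be illustrated with a small simulation. This is a simplified model, not a description of how Junos implements its cache: it assumes a 30-second scrape interval, two hypothetical Prometheus instances scraping 5 seconds apart, and a cache that regenerates data once its age reaches the lifetime.

```python
def serve(request_times, ttl):
    """Simulate a device-side cache with lifetime `ttl` seconds.

    Returns (request_time, age_of_data_served) pairs; an age of 0 means
    the device produced fresh data for that request.
    """
    filled_at = None
    served = []
    for t in sorted(request_times):
        if filled_at is None or t - filled_at >= ttl:
            filled_at = t  # cache empty or expired: regenerate
        served.append((t, t - filled_at))
    return served


SCRAPE_INTERVAL = 30
instance_a = [0, 30, 60]  # hypothetical scrape times, in seconds
instance_b = [5, 35, 65]
requests = instance_a + instance_b

for ttl in (29, 31):
    print(f"ttl={ttl}:", serve(requests, ttl))
```

In this model a 29-second lifetime lets the second instance reuse data that is 5 seconds old, while every scrape from the first instance triggers fresh data. With a 31-second lifetime the first instance is served 30-second-old data at t=30, which is the same data it already collected at t=0.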

A

So I've made it one second shorter than the actual scrape interval. Let's see how that walk is doing. OK, so that walk completed, and it took 22 seconds. Well, it's not bad, but it's not great. So let's see if we can figure out why, or how to improve this. We've got two subtree walks here in the debug log. One of them took 12 seconds.

A

One of them took 9.8 seconds. Well, that pretty much matches up with the default IF-MIB, and this is the walk configuration that I've asked the device to produce data for. The interfaces table and the ifXTable come from this IF-MIB and, as you can see here, these two tables, the ifTable and the ifXTable, contain a lot of subtrees. So the first thing we can do is see what happens if we split that out. I've built an expanded tree that expands all of these subtrees, so let's run that scrape. Here's "if expanded". Let's load this and wait for those logs to finish. All right.

A

So that's going a little bit faster. Well, sort of: it's still taking somewhere on the order of five to six hundred milliseconds per subtree to gather all this data, so we haven't really improved the speed by making the scrapes more granular.
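
A rough estimate shows why splitting the walk did not help. The sketch below assumes each expanded subtree is one table column and that the device returns every column defined for ifTable and ifXTable in the IF-MIB (RFC 2863); a real device may return fewer. The per-subtree timings are the 500 to 600 milliseconds quoted above.

```python
# Column counts from the IF-MIB definition (RFC 2863); assumed to match
# the number of expanded subtrees, which the talk does not state.
IF_TABLE_COLUMNS = 22   # ifIndex ... ifSpecific
IF_XTABLE_COLUMNS = 19  # ifName ... ifCounterDiscontinuityTime


def estimate_seconds(columns, seconds_per_column):
    """Total walk time if every column costs about the same."""
    return round(columns * seconds_per_column, 1)


total_columns = IF_TABLE_COLUMNS + IF_XTABLE_COLUMNS
low = estimate_seconds(total_columns, 0.5)
high = estimate_seconds(total_columns, 0.6)
print(f"{total_columns} columns: roughly {low} to {high} seconds")
```

Under these assumptions the estimate lands in the same range as the measured walks, which suggests the total is driven by how many columns are requested rather than by how the walk is split.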

A

So it must be something about the scrape data that makes it take so long to produce those metrics. The next thing we can do is simply stop ingesting data we don't need.

A

So here's a generator config that I've created that gathers only exactly what I need from the device, which is the high-capacity counters for all the basics. Then I've created a second config that gathers all the error counters and a couple of other things like admin status, oper status, and port speed. And once this is done producing data... yeah, so that still took 24 seconds. It definitely wasn't any faster. So let's see what happens if I do the same thing.

A

And I only gather my mini config. Well, let's take a look. Let's wait for that walk to run.

A

And see how long that takes. That looks like it completed. Well, that was much, much faster. The system log seems to be a little bit lagged, but let's see if we can get that to produce more data. There we go. Yeah, so that walk only took six seconds. So the big trick, if gathering your data is too slow, is to turn on SNMP exporter debug logging, examine all of the subtrees to find out whether any specific subtree is fast or slow, and then reduce the amount of data that you're gathering.

A

Thanks. If you want to see these configurations, I put them up on my GitHub, in my tools repo under the SNMP exporter directory, and I have a lot of example configs there.
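
The first step of that tuning loop, finding slow subtrees in the debug log, could be sketched like this. The log lines below are hypothetical: the field names (`msg`, `oid`, `duration_seconds`) are an assumed logfmt-style layout and should be checked against what your exporter version actually prints. The two durations are the ones quoted in the talk, and their pairing with particular OIDs is illustrative only.

```python
import re

# Hypothetical debug lines; the layout and field names are assumptions.
SAMPLE_LOG = """\
level=debug module=if_mib msg="Walk of subtree completed" oid=1.3.6.1.2.1.2 duration_seconds=12.0
level=debug module=if_mib msg="Walk of subtree completed" oid=1.3.6.1.2.1.31.1.1 duration_seconds=9.8
level=debug module=if_mib msg="Some unrelated message"
"""

# key=value pairs, where the value is either quoted or a run of non-spaces.
PAIR = re.compile(r'(\w+)=("[^"]*"|\S+)')


def slowest_subtrees(log_text):
    """Return (duration_seconds, oid) tuples, slowest first."""
    rows = []
    for line in log_text.splitlines():
        fields = {key: value.strip('"') for key, value in PAIR.findall(line)}
        if "oid" in fields and "duration_seconds" in fields:
            rows.append((float(fields["duration_seconds"]), fields["oid"]))
    return sorted(rows, reverse=True)


for seconds, oid in slowest_subtrees(SAMPLE_LOG):
    print(f"{seconds:6.1f}s  {oid}")
```

Sorting by duration puts the subtrees worth trimming at the top of the list.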

A