Description
This talk will cover setting up a multi-tenant observability pipeline for metrics and traces using entirely open-source projects such as OpenTelemetry, Prometheus, Jaeger, Kafka, and Cassandra.
So today I'm going to talk about something I guess a lot of people do in their respective organizations, which is handling observability.

But today I'm going to be talking about a very specific thing that I helped build for cygnus.io, one of my clients, using a fairly new entrant to the observability block since around 2019, which is OpenTelemetry. It's also good to note that cygnus.io recently got selected for the YC winter batch, and I hope most of my work will make it through.
So, just before I get into the construction of this observability pipeline, let me describe what I personally understand by observability. This is my mind map of how we consume observability services in a cloud-native environment. Of course, we all know about the three pillars, which are logs, metrics and traces, but the way I visualize it, they are also interrelated whenever a particular event happens in any application.

We do not consume events from only a single one of these three pillars; we need information from all of them, and they intertwine in various ways for folks like us, DevOps engineers or infrastructure engineers, to make sense of what's going on inside the system. That is the point of the term observability: a system which is essentially a black box is exposing data, and with that data, with those sensors, we can make sense of what's going on inside the system.
What's the state of the system as seen from its external metrics? So we have logs, metrics and traces, but our objective is to get to the point where we can start doing RCAs, detecting SLI and SLO violations, and doing anomaly detection. What we see in our daily work is that having these different event sources isolated in different dashboards, in different data visualizations, doesn't really work. What we usually do is have two or three different screens open.
We have multi-screen monitor setups: in one place we are visualizing logs in Kibana, we are using Grafana to visualize, let's say, Prometheus metrics, and we have, say, a Jaeger dashboard to visualize the trace events. The actual correlation of all these events, to understand what is going on with the system during, let's say, a downtime or a particular outage, happens in our minds, because most of these tools have until now been dealing with these three pillars in a very, very isolated manner.

There are explicit tools that handle only one of these verticals, and so we have to mentally map all this data before we can make any sense of it. This is where OpenTelemetry steps in. This is a slide I sometimes keep for an audience that is not familiar with distributed tracing, but I spoke to Ashish today, and I might skip it because I believe most of you folks are familiar with distributed tracing, what it is and how it works.

So I'm going to skip over this. Cool.
So the reason we chose OpenTelemetry as the core pillar of this product we were building is, first, that the three pillars of observability were under one single roof for almost the first time. Before this, every project tried dealing with individual pillars. What OpenTelemetry also brought to the table was a vendor-neutral data format.

As you can see on the right-hand side of the slide, the current state of the monitoring landscape shows quite some fragmentation, or diversity, of vendors. What happened with each of these vendors and their earlier non-interoperable data formats was that once you set one thing up, you were locked into it, and you couldn't really plug and play the different formats.

You couldn't interoperate with the data; you couldn't, let's say, take New Relic data and use it to push into Jaeger. You couldn't convert it.

So this is where OpenTelemetry steps in, and many of you might know this: they created an interoperable data format called OTLP, and each of these vendors, very interestingly, stepped up and created data format converters to OTLP, so that all of this data becomes interoperable. That created a huge opportunity: suddenly you could basically take Zipkin data and convert it to Jaeger. You could take Prometheus metrics and convert them to a different type of metrics.
Let's say StatsD, for example. That created a lot of independence in our daily practices, in how we handle observability data. Suddenly we didn't have to build things in a very, very opinionated way. We could run something on the cluster but visualize it in an entirely different data format. We were no longer captive to one single particular product.

So this is one of the reasons we chose to go with OpenTelemetry, and OpenTelemetry is not a very old project; it started somewhere in 2019, and we'll come to that. The other part that we found very interesting was that there's a particular component within OpenTelemetry which lets you do more than data format conversions from one vendor type to another, or from vendor formats to open data formats.

It let us process that custom data with very minimal Golang code: I could consume certain data, add to it, enrich it and then pass it on in the OTLP format, and the exporters would automatically interpret it into the target format. We'll see how OpenTelemetry does this, but this is one of the key things that made us choose OpenTelemetry over other open formats. So yeah, we'll come to the lineage of OpenTelemetry, and this is sort of a personal thing.
Every time I work with a new ecosystem, or start studying something new, I try to go back to the point in time when the standards were written and see how the whole thing started, because every single piece of tech that we use daily is not really novel. They all have a long lineage of history that stays hidden, because not everyone digs through it, but they almost always have a decades-spanning history before they appear in front of us as a suddenly new thing.

The same goes for OpenTelemetry. If you look at this particular chart on our screen right now, it started way back in 2002, 2003 and 2007, almost two decades back.
So everybody talks about the Google Dapper paper as the source, but what people don't talk about is that the Dapper paper was itself an assimilation of three prior papers: the IEEE Pinpoint paper, the 2003 Magpie paper and the 2007 X-Trace paper. While I cannot claim to have read all these papers completely, I have glanced through them and compared them to the Dapper paper. Google does borrow a lot of concepts.

It reconciles a lot of competing concepts between these three papers to arrive at what we have come to know as the Google Dapper paper, which is central to the observability ecosystem. So what happens is that Google publishes the Dapper paper in 2010.

It releases it to the public and, as usual with any Google white paper (think of the ones that inspired HDFS), another company picks it up and tries to implement it. In our case this is Twitter. Twitter implements the Dapper paper into something called Zipkin internally, and at this time Twitter was using the Finagle system in their infra.

They used Finagle, took Cassandra as the back-end storage, and created this implementation of the Dapper paper called Zipkin, and in 2012 they released this particular implementation to the public as an open-source project and called it OpenZipkin. Sometime down the line, Uber starts doing the same thing. They already had an internal tracing system called Merckx, a bunch of you might know about this, but they adopted OpenZipkin's principles.
This is the first year or two when the CNCF was just coming up, the post-Kubernetes era. This is when the OpenZipkin and Jaeger ideas get merged into this thing called OpenTracing, and this was led by the people who were leading those particular projects: we had Ben Sigelman from the Google Dapper paper.

We had, I guess, Adrian Cole, who was at that point leading Zipkin development, and we had Yuri Shkuro from Jaeger. These three core maintainers of three very different projects decided that we need an interoperable, central format for how to do traces, and they created this thing called OpenTracing and donated it to the CNCF. It's probably one of the earliest projects to get donated to the CNCF. Parallel to this, Google, after releasing the Dapper paper, was also working on another internal product for observability called Census, and they released it to the public in 2018 and called it OpenCensus. For quite some time, OpenTracing and OpenCensus remained competing open standards, till 2019, when Ben Sigelman again steps up and says that both of these projects have very, very common goals in mind: we are doing a lot of things similarly, so why don't we merge this into something that's going to benefit the community? And unlike in other segments, where everyone was forking off and doing their own implementations, we saw a merger of two very, very popular projects into something called OpenTelemetry, as late as, I think, November or December 2019.

Don't quote me on that, but late 2019. It also borrows from the 2018 W3C standard headers for tracing, which are the Trace Context and Correlation Context headers. So it borrows certain concepts from the W3C specification, uses OpenTracing and OpenCensus as core ideas, and creates a new project called OpenTelemetry.
Development starts and, if you folks look here — and I guess our organizer is one of the contributors to OpenTelemetry — we have some of the biggest names in the industry, and not only the industry but the observability ecosystem, like New Relic, Datadog, a bunch of companies working together to make this interoperable data pipeline and format. That also contributed to why we wanted to step up and use this to build a good product. Cool, so we'll now go slightly into the architecture of this, and certain folks here might be familiar with it.
But I'll still try and explain it as best I can. The way OpenTelemetry is structured is through three core items: receiver, processor and exporter.

A receiver is basically a component which receives data in either a proprietary or an open data format. So if you have Jaeger, there is a Jaeger receiver; if you have Zipkin data, there is a Zipkin receiver; and it also has the open format called OTLP.

The project has SDKs in most of the popular languages out there, so you can integrate the OTLP SDK and create traces and metrics using it. You instrument your code, and the traces and metrics start flowing. For Ruby and Java, as the picture states, they have auto-instrumentation libraries, but for the others the instrumentation has to be done by the developer. So we basically have one receiver each for each of these different languages and data formats.

Then there is batching, where you can say: okay, batch up this many events and then send them on. Or you can do a queued retry, where it will send but wait for an acknowledgement, and if it doesn't receive an acknowledgement, it will re-queue the events and send them again later. So this is sort of the state manager of this whole pipeline.
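As a rough sketch, this is what those two processors look like in a collector config of that era; the specific numbers here are illustrative values for the example, not the ones from our deployment:

```yaml
processors:
  batch:
    # flush a batch once it reaches this size, or after the timeout
    send_batch_size: 8192
    timeout: 5s
  queued_retry:
    # hold events in an in-memory queue and retry failed exports
    num_workers: 4
    queue_size: 5000
    retry_on_failure: true
```

(In later collector versions the standalone queued_retry processor was folded into per-exporter queue and retry settings, but at the time it was a regular processor in the pipeline.)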
The processor holds the events for a certain time and processes them based on certain queuing or batching logic. But in addition to that, the processor also allows you to write custom processing logic. This could be proprietary or open: you can process the data that the receivers have provided to you and enrich it in certain ways. One of our ideas was: hey, we have trace data, so why don't we start deriving certain inherent metrics from the trace data itself? Why don't we derive latency and requests-per-second (QPS) metrics directly from the traces? We could do that with a processor implementation. And finally, once this data processing and queuing is done, it's all sent to the exporter component.
The exporter implementation is very specific to the target where we are exporting the data. So you could instrument your code using OTLP or using Jaeger, but if you plug this whole pipeline into, let's say, an OpenCensus exporter, your final output format is going to be OpenCensus. And this is true even for vendor formats, because there is a contrib repo where the vendors are actually creating these receivers and exporters, per vendor, to allow for data conversion between the different formats.
So this is again a brief overview of the internal component layout, and I would say the repo layout too. If you can see, the green components are on the core repo, and OpenTelemetry has a separate contrib repo where the vendors contribute support for their formats.

They write either receivers or exporters for their particular proprietary formats. So most of the vendor stuff, as you can see, is in the contrib repo, but most of the open-source implementations are in the core repo. For traces we have standard receivers like OpenCensus, Jaeger, Zipkin and OTLP, plus proprietary ones like SignalFx. For metrics we again have the standard ones — OTLP, host metrics and Prometheus, of course, one of the core pillars of our observability and metrics community — and then proprietary implementations like Carbon.

There is also an open Kubernetes metrics implementation which is not reliant on Prometheus; it directly grabs metrics from the Kubernetes cluster using the API server APIs. And then we have the different processors. We have the attributes processor, where you can add or remove attributes from the trace data or the event data you have received. You can batch, as I talked about, you can filter data out, you can do queued retry, and you can do sampling. We'll have a brief segment about sampling, whether it's tail sampling or head sampling, and the various types of it. Tail sampling is hard, by the way; head sampling is fairly easy to do, but while OpenTelemetry does provide implementations of tail sampling, it's still somewhat limited.

So these are basically some of the processors.
I don't have enough time to talk about each of them in detail, but some of them are fairly interesting. Sampling is probably one of the most important processors, and the attributes processor, and the filter too. Again, as you see on the exporter layer, we have exporters for each of the different formats. We have the standard OTLP, we have Jaeger and Zipkin, which are both open formats, and then we have a bunch of proprietary exporter implementations for proprietary data sinks.
What if you want to consume data in Jaeger, or instrument your code with Jaeger, but you want to send it to AWS X-Ray? That pipeline can be created through this particular structure. If you have your receiver as, let's say, Jaeger, but your exporter as AWS X-Ray, you can visualize your traces in AWS X-Ray. This is what the different exporters enable you to do. The same goes for metrics: OTLP and Prometheus, and then Carbon, SignalFx, Stackdriver, etc. Cool.
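A minimal sketch of that Jaeger-in, X-Ray-out idea; the endpoint and region are placeholder values, and the awsxray exporter lives in the contrib distribution:

```yaml
receivers:
  jaeger:
    protocols:
      # accept spans on the standard Jaeger collector HTTP port
      thrift_http:
        endpoint: 0.0.0.0:14268

exporters:
  # contrib-only exporter; the region is a placeholder
  awsxray:
    region: us-east-1

service:
  pipelines:
    traces:
      receivers: [jaeger]
      exporters: [awsxray]
```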
So one important point to note here is that the binaries for the contrib repo and the core repo are not the same. If you want to use the features, exporters and receivers marked here in red, you'll have to use the Docker image or the compiled code from the contrib repo, which is a superset of the core repo: it has the core components plus the extra components from the vendors.

But if you end up using the core repo's Docker image in your deployment, you will miss out on the components marked in red; you don't get any of the contrib components there. This was a bit of a tricky thing for me when I started building this, because I was expecting more of a plug-in system. But finally we saw that it had to be recompiled, or different Docker images need to be used, if you want different features. Cool. So now we come to our first product objective: how do we consume the client's data?
Thankfully, OpenTelemetry to the rescue. To build our internal demo — and at this point we did not have clients, we were just creating our product from scratch — we needed to emulate client data, in various formats, because we wanted data interoperability and variety as one of our core principles. So we created a load-generator application out of various existing load generators.

We had OpenCensus traces and metrics from a simple one-file Golang app; then we had the Omnition synthetic load generator for both Jaeger and Zipkin traces; and we used FreshTracks' fake-Prometheus-metrics generator called Avalanche, which basically generates fake Prometheus metrics that emulate an application. We combined all of this into a single deployment and said: hey, this is our load generator. We can turn certain dials to scale it up or down and test our internal implementation out.
So this is our load generator, and then we had to deploy the OpenTelemetry agent on the client side. The way we did it is that we had two separate clusters deployed: one emulating a client cluster, and our platform cluster separately. So we deployed the OTel agent with receivers mapped to the event emitters that we had configured.

We had OpenCensus, Jaeger and Zipkin receivers, and then we had the Prometheus receiver, which works in a different way. All the trace receivers directly receive data from the actual instrumented code, but Prometheus scrapes data. It works the way normal Prometheus works: you provide the Prometheus receiver a scrape config and a service endpoint, and it will go and scrape that endpoint and get the data out.
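A minimal sketch of that receiver block, with a placeholder job name and target (the scrape config from our setup appears again in the full pipeline YAML later):

```yaml
receivers:
  prometheus:
    config:
      # a standard Prometheus scrape config, embedded in the receiver
      scrape_configs:
        - job_name: load-generator
          scrape_interval: 15s
          static_configs:
            - targets: ['load-generator:9001']
```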
And then we fan this data in to the batch and queued-retry processors and finally out again. Initially we were not using the OTLP receiver and exporter, we were using OpenCensus, so this image shows that, but eventually we moved to the OTLP exporter.

So we fanned this into the OTLP exporter, and that is where our client-side implementation ends. And this is so awesome, because we did not have to write even a single line of code, yet suddenly we were able to consume at least four to five different types of data formats, representing different clients that we might have in the future, using a simple open-source component deployment. We consumed this on the platform cluster using the collector's headless service, and this is where the fan-out started.
So we had an OpenCensus receiver initially, but later on we moved to OTLP. We had the receiver, which again routed through the batch and queued-retry processors, and finally to three different exporters. We consumed from around four formats, and we were also experimenting to see whether we could fan that out, and whether the data would be received correctly in all the different formats. This experiment turned out to be true: when we fanned out, our Jaeger exporter had data not only from Jaeger, but also the generated data from Zipkin and OpenCensus. We had the combined data from all three different sources in a single Jaeger format, and the same went for Zipkin. We also saw that Prometheus stayed linear, because we were scraping with Prometheus and also exporting to Prometheus, so we got the data we were expecting.
So this is stage one, how we started building our prototype, and this slide shows you the YAML view of the pipeline; the previous one was the architecture view. If you see, this is how we write it: we say, hey, these are the receivers, a map of receivers. We have OpenCensus and Zipkin; these are the endpoints for OpenCensus, and if I don't mention them, it goes to the default endpoints. And for Prometheus I provide a scrape config.

We give it the job name: okay, go scrape the load generator on port 9001 and scrape out the metrics. All of this goes to the processor definitions and finally to the exporter definitions. And finally, we organize this into the pipeline schema, where we say: okay, these are my trace pipelines, and these are my metrics pipelines. The trace pipeline goes like this: the receivers are OpenCensus and Zipkin, the processors are batch and queued retry, and finally my exporters are, let's say, OpenCensus and logging. For metrics, the receivers are OpenCensus and Prometheus, and the exporters are logging and OpenCensus.
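Put together, the config on that slide looks roughly like this; the endpoints are placeholders I'm filling in for the sketch, not our actual addresses:

```yaml
receivers:
  opencensus:            # default endpoint unless overridden
  zipkin:
    endpoint: 0.0.0.0:9411
  prometheus:
    config:
      scrape_configs:
        - job_name: load-generator
          static_configs:
            - targets: ['load-generator:9001']

processors:
  batch:
  queued_retry:

exporters:
  opencensus:
    endpoint: otel-collector.platform.svc:55678   # placeholder target
  logging:
    loglevel: debug

service:
  pipelines:
    traces:
      receivers: [opencensus, zipkin]
      processors: [batch, queued_retry]
      exporters: [opencensus, logging]
    metrics:
      receivers: [opencensus, prometheus]
      processors: [batch, queued_retry]
      exporters: [opencensus, logging]
```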
The reason I have the logging exporter in both cases is that we were in debug mode and, of course, I needed to ensure that everything being received was also finally being consumed in the final data sinks. But it's probably ideal not to have the logging exporter enabled in any production deployment; it's only good for debugging, because otherwise you're basically dumping a client's data, which could be sensitive, into your stdout. So that's not recommended, but we were just building our MVP, so yeah.

Now we come to one of the most important processors, which is tail sampling, which we talked about previously. The way you do tail sampling with the given processor is that you define a policy.
You say: okay, if a particular attribute type with a certain key falls within a certain value range, then sample, or do not sample. Basically, what this particular config keys on is the HTTP status code, because we do not want to have a lot of 200 OK data in our traces; we are not worried about success cases, we are more worried about failure cases.

So that's where the first policy goes: it matches on the status code attribute's min and max values. This can also be inverted, and we can explicitly say: hey, if the error matches, or if the status code matches, let's say, 5xx or 4xx, certain HTTP status codes, we want to sample those traces individually. That configuration can be written down under the policies.
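A sketch of the inverted, errors-only variant using the tail_sampling processor's numeric_attribute policy; the wait time and trace count are illustrative values:

```yaml
processors:
  tail_sampling:
    # buffer traces in memory before making a sampling decision
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: keep-errors
        type: numeric_attribute
        numeric_attribute:
          key: http.status_code
          min_value: 400    # keep 4xx and 5xx traces
          max_value: 599
```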
And one thing about tail sampling is that it does not work in a clustered mode of OpenTelemetry, given that tail sampling happens in memory after you have received a significant chunk of traces inside one particular deployment. To do tail sampling across, let's say, different replicas of a single OpenTelemetry deployment, all those traces, all those particular events, have to be shared across replicas. That means we need some sort of consensus mechanism.

A friend at Grafana Labs had actually built one of these forks, and it's still under development. Grafana Labs also published a blog post, way back when I was building this, saying: okay, this is how we are doing tail sampling with consensus on top of OpenTelemetry. But it had forked off pretty far from the core branch, and the merge back was not possible at that point in time. I think I should connect back with my friend at Grafana to discuss what they're planning; I'm pretty sure they are still working on it, even though it was paused for a bit. So this is how you define how to sample particular traces.
So now we have solved one part of the problem. The client had data, in various formats, and we have converted it into our desirable data formats. One part of the problem is solved: we are not worried about how the client is instrumenting their code. They could instrument with anything; our solution is to give them a simple YAML which they can deploy to their cluster with the correct receiver format configured, and we will receive the data as we intend to, and then we can manipulate it.

But now we come to the second problem: we are, let's say, targeting multiple clients. That means multi-tenancy on our cluster, and the question of how we create long-term data storage which is resilient and which will thrive in a very high-scale environment. This is where we came up with a plan: we'll again use completely open-source components to build this data pipeline out.
So we needed data sinks: a per-tenant sink for metrics and one for traces. Which means we would deploy a particular Prometheus instance, in the form of a Prometheus custom resource, in the tenant's namespace, and the same goes for Jaeger: we'd just deploy a Jaeger custom resource in streaming mode, with Kafka, so that data isn't lost. We did not want to lose trace data if, let's say, one of our deployments goes down. And we planned on integrating the long-term storages which were already supported by both Prometheus and Jaeger at that point of time (and still are): we planned to use Cortex with a Cassandra back end for the metrics data, and we'd again use a Cassandra back end with Jaeger.
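A sketch of what such a per-tenant Jaeger custom resource could look like with the Jaeger Operator; the names, broker address and keyspace are placeholders for the example:

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: tenant-01-jaeger
  namespace: tenant-01
spec:
  strategy: streaming        # collector -> Kafka -> ingester -> storage
  collector:
    options:
      kafka:
        producer:
          topic: jaeger-spans-tenant-01     # placeholder topic
          brokers: kafka-bootstrap:9092     # placeholder brokers
  storage:
    type: cassandra
    options:
      cassandra:
        servers: cassandra.storage.svc      # placeholder address
        keyspace: tenant_01_traces          # placeholder keyspace
```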
So this arrived at a very nice solution where all of this data, for both metrics and traces, was coming to a single Cassandra data sink. We basically had, per tenant, a tenant-01 traces keyspace, and for the same tenant a tenant-01 metrics keyspace; each tenant's data was isolated in their own keyspaces, and we could scale it out very, very fast. So this plan worked.

What we needed to build this out was a couple of existing open-source Kubernetes operators — the Prometheus Operator, the Jaeger Operator, and a Kafka operator for the per-tenant Kafka topics we were creating — plus a deployment of Cassandra itself. Now, this is where we got stuck a little bit, because there were multiple offerings for running Cassandra on Kubernetes. There is Scylla, which is a very popular project, but it's not truly Cassandra; it's Cassandra-compatible.

It adheres to the APIs, but it's not purely the same implementation. We initially started with Scylla; I was pretty excited about using Scylla for this purpose, because it had an operator. That means we could easily create one custom resource per tenant namespace; we didn't have to have a central thing there. But this plan did not work out for us: Scylla's current methodologies still leave certain scalability things to be desired, and it did not work for us, so we fell back to using a central Cassandra cluster with Bitnami's Cassandra. And we of course had to deploy Cortex. Now, Cortex itself is a pretty big architecture to deploy and scale.
But given this talk covers a lot of other things, I'm not going to go into that much; there is a lot of documentation out there on what Cortex is.

I was discussing with our organizer just before we started that Berlin is probably the observability capital of the world right now, with all the maintainers there, at Red Hat and in other organizations, and you folks are familiar with how Cortex and the other pieces work, so I'm not going into that.

So what we needed to do was basically use these components and build a multi-tenant, isolated architecture where one tenant's data can in no way be consumed by another tenant. The data access had to be very, very controlled, and to do that we again followed standard practices. We keep the data consumer, which is our OTel collector, per tenant.
We created one OTel collector per tenant namespace, one Prometheus per namespace, one Jaeger custom resource per namespace, and one Kafka topic per namespace, and we secured this boundary with network policies and with RBAC. Since we had multiple Prometheuses running, we did not rely on a ClusterRole; each Prometheus and each Jaeger had their own Role and RoleBinding.

So, instead of trusting a single service account token or a single role, each had their own service account, trusting their within-namespace RBAC implementation, and we also used network policies to ensure that calls could only be made on certain ports within the namespace; in no way could my Prometheus go out and scrape another tenant's namespaces. We did this through both network policies and proper label-based scraping, and additionally we secured the deployments themselves with pod security policies.
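A sketch of the kind of NetworkPolicy that enforces that boundary — restricting ingress in the tenant namespace to pods from the same namespace; the namespace name and port are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-isolation
  namespace: tenant-01          # placeholder tenant namespace
spec:
  podSelector: {}               # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}       # only pods within this same namespace
      ports:
        - protocol: TCP
          port: 9090            # e.g. the tenant's Prometheus port
```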
On top of that, we ensured resource quotas were correct — pretty standard practices, nothing fancy. And then, finally, what happens is the data comes into the OpenTelemetry collector and goes on to Cortex; Cortex dumps the data into the metrics keyspace, and Jaeger dumps the data into the traces keyspace.

And finally we have the central data sink across all our tenants, two keyspaces per tenant, where we were getting all this data, and we could finally build a query layer on top, which is, of course, going to be our proprietary query layer. But our main core pipeline was getting, if you want to call it in big data terms, the three-Vs data: variety, volume and, I guess, velocity.

So we could consume data with variety, volume and velocity, and have a very stable pipeline.
Using these particular components, we dump it all to a single data sink which is queryable using standard SDKs, and we could build an application on top of it. So this slide sums up what I had gone through in my previous slides; it sums up the whole architecture and what is happening, and it's a combination of my previous slides.
We had the load-generator application and, if you follow the colored lines, you'll be able to identify how the trace data is flowing and how the metric data is flowing. The blue lines are trace data. You can see there are three sources representing our clients: the OpenCensus generator, the Jaeger emitter and the Zipkin emitter. All of this goes through the OTel agent, then finally the OTel collector, and we route it through the Jaeger exporter.

That goes to a Jaeger instance which is running in streaming mode, uses a Kafka topic for resiliency, and dumps it to Cassandra in a large distributed storage format, so that we have the data in long-term storage. Prometheus metrics also come in, through the Prometheus receiver, go out through the Prometheus exporter, and finally get dumped into the Prometheus custom resource, which remote-writes to Cortex, and Cortex again uses Cassandra as the main back-end storage. So we have both metrics and trace data in a single place.
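For the remote-write leg, a per-tenant Prometheus custom resource could look roughly like this; the Cortex URL is a placeholder, and the tenant header assumes a Prometheus version that supports remote-write headers (Cortex identifies the tenant through X-Scope-OrgID):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: tenant-01
  namespace: tenant-01            # placeholder tenant namespace
spec:
  replicas: 1
  serviceMonitorSelector: {}      # defaults to within-namespace targets
  remoteWrite:
    - url: http://cortex-distributor.cortex.svc:8080/api/prom/push  # placeholder
      headers:
        X-Scope-OrgID: tenant-01  # Cortex tenant identifier
```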
So this is two out of the three pillars of observability inside a single application, visualizable through the API layer and dashboarding in a single combined view. When we were building this — I think in the last week of my engagement with my client cygnus.io — the log integration was also added to OpenTelemetry, but unfortunately I haven't had the chance to look into that. You could build it out in the same way, though.

You could take the logging data in the same way and finally dump it into a central data store. And then we have a proper pipeline set up, on top of which you can build a unified dashboard with all three pillars of observability there, and you can build correlations on top. So the problem of multiple dashboards — frantically looking around to figure out what is going on, looking at trace data, seeing whether spikes are happening on the metrics dashboard, digging through logs in Kibana and figuring out which particular application is misbehaving, whether it's a particular pod or not — all of that frantic searching becomes much more peaceful, and now we have consolidated data in one single place. Cool, so this is the last slide.
I have a small demo prepared where I demonstrate whatever I talked about. This is a pre-recorded demo and, as Stefan said earlier in the talk, we all learned during 2020 that doing live demos during online talks is risky; it's probably the best idea to have a recorded demo. But before I go to that, I will ask you to scan this particular QR code; it will take you to the link mentioned above, where the base code is dumped.

Apologies that the video is a bit choppy at the very beginning. I was using an inherent Linux tool to record this, instead of a proprietary recorder, and the video turned out choppy; we lost pixels on the way. Cool, so let's start there. This is what we have: a config map, and we're just looking at the load generator.
We have a deployment, we have a config map, we have a Golang application, and we have a service to expose it. And finally we're looking at the actual Golang application. All of this, by the way, is available in the repo that I mentioned earlier, so you folks can go ahead and look at how the whole thing is organized. What you can see from this particular Golang code is that we are writing out how to generate traces, which traces to generate, and certain metrics as well.
Cool, so if you look at the previous frame, under the ctx tag, a new background context: this is where we are creating the trace, and on top of that we are creating the OpenCensus metrics. So the name, description, measure, aggregation and tag keys — the multiple maps that you see within the view, from around line number 6 to 35 — that is the OpenCensus metrics part.

This is just a small thing to generate fake traces and fake metrics from a single small application. So again, repeating: line number 6 to line number 35 is metrics, and then from line 41 downwards is how I'm generating the traces for OpenCensus using the OpenCensus SDK. And I have no idea why I am traversing the code backwards, from bottom to top; very unusual. So anyway.
So, as you can see, we are actually configuring the OTel agent endpoint in this particular application, and we are configuring it so that we can send both trace and metrics data to this particular endpoint using the OpenCensus SDK.

Cool, now we come to the rest of the deployment. As you can see, we have the Golang application there already; we are just mounting it inside a container and using a go run command to run it. We have the Prometheus Avalanche generator, which generates metrics; I used small numbers for this demo.
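For reference, a sketch of what that Avalanche container spec might look like inside the load-generator deployment; the image tag and flag values are placeholders matching the numbers I mention next:

```yaml
containers:
  - name: avalanche
    image: quay.io/freshtracks.io/avalanche:latest   # placeholder tag
    args:
      - --metric-count=1     # one metric
      - --series-count=50    # with 50 series
      - --port=9001          # exposed for the Prometheus receiver to scrape
```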
That's because I did not want my deployment inadvertently crashing due to high stress, but the whole setup does hold up once scaled in the correct ways. So here I have only a metric count of one — a single metric with 50 series — being exposed out of this particular load generator, and then we have, of course, the synthetic load generator, once running with Jaeger and once running with Zipkin. In both cases, if you see, we configure the OTel agent's receiver port as the target from our load-generator application.
In the case of Jaeger, we target port 14268, which is one of the Jaeger collector ports, and the same goes for Zipkin: we send the data to the Zipkin port, which is 9411, in the Zipkin format. So the receiver itself acts, or masquerades, as if it's a Zipkin instance. The application is completely unaware whether it's sending to the OTel agent or an actual Zipkin instance; to the application it's the same, because the APIs are exactly the same.
Cool, and of course we just exposed the load generator with a service. And this, which was a big part of my slides, is how the OpenTelemetry agent is configured. As you can see, we had three data sources — OpenCensus, Zipkin and Jaeger — and we have receivers for those; we have Prometheus with the scraping config; and exporters, which again point towards our targets.

Again, for this demo I was using OpenCensus, but the ideal way to do it is to use OTLP, which is the internal format. Try not to use OpenCensus; fan in to OTLP and fan out from OTLP. This was at the beginning, when I was still discovering the ecosystem, so the video shows the OpenCensus version. And this is finally the OpenTelemetry agent deployment, which I think you can find in the repo.
So let's actually move on, because just browsing through YAMLs in a demo is not going to be interesting. What we have is the OTel agent deployed and, later on, this is the logging part which I was talking about: the logging exporter basically lets me verify that all the data from the load generator is actually coming into the OTel agent, and I can verify that.

Yes, all the metrics and traces, all of that, is actually coming through, and it helps when you are initially setting stuff up; once it's done, you can just forget about it. The same goes for the OpenTelemetry collector. I am not going through the config again, because the slides covered that, but as I mentioned regarding tail sampling: this demo has tail sampling enabled, and we will have a small look into that.
Cool, so this is our final result, where we actually dump our data into the data sink. We expose the service and open up the Jaeger service. This is our final Jaeger, where our exporters from the OpenTelemetry collector are sending data to, and this is technically our platform cluster, where we have received customer data already. As you can see, we have all the data that we received from, let's say, OpenCensus and the other different emitters.

All of the data is together, in a single format, in a single place, and we can browse each service separately. So we have the data here in our first data sink, and this is all happening end to end; this is the visualization through Jaeger. So in this demo, all the data is flowing through what we discussed: the OpenTelemetry agent on the client's cluster and the OpenTelemetry collector on the production cluster, which is our platform.
Cool. We have talked about the Prometheus metrics, which were generated using the load generator Avalanche, and we can see that Avalanche has started generating the metrics, which are being consumed at the platform end, in a different cluster altogether, and we can see all the metrics coming in through the OpenTelemetry collector.

And I think with that, the small demo ends; there is not much more to it. This demo does not have the whole Cortex and multi-tenant setup, because I did not have time to get the whole thing done for today, on very short notice. But the idea is that the concepts we discussed in the slides are going to hold, and this is the backbone — the metrics backbone — of the product that we built. And yeah, with that, I think my talk is over, and I'm open for questions.