From YouTube: Dagster at Rohde and Schwarz Mobile Network Testing
Description
Simon Späti, Lead Data Engineer at Rohde & Schwarz, discusses the production Dagster setup at Rohde & Schwarz Mobile Network Testing.
See the full May 11, 2021 Community Meeting: https://www.youtube.com/watch?v=HRd6rEU33XM
Okay, hi and welcome. Thank you very much for having me. My name is Simon Späti, and I have the pleasure of showing you today how we use Dagster at Rohde & Schwarz Mobile Network Testing.

A little bit about myself: I'm a data engineer, I'm the author of a little blog where I also write about data engineering, and I'm an early user of Dagster.

Rohde & Schwarz is a big company. We have over 12,000 employees around the world, and we specialize in electronic test equipment, broadcast and media, cybersecurity, radio monitoring, and all kinds of radio communications.
I myself am located in Switzerland, where we work on a product called Smart Analytics. It's actionable benchmarking: you can compare different network providers to each other. You get a kind of business intelligence overview to really see mobile network testing quality, so you can see how your network is doing compared to your competitors.
To go into a little more detail about what we do: our main goal is to provide the tools for our customers to improve the quality and performance of their mobile networks. At the bottom of the slide you can see the car; this is mobile network testing in Africa, where there aren't always well-paved streets, but we have the equipment that you can see on top of the car.
We put all our smartphones in there; there's a QualiPoc that our software runs on, and you can either put that in a car or in a backpack. So if you would like to measure the network quality in, for example, a metro or a football stadium, you can use the backpack, put in the smartphones, and then configure what kinds of tests you would like to run.
We can configure YouTube tests, for example; the phones will call each other, make WhatsApp calls, and upload to YouTube. They will run all kinds of different test scenarios that, hopefully, real users will also do. At the end of the day you get a measurement file, and that gets uploaded into our Smart Analytics.
In Smart Analytics we have time-series and statistical data. If you look at the architecture, this is the testing going on: during the day, several cars and backpacks collect these measurement files, and then, typically at night or in the evening when you finish measuring, you upload them to our data warehouse, where our custom ETL runs. This is where all the magic happens: the data is transformed into facts and dimensions.
That's where we use SQL Server; if you want to plot the measurement points on a map, you would typically use SQL Server. Then we use Analysis Services, which is our cube: for statistics, you can see overall how your network is doing compared to your competitors, which tests you actually measured and for how long, and all kinds of other statistical measurements. And then, the motivation for us to use Dagster.
We want to bring this ETL into the cloud and manage all this ETL logic in a central place, and also be big-data-ready with the cloud. As you know, there are a lot of open-source tools that you need to wire together, and we want to be ready to connect all of these tools and also have a state-of-the-art tool. So, as I said, we went from on-premise to the cloud.
We also want to scale out. At the moment we have SQL Server, which can only scale up, meaning we need to buy a lot of expensive hardware in case we have a lot of data. That costs a lot, especially if you have idle time, because you pay the cost anyway. In the cloud we want to use cheaper machines and only scale up if we really need the resources, so we can also scale down and save money.
We have Jupyter notebooks for ad-hoc analytics and machine learning models. The measurement files, the ones we talked about before, get uploaded directly into our S3 storage, and from there we ingest into our data warehouse, which is Apache Druid. Druid is our replacement for our cubes; it's well suited for us because it has sub-second response times, even on large data sets, and its architecture is built so that you can really separate ingestion from query time.
We also use Spark for processing the data and creating Delta tables. We have general services in our Rohde & Schwarz cloud; we use Kubernetes for that, and with it we get monitoring, logging, and resource scaling out of the box. And then, of course, the heart is Dagster, which is where we really put all our code. The import pipeline is mainly based on eventing, so we use a lot of event-driven pipelines.
As mentioned, sensors are a big part of that. If you have never used sensors in Dagster, this is how they look in the Dagit UI. It's very nicely done; you can really see all the dots.
Each dot is when a sensor triggers. You can say how often it should poll; it then checks some Python logic and starts a job. You can see which runs got spawned, and then you can click through all the pipelines and see which pipeline has been started.
From the UI perspective this is very nice, but so is implementing it: the only thing you need to do is put an @sensor annotation on a function and name the pipeline you would like to start. For us it's very handy; we mostly use sensors with S3, and we can build some glue logic into them where we define which files we would like to read.
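For readers who haven't used sensors: a minimal sketch of the S3-driven pattern described here, in the pre-1.0 Dagster API used in this talk. The pipeline name, bucket, and config layout are hypothetical.

```python
import boto3
from dagster import RunRequest, sensor


@sensor(pipeline_name="etl_pipeline", minimum_interval_seconds=30)
def measurement_file_sensor(context):
    # Glue logic: list the uploaded measurement files and decide which
    # ones should trigger a run (bucket and solid names are made up).
    s3 = boto3.client("s3")
    objects = s3.list_objects_v2(Bucket="measurement-uploads").get("Contents", [])
    for obj in objects:
        # run_key deduplicates: Dagster skips run keys it has already
        # seen, so the same file never spawns a second run.
        yield RunRequest(
            run_key=obj["Key"],
            run_config={"solids": {"ingest_file": {"config": {"s3_key": obj["Key"]}}}},
        )
```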
This is our import pipeline, just to give you an overview of how it looks. As mentioned before, we don't have one overall pipeline: when we upload our zipped files, a sensor gets triggered, and it parallelizes per uploaded file. All of the unzipped files get uploaded again, and then a next sensor takes each of these files immediately and runs the ETL pipeline on it.
We really try to scale as much as possible, and that's also why we use sensors: we have different granularities between these sensors and pipelines. The ETL pipeline then does all the logic we had before, so our facts and dimensions come out as Parquet files; based on the fact tables, these get created as Delta tables on our S3 storage, and then we also ingest them into Druid with Python.
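A condensed sketch of that two-stage shape, again in the pre-1.0 solid/pipeline API; all names and the trivial bodies are hypothetical stand-ins for the real unzip/transform/ingest logic.

```python
from dagster import pipeline, solid


@solid(config_schema={"s3_key": str})
def unzip_file(context) -> list:
    # Stage 1: unzip one uploaded measurement file and re-upload the
    # parts; in the real setup a second sensor then fires per part.
    context.log.info(f"unzipping {context.solid_config['s3_key']}")
    return ["part-1", "part-2"]  # placeholder keys


@solid
def transform_to_facts(context, keys: list) -> str:
    # Stage 2: build the fact and dimension tables, written as Parquet.
    return "s3://warehouse/facts/"


@solid
def ingest_into_druid(context, path: str):
    # Final step: submit an ingestion task to Druid for the new data.
    context.log.info(f"ingesting {path} into Druid")


@pipeline
def etl_pipeline():
    ingest_into_druid(transform_to_facts(unzip_file()))
```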
That is the pipeline part, but we actually also use Dagster for provisioning. Whenever we have a new user or a new tenant in our cluster, we use Dagster to create the buckets, upload the secrets to our Vault, and create the Druid account. The more we use Dagster, the more new use cases we find that we can integrate with it. So it's not just for data pipelines.
That is very powerful for us, also for other kinds of administrative tasks that we would like to automate. And then we also use assets. Assets are the part where you link your data to the computation.
It's maybe a rather newer feature, but it's very well suited if you want to add metadata, or if you want to show your persisted data to your customers. We played around with it in the ETL: we added sizes and durations, and then we could immediately see if there was a spike in some ETL jobs.
In the UI you can also click on such a point and really drill down to the actual pipeline to see which file it was and what happened there. This is a very nice feature for us, and we would like to add more: we try to create assets for all our persisted Druid tables and also for our Delta tables.
In the future, we would also like to try the data lineage that just got added to Dagster. Because we have this event-driven pipeline from the zip files all the way to the actual Druid fact table, we sometimes have some wrong data, and then it would be nice to see where that data is actually coming from. We are hoping to use the data lineage feature to document that and make it available to all our customers and also to us as engineers.
If you haven't used assets, it's actually quite easy. In your solid, which is one part of your pipeline, the only thing you need to do, besides yielding your output, is to yield an AssetMaterialization as well. That's exactly what we do here to get the output we just saw before: we just add a key, which is the unique identifier. Here we are still playing around.
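A minimal sketch of that pattern in the pre-1.0 API; the asset key, metadata labels, and placeholder values are hypothetical.

```python
from dagster import AssetMaterialization, EventMetadataEntry, Output, solid


@solid
def build_fact_table(context):
    rows, size_mb, duration_s = 10_000, 42.0, 12.3  # placeholder results

    # Besides the regular output, yield an AssetMaterialization so the
    # table shows up on the Assets page with size/duration metadata.
    yield AssetMaterialization(
        asset_key="fact_table",
        metadata_entries=[
            EventMetadataEntry.float(size_mb, "size (MB)"),
            EventMetadataEntry.float(duration_s, "duration (s)"),
        ],
    )
    yield Output(rows)
```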
So far we've been using it for quite a bit now, and the advantage of using Dagster, in the first place, was that we could replace the custom processing engine we used on-premise. Now we get massive out-of-the-box features: we have restart capabilities, we have backfills, we have dependency management, we can see what's running, we have the UI, and we have different modes, so you can just switch from production to local.
A
Just
you
have
this
modes
that
you
can
just
switch,
so
all
of
these
features
are
tested
and
stable,
so
we
just
get
them
out
of
the
box.
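A sketch of how such mode switching looks in the pre-1.0 API; the resource name and the two hardcoded stub resources are hypothetical.

```python
from dagster import ModeDefinition, ResourceDefinition, pipeline, solid

# Stand-in resources: a local dev warehouse vs. the production one.
local_warehouse = ResourceDefinition.hardcoded_resource("sqlite:///dev.db")
prod_warehouse = ResourceDefinition.hardcoded_resource("mssql://prod/dwh")


@solid(required_resource_keys={"warehouse"})
def load_facts(context):
    context.log.info(f"loading into {context.resources.warehouse}")


@pipeline(
    mode_defs=[
        ModeDefinition(name="local", resource_defs={"warehouse": local_warehouse}),
        ModeDefinition(name="production", resource_defs={"warehouse": prod_warehouse}),
    ]
)
def warehouse_pipeline():
    load_facts()
```

Running with `execute_pipeline(warehouse_pipeline, mode="local")` or `mode="production"` swaps the resources without touching the pipeline code.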
That was a nice implication for us when we started using Dagster. And the beautiful Dagit UI is very nicely done; it shows exactly what you need to know. You always see the state of each pipeline and the currently running jobs, and you get rich metadata. That's also a big plus for us: we can add our own custom metadata and make it available in the UI easily, without doing too much. The problem solving is also a big thing for us; before, we used other methods.
Sometimes it was actually hard to find the error itself, so we had to dig through pipelines to really find the error, and sometimes we couldn't even find it. With Dagster, you have the error straight in your face.
We also have developers that didn't come from the data sphere, and they have told me that it's easy for them to grasp the concepts: the resources, the solids, and the pipelines all made sense to them. So we could also ramp up new developers quite fast on Dagster, and it's actually very pleasant to write pipelines.
When you have such a nice framework around you, there's a lot of thought in it that you don't need to think about yourself; it's just built in. And everything is self-documented, so you don't actually need to draw diagrams anymore, because the pipelines in Dagster are self-documented: you can really see each step.
You can even put SQL statements inside there, to really see what's actually going on, together with the assets, as mentioned. So you spend less time explaining what's going on and can instead really discuss the business transformations and logic, what needs to be done.
One of the biggest things for me was the reusability of code. Before, we had microservices in Python, and we could very easily move them to Dagster: we had classes, we just added them as resources, and then we already had them inside Dagster. That was a very elegant way to reuse existing code.
It also reduced the boilerplate code for us, because when you have microservices, you tend to integrate the logging, the restarts, and so on. You always need these kinds of things, but normally, with different microservices, you re-implement them every time. With Dagster you either get them out of the box already, or you build it yourself, but then you do it exactly once. That was very good for us. And it's also functional by design.
You define a resource exactly once, and then you can use it in your pipelines just by specifying the resource; you can then access it through the context. The context is very powerful: for any resource you specify, just with context.resources.druid you have access to all of that resource's methods. Besides that, the context also has additional functions, for example the run ID.
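A minimal sketch of that resource pattern in the pre-1.0 API; the Druid client class, its config keys, and its query method are hypothetical stand-ins.

```python
from dagster import resource, solid


class DruidClient:
    # Hypothetical stand-in for a real Druid client library.
    def __init__(self, host, user, password):
        self.host = host

    def query(self, sql):
        return f"results of {sql!r} from {self.host}"


@resource(config_schema={"host": str, "user": str, "password": str})
def druid_resource(init_context):
    # Usernames and passwords are configured once, here, not in solids.
    cfg = init_context.resource_config
    return DruidClient(cfg["host"], cfg["user"], cfg["password"])


@solid(required_resource_keys={"druid"})
def query_druid(context):
    # context.resources.druid exposes the client's methods, and the
    # context also carries extras such as the run ID.
    context.log.info(f"run {context.run_id}")
    return context.resources.druid.query("SELECT 1")
```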
You really get a powerful tool there, and you don't need to fiddle around with passwords and usernames, because you do that once in the resource and then you're ready to actually write the pipeline logic. Another very nice feature of Dagster: when we started, there wasn't yet a Kubernetes deployment, but just when we were about to start with Kubernetes, there it was, and we now use Kubernetes heavily.
It's also very handy for us, because we have SQL Server on Linux in one deployment, and in another one we use Spark, so it's easy to separate them with the user code deployments, and we can also scale out easily with new pods. It makes things very easy for us, and understandable as to what's going on. Another big plus for us is that it's Python-based. The language of data nowadays is Python, so it's easy to learn and adopt for engineers, and it also supports SQL statements.
You can have a solid that takes SQL statements and easily integrate that as well, and there are also powerful extensions like dbt and other things that you can easily integrate with it.
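A sketch of a solid that takes a SQL statement as config, in the pre-1.0 API; the in-memory SQLite connection is just a self-contained stand-in for the real warehouse.

```python
import sqlite3

from dagster import solid


@solid(config_schema={"sql": str})
def run_sql(context):
    # The SQL statement arrives as config, so the same solid can be
    # reused across pipelines; here it runs against an in-memory DB.
    with sqlite3.connect(":memory:") as conn:
        rows = conn.execute(context.solid_config["sql"]).fetchall()
    context.log.info(f"fetched {len(rows)} rows")
    return rows
```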
That's it for the advantages; now the next steps. We're not yet doing that much unit testing and smoke testing, so we would like to improve there; at the moment there's a lot of manual testing. We would, of course, like to have some test files that automatically get tested, and documentation. We would also like to use assets more intensively.
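A sketch of what such a smoke test could look like in the pre-1.0 API; the trivial pipeline here is hypothetical, since the talk only names this as a next step.

```python
from dagster import execute_pipeline, pipeline, solid


@solid
def add_one(_) -> int:
    return 1 + 1


@pipeline
def smoke_pipeline():
    add_one()


def test_smoke_pipeline():
    # Execute the whole pipeline in-process and assert on the result,
    # the kind of automated smoke test mentioned as a next step.
    result = execute_pipeline(smoke_pipeline)
    assert result.success
    assert result.result_for_solid("add_one").output_value() == 2
```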
Maybe we'll even integrate the data lineage feature in an automated way. We also started some Dagster pipeline guidelines, just to have some best practices: let's say, when to use assets, what the namings are, how to use resources. It's very basic, but we would like to keep growing it. And specifically, I would also like to try the dynamic orchestration. As I said before, we have many sensors starting low-level pipelines, but I would like to try whether we can instead have one pipeline dynamically spawn all the sub-pipelines. We're also not using partitioning yet, but everything in our environment is based on files, so partitions would make sense there as well.

That's it from my side. Just contact me anywhere if you have some questions, or feel free to ask them later on.