Description
Fast Data on-Ramp with Apache Pulsar on K8 - Timothy Spann, StreamNative
As the Apache Pulsar community grows, more and more connectors will be added. To enhance the availability of sources and sinks and to make use of the greater Apache streaming community, joining forces between Apache NiFi and Apache Pulsar is a perfect fit.
Apache NiFi also adds the benefits of ELT, ETL, data crunching, transformation, validation, and batch data processing. Once data is ready to be an event, NiFi can launch it into Pulsar at light speed. I will walk through how to get started, cover some use cases and demos, answer questions, and discuss benefits to the ecosystem.
https://www.datainmotion.dev/
https://github.com/tspannhw
I have a little demo, but based on the timing, I'm not showing it. I'll be at booth S91 over by the cafe tomorrow and the next couple of days if you want to see some of this in the real world. Now, let's look at two different technologies working together in open source that help out with a lot of different workloads.
One of them is Apache NiFi. It is an open source project that I kind of think of as a Swiss Army knife: any time you're trying to get data from any type of place and you don't want to write any code, you could start off on one pod, scale it up to thousands, and it's very easy to use to get data, whether it's in a batch or a stream. Just get it started, get it into your data pipeline.
They run pretty fast, so we could do real-time apps or we could do batch apps. One of the nice features here is that no matter how much data you put into the message queue, we could scale to whatever level you need, because we can tier out automatically to whatever kind of storage you have, whether that's S3, some S3-compatible store, ADLS, the Hadoop file system, whatever it may be.
The cluster is pretty straightforward: you've got a number of brokers and the BookKeeper nodes for storage, and they scale up independently using anything that's standard on Kubernetes, or YARN, or wherever you may be running. And again, like I mentioned, this happens automatically. You don't have to know about it; once you have it configured, it'll just do it.
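As a sketch of what that deployment can look like, the official Apache Pulsar Helm chart lets you size brokers and bookies independently (the release name and replica counts here are illustrative; check the chart's values.yaml for the exact keys in your version):

```shell
# Add the Apache Pulsar chart repo and install a small cluster.
# broker.replicaCount and bookkeeper.replicaCount scale each tier on its own.
helm repo add apache https://pulsar.apache.org/charts
helm repo update
helm install pulsar apache/pulsar \
  --set broker.replicaCount=3 \
  --set bookkeeper.replicaCount=4
```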
You set policies that say: if data's over a certain age, or if data's over a certain size, you could have it automatically go out there, but you could still consume it as if it were in the regular local storage. That makes it very easy for you to have a messaging system that goes on forever, and if you need to start back from the beginning, you could do that.
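As a rough sketch of how that tiering is configured (the bucket name and threshold below are made up for illustration, and the exact property names vary a bit by Pulsar version), you point the brokers at an offload driver and then set a threshold per namespace:

```properties
# broker.conf (sketch): offload older ledger segments to an S3-compatible store
managedLedgerOffloadDriver=aws-s3
s3ManagedLedgerOffloadBucket=pulsar-offload-demo
s3ManagedLedgerOffloadRegion=us-east-1
```

Then something like `bin/pulsar-admin namespaces set-offload-threshold --size 10G public/default` tells Pulsar to start offloading a topic's older data once its local ledgers pass 10 GB; consumers keep reading the topic transparently, whether the data is local or tiered.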
We could also run functions within this architecture. Functions are kind of nice; it's kind of like having your own AWS Lambda that you run yourself in Kubernetes. What's cool with this is we support Java, Go, and Python with a very simple API: just deploy it, and we have all the Kubernetes operators and Helm charts you need to do it. The common use case internally: we use this in open source Apache Pulsar to do sources and sinks.
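To give a feel for how small that API is, here's a minimal Python function sketch (my own illustrative example, not one from the talk). In the simplest "native" form, a Pulsar Function is just a Python function: the argument is one message's payload, and the return value gets published to the configured output topic:

```python
# uppercase_fn.py -- a minimal native Pulsar Function (illustrative example).
# Deployed with something like (topic names are placeholders):
#   bin/pulsar-admin functions create --py uppercase_fn.py \
#       --classname uppercase_fn.process \
#       --inputs persistent://public/default/in \
#       --output persistent://public/default/out

def process(input):
    """Receive one message payload; the return value is published downstream."""
    return input.upper()
```

Pulsar also has a richer SDK form where you subclass the SDK's `Function` class and get a context object for logging, metrics, and publishing to arbitrary topics, but the native form above is the whole API for simple transforms.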
I didn't really tell you what Pulsar is; I've been going pretty fast here, trying to get everything in. But it's messaging and streaming, and they're not the same thing. Streaming is like Kafka and Kinesis: you want things in order, you're thinking about CDC, an event at a time, very fast, things like Flink. We operate that way if you want to. If you decide you want to do work queues or messaging, you send a message and don't care about order, don't care who gets it.
And we have a native connector that my friend and I worked on to make sure we can connect to NiFi. So now, if I get that data started in the system, you do very simple workflows in NiFi, which is nice, no coding, and then you've got your data in Pulsar. Just to show you how we do it: we have the open source operators and we've got a couple of managed ones, depending on how you want to run it. One of the really cool features I didn't mention is we can talk other messaging protocols.
So if you want to talk Kafka, we'll act as if we're Kafka. You want to do MQTT, you want to do RocketMQ, you want to do RabbitMQ: we can look like all of those and interoperate with any kind of messaging at the same time. So send a Pulsar message, pull it out as if it was Kafka, and mix and match as many as you want, at the same time.
If you don't want to run it yourself, you could run it in our cloud; we'll do it for you. We have a free tier you can start off with, and you can run it as if you just had Kafka that scales infinitely, and you could have a million topics without having to worry about brokers or anything like that; it runs all the Kafka stuff if you need to do that. I've got links to all my stuff here. I don't know how much time I have.
I know I'm going pretty quick here, but I only have a couple of minutes, so I'll just go on through quickly. I'll give all these slides out so you can get to all the links. If you want to see demos, examples, different things we work with, I'm in booth S91 the next couple of days; I'll show you some different demos: microservices, Spark, Flink.
We interoperate with a lot of different things, and as part of the Apache projects, you know, we work well with pretty much all of them, whether it's Kafka, Spark, what have you. I'll just show you some command line, because people like to see that, and a link to how we can autoscale our Pulsar Functions, which are our microservices in Kubernetes, just using some custom metrics; it's pretty straightforward.
Want some details on anything? Want better pictures? I don't know, want to see Batman and Robin more? I don't know. Yes?

[Audience question]
So it's really designed to do something. It could be routing, transformation, machine learning; we've got someone who implemented a SQL engine in there. It's really: something happens, do something. And what's nice is it's triggered by something going into a Pulsar topic, which, like I said, you could have millions of topics, broken down with multi-tenancy into tenants and namespaces.
If you needed a longer-running one, I'd probably say run Spark, or run Flink, or run a Google function or something else. I don't really want someone sitting in there for hours; one of these doesn't really make sense for that, and something like one of those other infrastructures would make more sense. This is really: an event comes in, and maybe I want to do real-time NLP on it, on one piece of data. One event, one log, that sort of thing. We tend to tell people: if you want to do joins, do it with Flink SQL; if you want to do ETL, Spark. We don't want to do everything in the world; we do enough with messaging and streaming. But we needed these functions, so we opened it up for all the infrastructure to use. Now we might have more questions.
[Audience question]
Yeah, there's a couple of different cloud companies in China that have created in-memory data warehouses with Flink and Pulsar together, and those are in the hundred-petabyte range, and it's fast enough that it's used for, you know, Singles' Day in China; it's part of that infrastructure, so real-time transactions, pretty powerful. I don't know if I'd use it for scientific computing, but you know Flink can do a lot in memory. We could run as much as you need in memory, and you know what's nice, too, is once it's in Pulsar.
Since, especially with the tiered storage, maybe I'll keep 50 terabytes in recent local BookKeeper storage and then put the rest, the other 500 petabytes, in S3 storage. And then I can look back and I can rerun everything that's ever happened for topics, in order, and I don't have to do any special code for that.
I could just point to the earliest offset and do that, and I could do that with the native client drivers, which support like the top 16 languages out there, or I could do that with Spark or Flink; we're first-class connectors for both of those projects. Pretty straightforward to do.
This is a typical app that I do. I have some app doing something that gets data into Pulsar, then have a function do something. For mine, it's breaking data up: I get data I pull out of a couple of different REST sources, clean it up based on where it should go, reformat it, put it into a couple of different topics, and then, as those events pop up, a Spark ETL grabs a batch of them and drops them into a table, and then I've got Flink SQL running continuously.
So as events come in, it's updating its SQL results, and that SQL can go into another topic. It can go into, you know, a file system. It can go into something like HBase or Kudu, or any of the data stores that Flink supports. So it's a nice way. This is a toy application I wrote; I know there's a lot of different areas, but each section is very simple. The client libraries are pretty straightforward, whether you're doing Java, Python, Go, Rust, Kotlin, or Scala.
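The "clean it up based on where it should go" step in that toy app can be sketched in a bit of plain Python (the field names and topic names below are invented for illustration; the real app's schemas aren't shown in the talk). Inside a Pulsar Function you'd pair this routing decision with the SDK context's publish call; keeping the decision itself a pure function makes it easy to unit test:

```python
# Sketch of a clean-and-route step for REST payloads (illustrative names).
import json

# Map a payload's "source" field to a per-destination Pulsar topic.
ROUTES = {
    "weather": "persistent://public/default/weather",
    "energy": "persistent://public/default/energy",
}
DEFAULT_TOPIC = "persistent://public/default/unmatched"

def route(raw: str) -> tuple:
    """Parse one REST payload, normalize it, and pick an output topic."""
    record = json.loads(raw)
    source = record.get("source", "").lower()
    topic = ROUTES.get(source, DEFAULT_TOPIC)
    # Reformat: keep only the fields the downstream Spark/Flink jobs expect.
    cleaned = json.dumps({"source": source, "value": record.get("value")})
    return topic, cleaned
```

A Spark ETL job or Flink SQL query then just subscribes to each destination topic; the routing logic never needs to know who consumes it.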