From YouTube: User Defined Functions in Apache Cassandra 3.0
Description
Speaker: Robert Stupp, Consultant
User Defined Functions (UDFs) allow users to code their own functions in Java or a JSR-223 scripting language. The presentation describes the current status of UDFs and their use.
Hello, everybody. I want to present some new functionality that will be introduced in Cassandra 3.0. It's called user-defined functions, and I think it's a great feature, especially since we usually tell people not to run any code inside Cassandra.

Some words about me: I'm a contributor to Apache Cassandra. I built the UDF stuff, and I'm currently working on the row cache for 3.0. Basically, I'm working as a freelancer on my own, helping customers build good Cassandra solutions.
Cassandra 3.0 is brand new and actively developed, so everything is still under development and everything may change. Please don't blame me if anything I tell you now turns out to have changed. Cassandra 3.0 will bring a lot of new features, as Jonathan already said today, a lot of great improvements, and user-defined functions is just one of them.
So what does it basically mean? With user-defined functions you can go ahead and start writing your own code and let it execute on the Cassandra nodes. And one thing I'm really proud of is that your own code gets automatically distributed to the whole cluster. But please don't take that last sentence word-for-word.
As I said, it's basically simple to set up a user-defined function: you have a bunch of arguments (one, two, or more), you have a return type, you specify the language you're using, and you specify the source code you're using. The source code will be injected into a class; this happens transparently, so you don't have to bother about it. So: just arguments, a return type, and some language.
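As a minimal sketch of such a statement (the keyspace, function name, and behavior here are illustrative, not taken from the talk), it might look like this:

```cql
-- Hypothetical example: a Java UDF that doubles an int value.
-- Arguments, return type, language, and source code are all declared inline.
CREATE FUNCTION IF NOT EXISTS myks.double_it (input int)
    RETURNS NULL ON NULL INPUT
    RETURNS int
    LANGUAGE java
    AS 'return input * 2;';
```

The body between the quotes is plain Java; Cassandra wraps it in a generated class for you.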
Behind the scenes, Cassandra takes the code you have written in the CREATE FUNCTION statement, builds the Java class or the script file, compiles it, and loads the code. The function is transparently propagated to every other node in the cluster, similar to a CREATE TABLE or CREATE INDEX statement, and then it's executable in the whole cluster, on every node.
So yeah, what is it for?
This is basically the syntax to create an aggregate. If you want to build, say, a minimum function, you just give the aggregate a name, an argument type, a state type it has to maintain while it scans rows, and the name of the UDF that is used to calculate the minimum. Internally it works like this: the aggregate has a state, and for each row that is scanned it takes the current state plus the value from the row, calculates the new state, and returns it. The last state is the return value. What you previously had to do was scan all rows, return them to your client, and calculate the minimum on the client side. But that alone wouldn't work for something like an average.
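A minimal sketch of such a minimum aggregate (all names here are illustrative) could look like this, with the state function folding each row's value into the running state:

```cql
-- Hypothetical example: a user-defined minimum aggregate over int values.
-- The state function keeps the smaller of the current state and the row value.
CREATE FUNCTION IF NOT EXISTS myks.min_state (state int, val int)
    CALLED ON NULL INPUT
    RETURNS int
    LANGUAGE java
    AS 'if (val == null) return state;
        if (state == null || val < state) return val;
        return state;';

CREATE AGGREGATE IF NOT EXISTS myks.my_min (int)
    SFUNC min_state
    STYPE int;
```

The state starts as null (no INITCOND is given), and the last state after the final row is what the query returns.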
Well, for an average you have to build a sum and divide it by the number of rows, so you need something that runs after the last row to calculate the final result. That's called the final function. And you can even use tuple types for the state; a tuple is, as you can see in the syntax, still just one type.
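An average aggregate along those lines (again, the names are illustrative) might keep a (sum, count) pair in a tuple-typed state and divide the two in the final function:

```cql
-- Hypothetical example: an average aggregate with a tuple<bigint, bigint>
-- state holding (sum, count) and a final function that divides them.
CREATE FUNCTION IF NOT EXISTS myks.avg_state (state tuple<bigint, bigint>, val int)
    CALLED ON NULL INPUT
    RETURNS tuple<bigint, bigint>
    LANGUAGE java
    AS 'if (val != null) {
          state.setLong(0, state.getLong(0) + val.intValue());
          state.setLong(1, state.getLong(1) + 1);
        }
        return state;';

CREATE FUNCTION IF NOT EXISTS myks.avg_final (state tuple<bigint, bigint>)
    CALLED ON NULL INPUT
    RETURNS double
    LANGUAGE java
    AS 'long count = state.getLong(1);
        return count == 0 ? 0d : ((double) state.getLong(0)) / count;';

CREATE AGGREGATE IF NOT EXISTS myks.my_avg (int)
    SFUNC avg_state
    STYPE tuple<bigint, bigint>
    FINALFUNC avg_final
    INITCOND (0, 0);
```

Note that the tuple state is still declared as a single STYPE, which is the point made above.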
It's not the intention to execute arbitrary heavyweight code there. What I mean is: don't pull in any evil dependencies, costly dependencies, things that have to wait for something else, because that would slow down your whole cluster, and I think you don't want that. What we also want to do in 3.0 is add some permissions, for the DDL statements and for DML, so you can control whether someone can create or execute a function or not.
The functions you are already used to, like now(), count, or the timestamp conversion functions, are called native functions. There's a reason they belong to the system keyspace: you can't modify them, and you can't even drop them. Each user-defined function or aggregate, on the other hand, belongs to a keyspace, just like a table or a user-defined type.
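Because a UDF belongs to a keyspace, you can qualify it with that keyspace when calling it, even from a query against a table in another keyspace (all names here are hypothetical):

```cql
-- Hypothetical usage: calling a keyspace-qualified UDF in a SELECT
-- against a table in a different keyspace.
SELECT myks.double_it(amount) FROM otherks.orders;
```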
Yes, some words about scripting. As I said, scripting is nice, it's really nice: you can code Scala or Groovy or something like that in your UDF and let it run on your cluster. But you have to keep in mind that scripting has a lot of overhead. During some tests I found out that the overhead just to execute, for example, JavaScript is about a thousand times slower than Java. Keep that in mind; I would strongly recommend just using plain Java.
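For contrast, a scripted version of a trivial UDF (hypothetical names again) looks almost the same; only the LANGUAGE and the body's syntax change, but every call pays the scripting overhead described above:

```cql
-- Hypothetical example: the same kind of doubling function as a JavaScript UDF.
-- Functionally equivalent to a Java version, but much slower per call.
CREATE FUNCTION IF NOT EXISTS myks.double_it_js (input int)
    RETURNS NULL ON NULL INPUT
    RETURNS int
    LANGUAGE javascript
    AS 'input * 2;';
```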
The aggregate stuff was built by Benjamin Lerer. As I said, I'm a geek, and I just wanted to know a bit more: why does it execute on the coordinator? Why isn't the aggregation, for example, distributed to the whole cluster, so that the whole cluster does the work and returns the results back?
What could you use them for? You could use a UDF for indexing, as Jonathan said in his presentation, for partial indexing, for more advanced filtering, and even, though it might be a bit complex to implement, for something like a distributed GROUP BY, where the individual nodes aggregate data on the partitions they own and return these partial results back to the coordinator, which does a final merge.
[Inaudible audience question]
Not really much, if you stick to Java, because the code will be JIT-compiled. Maybe that is putting it a bit strongly, but for a single execution we're talking about nanoseconds, maybe some microseconds, per row. It will grow if you use JavaScript, because then you easily get into the millisecond range.
[Inaudible audience question]
What's happening in Cassandra is that the coordinator builds your result set and then sends that result set to the user. When you apply a function, you build your result set, then you apply the function to the data, and the result set stores the result of your function instead of the original value.
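As an illustration (table and function names are hypothetical), the result set of a query like this contains the function's output in place of the raw column value:

```cql
-- Hypothetical usage: each returned row holds double_it(amount),
-- not the original amount column.
SELECT id, double_it(amount) FROM myks.orders;
```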