From YouTube: Turbocharge Your Data with Cube & Dagster
Description
Integrate Cube with Dagster to seamlessly connect your data transformation pipelines and orchestration tools with your semantic layer. Handle your coarse-grained transformation tasks with Dagster and land the data in your warehouse, then define your business logic and metrics in Cube to provide consistent analytics throughout the organization. Use event-based integrations to trigger your pre-aggregation builds efficiently and easily, saving time and compute resources.
A: In today's webinar we're talking about some pretty exciting stuff: we're going to show you how you can turbocharge your data pipelines with Cube and Dagster.
A: We have Brian and Tony representing Cube, and Pedram from Dagster, to talk about all things Dagster and Cube. But before we get started, just a few logistical pieces: if you have any questions along the way, definitely enter them in the Q&A and click submit, and we'll answer them during the live Q&A portion at the end of the webinar. The recording of the webinar will be available on demand right on our events page.
A: So you can check it out at any time right after the live show. What we'll be discussing today: what is Cube, what is Dagster, then we'll jump into data orchestration, and then we'll close out with a live demo of the Dagster and Cube integration in action, with Q&A to follow. So without further ado, I'm going to hand it off to Brian to get things started. Welcome, Brian.
B: Hey, thanks, Nathan, I appreciate it. Hey everybody, my name is Brian Bickel. I'm the head of Partnerships here at Cube, and I have the good fortune to manage all of our ISV relationships and our SI relationships, so I get to work with Dagster, and I'm happy to be here today.
B: So as Nathan said, I'm going to start us off by talking about Cube. Cube is a universal semantic layer product, and our mission is to power the next generation of data applications across an organization. To unpack that a little bit: in the modern data world there are a lot of different ways to use data, whether that's multiple business intelligence tools, embedded analytics and customer-facing use cases, and now these emerging use cases around AI agents that we're all sort of forced to deal with.
B: If you think about all of those different use cases, the major challenge that we see is that customers have a lot of different data sources, and then a lot of different downstream use cases that they're trying to map all these different tools to. One of the common approaches in the past was simply doing the data modeling in each of your individual downstream tools, whether that was in a business intelligence tool or with some kind of lightweight data model in a notebook or something like that.
B: This creates a problem, because every time we add a new data tool, that involves repeating ourselves and repeating our data model. It creates a situation where we end up with a lot of unnecessary extra work, and a lot of opportunity for models to become stale or inconsistent, and for different teams to end up with different calculations, different metric definitions, or potentially different insights and disagreements based on incorrect data.
B: Doing that modeling once in a semantic layer instead also allows for a lot of flexibility, because the data models and the architecture can change alongside your needs as you grow your business. Now, from a features perspective, let's talk about exactly what a semantic layer is. We break this down into four major groups of functionality. As far as what Cube connects to: Cube connects to relational databases, cloud data warehouses, lakehouses, query engines of all shapes, and a lot of oddball time-series databases as well.
B: Typically, if something can speak SQL, we can usually connect to it and start to ingest data from it. Once inside Cube, we provide four major sorts of functionality: data modeling, access control, caching, and APIs. To start to break some of that down: for data modeling, we are a code-first data modeling experience where users can use either JavaScript or YAML, and that YAML can be templated; Python-augmented YAML is coming in an upcoming release.
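For readers who haven't seen Cube's code-first modeling, here is a minimal sketch of what a YAML cube definition can look like; the cube, table, and member names are illustrative rather than taken from the demo.

```yaml
cubes:
  - name: orders
    sql_table: public.orders   # any table your warehouse can serve

    measures:
      - name: count
        type: count
      - name: total_amount
        sql: amount
        type: sum

    dimensions:
      - name: status
        sql: status
        type: string
      - name: created_at
        sql: created_at
        type: time
```

Because the model is plain text, it can live in git and move through development and production branches exactly as Brian describes next.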
B: You can build your data models there, and then you can version control them in different flavors of git and move them through development to production branches. Once your data model is in a place where you like it, we do access control: access control that's able to do things like column masking, role-based access control, row-level security, and other interesting multi-tenant requirements that you may have. And then our caching engine is a product that we built.
B: We call it Cube Store, and it is all custom development built around fast pre-aggregation. It is an aggregate-aware system, so if you build a roll-up for, let's say, a time dimension, or a specific dimension that you're interested in rolling up, you don't have to inform any of your BI endpoints or anything else downstream about it, and you don't have to change your definitions or point to a different view like you would with something like a materialized view.
B: Whenever a query comes in, we figure out the most performant cache or pre-aggregation to solve the problem for you, and then we rewrite the query. Finally, everything leaves Cube as a REST API, a GraphQL API, or via our SQL API, which speaks Postgres. The SQL API is how we get to most BI tools; that's typically what they'll be consuming. REST and GraphQL may be more suitable if you're building your own custom front end with a framework like React, Angular, or Vue, or any of a myriad of other products you could use.
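As a rough illustration of the REST side, a downstream app can send Cube a JSON query over HTTP; the deployment URL, token, and member names below are placeholders, so treat this as a sketch rather than a copy-paste example.

```python
import requests

CUBE_URL = "https://your-deployment.cubecloud.dev/cubejs-api/v1/load"  # placeholder
CUBE_TOKEN = "YOUR_SIGNED_JWT"  # placeholder API token

# A Cube query names measures, dimensions, and time dimensions from the data model.
query = {
    "measures": ["orders.count"],
    "dimensions": ["orders.status"],
    "timeDimensions": [
        {"dimension": "orders.created_at", "granularity": "month"}
    ],
}

resp = requests.post(CUBE_URL, headers={"Authorization": CUBE_TOKEN}, json={"query": query})
resp.raise_for_status()
for row in resp.json()["data"]:
    print(row)
```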
B: And then this emerging category of AI agents and chatbots sort of splits on this: some of them consume REST and some of them are writing their own SQL, so they might consume different endpoints.
B: As far as the benefits go: if you have multiple downstream tools, it's tempting to have different data models in each one, or to end up with inconsistency and not be able to keep the data models or the access control policies similar between all of them. Performance is improved in a lot of ways by our pre-aggregation and caching engine, and with the ability to not define data models and security within each downstream tool, but to do it once, you improve flexibility and time to value and future-proof yourself against downstream changes.
B: To kind of orient you on where the semantic layer sits, if you aren't familiar with the idea: in the modern data stack we are usually adjacent to the data sources, whether that's cloud data warehouses like I mentioned, relational database products, lakehouses, and other things. We are not a transformation tool, but we do typically connect to coarse-grained transformations that are landed back in the data source. That could be through dbt, it could be SQLMesh, it could be a number of other products.

B: Relevant to the conversation today, we do have an integration with Dagster as an orchestration tool, which we'll talk about later. But again, we don't compete with orchestration; we run right alongside it as part of the modern data stack. And then downstream you can see all the use cases we talked about already. With that, I'd like to hand it over to Pedram to tell us about Dagster.
C: To really understand Dagster, I want to first talk about data engineering and where we're at today. I think a lot of us are familiar with the pains of data engineering. There's probably no doubt in any of your minds that there are just too many tools, and it seems to be getting worse. Even with the proliferation of the modern data stack, things have become so fragmented that now, as a data engineer, you have to go and look through five, six, seven different apps just to find out where something went wrong and what's going on in your data.
C: And that's just part of the problem. Let's say you do find it: once you do, you probably don't have the right context to understand what's going on. Traditionally, when we write orchestration tasks, we have these big tasks that do a whole bunch of things, but too often the task doesn't really tell us what's inside of it. So if we're trying to understand what happened with one of our models, there's no clear sense of which task created that model in our database, or what table came from where. We lack this full context on what's going on within orchestration, and things tend to be a bit of a black box.
C: The other side is that as teams start to grow, we start to see a lot of silos happening, and often there's very little shared context between teams, because things are so hard to integrate and we don't have all that context.
C: It's also hard to have a local dev environment that allows for easy testing. For example, I remember when I used to use Airflow a bunch, I had to install Kubernetes on my laptop just to test a pipeline that I was going to push into production. That was a really heavy lift, and often you end up pushing things to production over and over again, just hoping for something to work, which is not something we love. Those pains are really what drove Dagster to exist.
C: It was a tool designed to solve these fundamental data engineering problems that, in a lot of ways, software engineers have already figured out, but where we haven't really had the right tools in place. What Dagster enables you to do is write your code and test your data pipelines in pure Python. You don't need to start creating abstract classes in order to do simple testing; everything works pretty much out of the box, you've got great monitoring, and the UI is really intuitive. And what's nice is, because we look at things in a little bit of a different way (instead of focusing on tasks, everything is based on an asset, which I'll talk about in a second), you get all these nice ancillary benefits like being able to track data lineage and metadata.
C: So let's talk a little bit about data orchestration. There are really two overall styles of orchestration. There's the traditional task-based orchestration that tools like Prefect and Airflow are really good at, and what they're focused on is, again, that task: you say, I need to do this thing in order to get some outcome, here are all the steps I've got to do; you wrap that in a task, call that a DAG, and then you ship it. And because of that, you really only have cron scheduling.
C: A cron schedule is (a) not efficient and (b) not always accurate. If you have to do retries and that type of stuff, the schedule quickly gets out of hand. The other part is that it's always very hard to test: if you've ever tried to test an Airflow DAG yourself, you know that's not an easy thing to do. And because everything is centered around tasks, you kind of lose access to all the things you actually care about.
C: You don't have any metadata, for example, on the tables and how many rows were written, unless you explicitly pull those things down. On the data orchestration side, which is the model Dagster uses, we've kind of spun things around, and, to be honest, it took me a while to wrap my head around what that actually means, but once I did, it unlocked a lot of powerful features for me. Instead of thinking about tasks, which is "I need to do this thing, let's get it done", we think about the asset: the fundamental thing we actually want to produce. By flipping that around, we now have a nice way of thinking about the things we actually care about.
C: The task is kind of the imperative way of writing things; the data asset itself is the declarative thing that you want. And so when we start thinking about the table that we need, or the report that we need, or the model that we need, we can start to do a lot of really interesting things.
C: We can define upstream and downstream dependencies for that specific model, rather than trying to figure out which tasks it takes to create that thing. Because of that, we can move beyond just cron, and we get event-driven and SLA-based scheduling. So if there's a report that is really important and you want it refreshed every hour, you can now do that, and upstream of it you can say: here's the minimum set of things I need to run in order to make sure this asset is fresh.
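A small sketch of what that SLA-style scheduling looks like in Dagster code, assuming a Dagster release where `FreshnessPolicy` is available; the asset names and bodies are made up for illustration.

```python
from dagster import Definitions, FreshnessPolicy, asset


@asset
def daily_tickets():
    # Placeholder upstream asset; in practice this would pull from a source system.
    return list(range(100))


# Declare the outcome we care about: keep this report no more than 60 minutes stale.
# Dagster can then work out the minimum set of upstream runs needed to honor it.
@asset(freshness_policy=FreshnessPolicy(maximum_lag_minutes=60))
def hourly_report(daily_tickets):
    return {"ticket_count": len(daily_tickets)}


defs = Definitions(assets=[daily_tickets, hourly_report])
```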
C: And, conversely, if you have something that runs once a day and you don't really need it to be that up to date, you can say that this asset doesn't require that much compute and can wait until the next day to run. The other nice thing is that, because we're thinking in terms of assets, these things become really easy to test, especially with Dagster. We have a system of resources, so you can assign different resources based on your environment. For example, in production we can say: run Kubernetes, run AWS, run Snowflake; but then, when we're testing, we can say: do the same operations, but use DuckDB and local file storage. Because we've written these things in a decoupled way, it's very simple and very inexpensive to test. And this final thing is, I think, what becomes really interesting, and you'll see this in the demo: when we start thinking in terms of assets instead of tasks, you now have a system of record.
C: You can start to see your entire pipeline in terms of the things it's producing, rather than the things it's doing, which I think is a real big unlock. So I've talked a little bit at a high level about what an asset is, but let's take a quick look at how you even write an asset.
C: It's really not that different from what you're probably used to. An asset is just a wrapper around a function that produces a thing that you care about, and here we have three assets: country_stats, change_model, and continent_stats.
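A minimal sketch of those three assets, with placeholder bodies; the point is that each one is an ordinary Python function plus a decorator, and that naming an upstream asset as a function argument is what wires the dependency.

```python
from dagster import asset


@asset
def country_stats():
    # Produce the raw per-country table (placeholder data).
    return [{"country": "US", "pop": 331}, {"country": "CA", "pop": 38}]


@asset
def change_model(country_stats):
    # Depending on `country_stats` by parameter name gives us lineage for free.
    return [{**row, "pop_change": row["pop"] * 0.01} for row in country_stats]


@asset
def continent_stats(change_model):
    # Roll the modelled rows up one level.
    return {"North America": sum(row["pop"] for row in change_model)}
```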
C: And this goes beyond just the assets that you create and care about; it also extends into integrations. For example, we integrate with a lot of modern tools such as Airbyte, dbt, and Cube as well, and we have these integrations in the library, so you can leverage them and very quickly get your entire dbt repository available to you with all the assets materialized.
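For dbt specifically, pulling a whole project in is roughly one call; the paths are illustrative, and the exact helper depends on your dagster-dbt version (newer releases use a `@dbt_assets` decorator instead).

```python
from dagster_dbt import load_assets_from_dbt_project

# Every model in the dbt project becomes a Dagster asset, with dependencies
# read from dbt's own graph, so lineage shows up in the UI automatically.
dbt_assets = load_assets_from_dbt_project(
    project_dir="analytics_dbt",          # path to your dbt project (illustrative)
    profiles_dir="analytics_dbt/config",  # path to profiles.yml (illustrative)
)
```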
C: Now you get introspection into all those models and their dependencies, which is really powerful, because you can see that this dbt model has the Airbyte orders table as a source, while this other one has the users table as its source, and so when those are updated independently of each other, you can incrementally update things as needed.
C: So that was the talk; let's jump into a demo. We're going to do this live, so let's see how it goes. We'll start with our friend the terminal, and I just wanted to show what a sample project looks like; I'll send a GitHub link to this as well.
C: Really, there's not a lot in here: there are assets, constants, and resources, so I'll start with resources. A resource is really that connection to the outside world, and in this case I just have one simple resource, a DuckDB resource, and this is the two lines of code I have to write to initialize that resource.
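Those "two lines" look roughly like this, assuming the dagster-duckdb integration and a local database path of your choosing; older Dagster releases configure the resource slightly differently.

```python
from dagster import Definitions
from dagster_duckdb import DuckDBResource

# One resource definition, shared by every asset that needs a DuckDB connection.
duckdb_resource = DuckDBResource(database="data/birds.duckdb")  # path is illustrative

defs = Definitions(assets=[], resources={"duckdb": duckdb_resource})
```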
C: If there's something I want that's a little bit more complex, for example this custom Sling resource, I can write the code for that as well. So for things that are supported out of the box, it's quite simple and relatively easy to get a resource connected, and for things that might be your own custom tooling, it doesn't take a lot of effort to create those resources either. You can see I have this one function called sync: I ask it for a source table and a destination file, I run some operation, and then I return the number of rows and the file size.
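A hand-rolled resource along those lines might look like the sketch below; the class name, its field, and the sync return values are illustrative, not the actual code from the demo repo.

```python
from dagster import ConfigurableResource


class SlingResource(ConfigurableResource):
    """Illustrative custom resource wrapping an external replication tool."""

    connection_uri: str  # e.g. a Postgres connection string

    def sync(self, source_table: str, destination_file: str) -> dict:
        # In the real resource this would invoke Sling to copy `source_table`
        # into `destination_file`; here we only show the shape of the result.
        rows_synced = 0
        file_size_bytes = 0
        return {"rows": rows_synced, "file_size": file_size_bytes}
```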
C: The resource is used by a thing called an asset, and an asset, like I said, is just a function that has that decorator attached to it. In this example I'm taking a bunch of data from various sources such as DuckDB and the internet: I'm downloading zip files and extracting them, I'm getting some data from Postgres, and I'm getting some data from an API for some SaaS application, for example. All of these things can be represented as assets.
C: Using those resources, I can start to extract those assets, and you'll see that represented in Dagster shortly. So I'll just quickly show what these assets look like. I have this asset decorator; I give it a compute kind of python, so I get a nice picture at the end, and I give it a group, if that's helpful. And really, these are all the lines of code I needed to extract that asset.
C: I can use an existing function I already had in order to download that asset and extract it, and then I just output some metadata. I can do that for the different types of checklists that I have here. I can even do it for, let's say, DuckDB: I can ingest the CSVs, use DuckDB to run some straight SQL in there, and then create an asset like that. So overall it's just very simple Python; there's not a lot of boilerplate around it.
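Putting those pieces together, one of those raw-data assets might be sketched like this; the names, group, and metadata keys are illustrative.

```python
from dagster import MetadataValue, Output, asset


@asset(compute_kind="python", group_name="raw_data")
def raw_sites():
    # Download and parse the file however you already do it (placeholder here).
    rows = [{"site_id": 1, "state": "WA"}, {"site_id": 2, "state": "NY"}]

    # Attaching metadata means every materialization records row counts, paths,
    # and so on, which Dagster then plots over time in the asset view.
    return Output(
        rows,
        metadata={
            "num_rows": len(rows),
            "preview": MetadataValue.md("| site_id | state |\n|---|---|\n| 1 | WA |"),
        },
    )
```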
C: This is the repo, and if I load Dagster and make sure it's loaded, we're here. All that work I did in the background you can see visualized here; it's really intuitive to understand. All the raw data is my various data sources: for example, these are coming from the web and I'm using Python to download them, I'm using Sling to get data from a Postgres database, and I'm using Steampipe to get data from Mastodon's API. So it's easy to connect various different sources and resources to get that data.
C: I can click here and materialize, for example, only the raw data if I wanted to, or I might as well just click materialize all and run the entire pipeline. So this pipeline takes the raw data, DuckDB ingests some of it, there's a little bit of light aggregation going on, and then this asset here, all_birds, is really the culmination of all this work: we take all this data, we join it together, and we create that dbt model, and you can see the upstream data feeding it.
C: Things are sequenced properly; anything that can run in parallel will. I don't have to specify which things should run and which shouldn't; all that sequencing is done for us by Dagster. Then, as these assets are created, they're materialized, and what that means is we get this nice little feature here: we see this "view asset" link and can click on it. Maybe I'll make this a little bit bigger.
C: This is the sites dataset; it's an asset for the table I've created. There's a table name, database name, the path, and the number of rows, and what's nice is, as I run this over and over again, I'll get a plot of the row count and the size over time, so I can keep an eye on these things. If something went wrong, for example, I'll have a sense of what happened there.
C: So I'll go back into our asset graph here, and we can see everything is materialized. If, for instance, I updated this raw dataset on its own, you'll see, once that's updated, that Dagster is aware that tickets and daily tickets are now out of date. Here you can see that the upstream data is now out of date, and so I can choose to click these two here and materialize those, or I can click this little triangle and say materialize anything that's missing.
D: Things can run as needed and be triggered when something else completes further up in the stack. We don't have to try and try again every five minutes during the morning rush, depending on when a file lands, so that keeps us very efficient on the database and transformation front. So now we're going to see how Cube builds on this foundation and delivers the business value, the last mile, through to the rest of the organization.
D: We'll start by taking a look under the hood of Cube to see how we get to a pre-aggregation that we want to keep up to date. First, we've got that model, not the duck but the birds model, on DuckDB, so that'll be a fun pun here. What we've modeled on top of it is four cubes that pick up that data. We've got bird_toots: this is the whimsical name for the messages on Mastodon.
D: This is not something else, for those of you with an eyebrow raised right now. We've also got birds, sites, and species from the observations. These are related together, and we handle those relations through a series of joins. Basically, we've created a couple of keys that handle the relationship between those two, or rather between those three. Sites are basically where the bird was seen.
D: Species is the type, with more information on the kind of bird that was observed, and then the birds table is just the big table of observations. So in our dimensions we can define those custom keys to help with any composite keys that we need for joins, we can bring in latitude and longitude, and we can compute dates.
D: A lot of the quantifiable fields that we'd expect to hold quantities are actually string fields in the source data, which brought in numbers or NAs. So instead of nulls or spaces or zeros, it was NAs and numbers.
D: We want to make sure that we're handling those things, and we can do that through Cube's kind of last-mile data cleansing here in the SQL: we just cast those quantifiable columns as integers, and we use TRY_CAST, which returns null if the cast fails, so it won't error out on us. And then we're also taking some derived metrics as well.
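In the YAML data model, that last-mile cleansing is just SQL on the member definition; the column and member names here are illustrative.

```yaml
    dimensions:
      - name: how_many
        sql: "TRY_CAST(how_many AS INTEGER)"  # 'NA' strings become NULL instead of erroring
        type: number

    measures:
      - name: total_birds_observed
        sql: "TRY_CAST(how_many AS INTEGER)"
        type: sum
```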
D: So, for example, we can get the number of observations, and we can get the number of hours that were spent (people, I imagine, sitting in little hides or on a chair in the field, enjoying nature and counting things), and we can tabulate how long was spent per sighting. We can also see how many observations there were per sighting: you know, a big flock rolls by and we see 20 birds; that's an interesting point to call out here.
D: We've got a number of different metrics that we can calculate. They need to be called derived metrics because we're already using aggregates from within the Cube data model, and we can do pre-aggregations as well based on that. So maybe we'll take a look in the playground first, and as we're building this data model, let me jump back to views real quick.
D: We don't necessarily need our end users to know that birds, sites, and species are separate tables, or even separate cubes; they're just interested in bird sightings as a concept, and we want to put all those dimensions and measures together in a way that lets them simply reference each of the fields that they want. In Cube we call that concept views, and so we're going to bring through a few fields from the birds cube (these five fields), and we're going to bring a couple of fields from sites: latitude and longitude.
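A view along those lines could be sketched in YAML like this; the member names are illustrative, and `join_path` walks the joins defined between the cubes.

```yaml
views:
  - name: bird_sightings
    cubes:
      - join_path: birds
        includes:
          - count
          - observation_date
          - total_birds_observed
      - join_path: birds.sites
        includes:
          - latitude
          - longitude
      - join_path: birds.species
        includes:
          - english_name
```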
D: With that we can get a visual on where these birds are being seen, and then we're going to get the actual English name instead of a species code from the species table. So on the playground side, we can run a query against this view and see what the most common birds are; in this case 2017 is the date range I've got filtered here.
D: How many bird sightings were there? You can see people were busy out there counting birds: 500,000 mourning doves seen in that year alone, and the list goes on. There are quite a few birds here, but we can use the playground to test our initial model and make sure that our computations make sense and that the numbers we're computing over in our data model are accurate.
D: Once we've got a couple of queries that we like here, one of the things that we're going to be curious about and very interested in is the performance aspect. So what's a good candidate for pre-aggregation, and why not go directly to the source every time? We start by identifying a slow query.
D: For example, in our query history here in Cube Cloud, we can see which queries were run, how long they took, and whether they hit the cache or hit the upstream source database. For those that did not hit the cache, for example this 1.6-second query: not bad, but it could be faster.
D: We can go into pre-aggregations and accelerate that, and here we're defining which measures and dimensions we want to include, as well as the time dimension and the granularity we want to include it at. That generates a roll-up for us, roll-up code in either YAML or JavaScript, the two formats that Cube supports, and then we can go ahead and add that directly to the data schema. We've already got one, or rather two, to play with: the pre-aggregations that we have were based on some of the dashboards that I put together for this. Unfortunately, there's a snafu with Tableau right now.
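The generated roll-up, added back into the cube definition, looks roughly like this in YAML; the member names and granularity are illustrative.

```yaml
    pre_aggregations:
      - name: sightings_by_species_and_month
        measures:
          - CUBE.count
        dimensions:
          - CUBE.species_code
        time_dimension: CUBE.observation_date
        granularity: month
```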
D: You can see our Tableau external status page is all red, so we're going to be missing that part of the demo today, but I can talk through it.
D: We'll give one more refresh here on the login page and see if we get any luck, but yeah: in Cube we've got the four pillars, right? We've got the data modeling, we've got access control, we've got caching and APIs, and all of our APIs share the same caching, the same data model, and the same access control.
D: That means we're able to make sure that all of the data that leaves Cube is the same, or at least very consistent. So Cube can be that single source of truth that the downstream applications and downstream users can rely on. As far as front ends go, unfortunately we're getting a timeout there.
D: One learning I had from this dataset was about the blue jay. I live over in Washington State and I've always called these birds blue jays, but apparently that's not a blue jay. I clicked on the map for blue jay and it was all East Coast, so I was like, what's going on here? So I got to learn that what I had been calling a blue jay was actually a Steller's jay, and I like those guys because they eat the stink bugs around the house.
D: So, big fan of the Steller's jay. Let's see. And then we've got another tool, Delphi. Delphi is an AI chat bot that only connects to semantic layers. It doesn't connect to any SQL data source; it requires a semantic layer for context, for things like descriptions of the fields and when to use which field, and that's all context that we can provide through descriptions in Cube.
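Those descriptions live right on the data model members; here is a small illustrative example of the kind of context an LLM client can read.

```yaml
    measures:
      - name: count
        type: count
        description: >
          Number of individual bird observations reported. Use this for
          "how many sightings" style questions.

    dimensions:
      - name: english_name
        sql: english_name
        type: string
        description: Common English species name, for example "Northern Cardinal".
```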
D: We give it an authorization token so that it's able to reach out and hit the API, and then I can ask questions in plain text: in 2017, which birds had more than 250,000 sightings? It then goes through the process, and it's very transparent here, which is very important for LLMs, especially at this stage; we want to know what process they're going through, and it even shows the queries that it's running. So here's the actual query that it's running in Cube.
D: We could verify those with our playground as well, to make sure that it's actually querying the right fields, and confirm that in 2017 the Dark-eyed Junco really did have 725,634 sightings.
D
275
or
725
634,
sightings,
perfect
and
then
the
percentage
change
side,
and
so
this
is
more
of
a
this.
This
question
isn't
straightforward.
It's
not
just
give
me
this
data
point.
It's
actually
think
about
it
and
do
a
do
a
change
calculation
here.
So
what
percent
of
change
in
sightings
were
the
north
for
the
Northern
Cardinal
were
between
2020
and
2021,
so
it
decided
that
it
had
to
pull
data
for
2020
as
well
as
for
2021.,
and
then
the
sightings
it
calculated
the.
D: It came out as 14% growth from 2020 to 2021. So yeah, there's a lot of business value delivered through the entire pipeline: it lands in Cube and is then extended to the rest of the business apps, and therefore the business users, through the data model, the consistency, and the tight integration of the pipeline. I think that's really one of the keys to how you can deliver business value in an impactful way.
B: Hey, thanks, Tony, and thanks, Pedram, for the earlier demo of Dagster and the talk about where the modern data stack and data engineering are. Just as a reminder, we have about 20 minutes left in the webinar. If you have questions, feel free to drop them in the Q&A and we will get to them, and we have a couple here already.
B: A couple of questions about LLMs, specifically about how semantic layers can help provide data to LLMs, and the benefits over doing native SQL querying in an LLM. I think Tony touched on some of that with our integration with Delphi and his commentary about Delphi's architectural decision not to connect to SQL databases, but instead to connect to semantic layers like Cube and a few others.
B: The reason they do that is to be able to provide the LLM with better context, to have a better shot at more correct and more insightful answers, versus having the LLM do raw SQL generation and attempt to guess at the way your organization would calculate a metric just by using, essentially, its corpus of the different ways metrics have been calculated before. So that's one way this works. And there's another question here.
B: Well, what I can say is, if you think about the cohort of other tools that do something similar to us, there are some of the more legacy incumbents like AtScale, and then there are obviously challengers, or other competitors to us, like dbt with their original semantic layer and whatever they'll potentially re-release as the new version of it at Coalesce, as we all expect. And then there are some more platform-aligned semantic layers, I would say.
B: Things like what Looker is trying to do, what Google is doing with Looker in trying to split out the semantic layer product, and then also some statements from companies like ThoughtSpot and Tableau at their shows about intentions to do this. So that's roughly the landscape.
B: Cube's position is that we are what we call a fully universal semantic layer, meaning that we aren't aligned to any particular data or cloud vendor, or to any BI tool or downstream tool. That's one distinct difference from the Looker, ThoughtSpot, and Tableau ways of approaching it: we're not connected to any vendor.
B: So we're essentially data Switzerland here, in some ways. As far as the things I think we do that some other products don't, I would point to the API connectivity options we provide, which are very easy to use: REST, GraphQL, and the SQL API.
B: Getting those up and running is just as easy as Tony showed in the demo: he creates a data model and then he's got all three API endpoints immediately. He's also got the same security and the same data models in all of them, and then he's got caching with pre-aggregations. So I don't know exactly how the feature comparison goes against every other semantic layer out there, but those are the big things that we do.
B: And they're what we find a lot of people use Cube for: the different components that add up to helping them build their solutions.
B: Okay, there's a question from Emily Jacobs from Gopuff. She asks: is this a demo of the orchestration API? Will we be seeing that functionality? We already have Dagster and we are trying Cube, looking forward to seeing how these two play together; apologies if I missed that part of the demo. Yeah, I think we mentioned this; I think Tony mentioned it, and Pedram also mentioned it.
B: There is a Dagster component that can invoke a pre-aggregation build from Cube. Essentially, the way we do this is through what we call our Orchestration API, and I think Tony mentioned that this doesn't have to be a time-based refresh; it can be an event-driven refresh. The way you would orchestrate that in your pipeline is, after you've completed your last step with the asset that you're working on, you just invoke that pre-aggregation refresh.
B: Pedram, I might ask you to talk about this; I actually don't know what you would call that component in Dagster, but I know it's something somebody has to import, right?
C: Yeah, it would be like a library; it's pretty light. I think it's just one line you would add, maybe two, and then you would just declare a dependency on your upstream assets. The refresh really becomes an asset itself, and that asset's materialization is what you'd have run after everything else is finished.
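As a sketch of that pattern, the final asset in the pipeline can simply call Cube's Orchestration API once its upstream assets have materialized. The endpoint path, payload, and asset names below are assumptions based on Cube's documentation, so check them against your Cube version, or use the packaged Dagster-Cube integration, which wraps this call for you.

```python
import requests
from dagster import asset

CUBE_API_URL = "https://your-deployment.cubecloud.dev/cubejs-api/v1"  # placeholder
CUBE_API_TOKEN = "YOUR_SIGNED_JWT"  # placeholder


# `deps` ties this to the upstream dbt asset, so the refresh only fires after the
# warehouse tables are rebuilt (older Dagster versions use `non_argument_deps`).
@asset(deps=["all_birds"])
def cube_pre_aggregation_refresh():
    # Assumed Orchestration API call: ask Cube to (re)build pre-aggregations.
    resp = requests.post(
        f"{CUBE_API_URL}/pre-aggregations/jobs",
        headers={"Authorization": CUBE_API_TOKEN},
        json={"action": "post", "selector": {"contexts": [{"securityContext": {}}]}},
    )
    resp.raise_for_status()
    return resp.json()
```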
B: Cool, yeah, pretty straightforward. So if you have any more questions about that, and if you're already trying Cube, feel free to reach out to whoever you're working with on our side and we'll help you out, and we'll also connect you to somebody at Dagster if you need help with Dagster. I've got another question here: do you know of folks using Cube with Streamlit, Snowflake data apps, and their LLM capabilities?
B: We do have a Streamlit integration, and Streamlit is not uncommon as a front end for apps built with Cube, typically in the embedded analytics or customer-facing use case. I don't have a specific example of what their LLM capabilities are when interacting with Cube. I will say the most common LLM bridge to Cube is going to be LangChain.
B: That's, I think, the most common tool people are using to build different AI chat bot type experiences.
A: Great, well, it looks like that's all the questions we have today. Thank you for spending your hour with us. If you want to come back and check out the recording again, it will be available on demand on our events page, and here are a few other resources you can check out.
A: They will give you all the details about what we talked about today. So thanks again, and hope to see you soon. Have a good day, everyone. Bye.