Nebula Graph Community Meeting, 30 Nov 2022

From YouTube: NebulaGraph community meeting [2022-11-30] (3.3.0 intro, graph modeling and transforming with dbt)

Description

Project Heartbeats - Wey GU
Introducing and Demo of NebulaGraph 3.3.0 - Wey GU
Demo: How I model data from multiple sources in tabular format into NebulaGraph with dbt and NebulaGraph importer. - Wey GU

A

Hello everyone, and welcome to another community meeting. Today I will cover two topics: first I will introduce the contents of the 3.3.0 release, and then give a small demo that uses dbt with NebulaGraph to transform tabular data from different sources into a graph database, into NebulaGraph.

A

So before the topics, I will first

A

give some updates on the project. We recently released the fix for 3.2 and the minor version 3.3, and we recently had a bunch of fixes and performance improvements. One thing to mention is that one of our community users, from a Chinese company called...

A

Hi Goran, welcome. Okay, so one of our new contributors is working on a bunch of different RFCs. One of them is user-defined functions, which they are using in their downstream fork. Another is about some specific filter pushdown, related to performance optimization, and they also introduced yet another sampling feature. There are more, actually; they are hacking NebulaGraph in many ways.

A

So that's quite interesting. Another thing is that we recently have a bunch of new users of NGBatis, which is a Java ORM in the fashion of MyBatis. There are some fixes related to it, because it heavily leverages parameterized queries, and through this project users hit some bugs in the core, which were all fixed. Some others: we have...

A

We have a Python ORM newly contributed by a user; I'm not sure if I already shared this last time. We also have a new project called nebula real-time exchange. It leverages the capability of Flink CDC to help sync data from MySQL in a streaming way. This one is new, and it comes from one of our new contributors in the community, from [inaudible], which is an organization.

A

The main contributor is [inaudible], who is also in our community. This project aims to make it easier for a Java application that was running on top of Neo4j to adopt NebulaGraph: you basically just need to make minimal changes to your code to make it run on NebulaGraph instead. It's quite a young project, so if you're interested, feel free to check it out. And as I mentioned, we have a new release of Studio

A

that I will demo later, and we have a new release of the operator. So I will share what's new in 3.3.0. Okay, I will use the slides here.

B

A

Okay, so this time we mostly focused on changes in stability and performance. We have a bunch of performance optimization PRs, and not too many newly introduced features. One of them is that we support the WHERE clause in GET SUBGRAPH. Later we will support...

A

We will finalize FIND PATH with the WHERE clause fully supported. Previously we had some limitations there, and for FIND PATH we didn't manage to include it in this release. But now GET SUBGRAPH supports the WHERE clause. Next, this is a small function with which we can

A

cast between the timestamp and datetime types. There is also the session pool, as I mentioned before; later I will give you an example. One of the major changes in this release is that we chose to switch off the tagless, or bare, vertex by default. If you were already relying on it, you can still use it, just switch it on again. And I will show you some new features in the tooling,

A

specifically the graph visualization tooling, later. In Studio, we have a what-you-see-is-what-you-get schema tool, and you can also view your schema in a visualized way, though on a best-effort basis; I will go into more detail later. We also have a quick starter. It's just a welcome page, and inside the starter you can easily load some quick-start datasets. Later I will show you. On Dashboard,

A

we have a bunch of new features introduced, but some of them rely on other components, so they are not released in this cycle. The one feature that was finalized is process metrics, which I will show you later; it's just some metrics on a single process. On the Explorer side we have some more features: edge aggregation and the query builder.

A

We support embedding Explorer in an iframe, and in our workflow we have a new algorithm, node2vec, introduced in NebulaGraph Analytics and Explorer; I will show you in the demo later. This time, as I mentioned, we have a bunch of performance improvements, most of them MATCH related. We already shared a performance report, and you can refer to it for more details; maybe it's not yet in English, but I will check.

A

For MATCH count, QPS improved by two to eight times, with only about one-fifth of the latency. For the three-hop query, QPS improved by 40 to 100 percent, with one-third of the latency. We tested everything on a three-node deployment with the LDBC SF100 dataset.

A

This is the specific comparison between 3.3.0 and the baseline, which is 3.2.0.

A

So we can see this is the MATCH count QPS, the MATCH one-hop count, the two-hop, and this is GO in three steps. As I mentioned before, we introduced yet another operator, called distinct list, which helps in this scenario.

A

This is the WHERE clause introduced in GET SUBGRAPH. Previously we didn't support this, and now you can add something like WHERE follow.degree is larger than 90 and a condition on the

A

player age. Be sure to check out the documentation on these reference expressions; there are certain limitations, so be sure to check out the documentation. Next, the session pool. We introduced the session pool for Python, Java, Go and C++ on the client side. So if you are running a NebulaGraph application as an online service, you don't have to maintain your own queue of sessions.

A

The idea of providing a session pool is that you don't have to redo the authentication for every query.

A

Instead, you keep the sessions in a queue, ensure they're alive, and hold them in memory, so your concurrent queries can use this context and you don't have to go through the authentication flow every time. It is optimal when your service sits on one database and sticks to one space, and you just keep running read and write queries. Previously, a lot of users

A

actually needed to implement something like this themselves, and we tried to guide them to do so, but there were sometimes issues and the results were not always optimal. It's a common need, so we decided to implement it on the SDK side. This is an example: you just provide the information of graphd, and keep in mind that the session is bound to the space.

A

So you should provide the space information together with the credentials. Afterwards you have a pool of sessions, and every time you want to run a query you can directly call execute. It will be a lot easier: a lot of session-related handling logic can be decoupled from your application logic from now on, if your model fits the session pool model. On the other hand, if from your application code you want to create schema, drop a space, or switch spaces,

A

in that case you should use the connection pool and create sessions from there instead of using the session pool. The session pool concept is bound to a specific space; you can switch, but it's not designed for that case. And these are the timestamp and datetime type-casting functions, so you can just use them. Previously this was not supported, and a lot of users were requesting this feature.
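The session pool idea described above can be sketched in plain Python. This is a toy model only, not the NebulaGraph client API: `ToySession`, `ToySessionPool`, the credentials and the space name are all hypothetical, and no network call is made. It shows authentication happening once per pooled session while many queries reuse those sessions.

```python
import queue


class ToySession:
    """Hypothetical stand-in for an authenticated session bound to one space."""

    def __init__(self, session_id, space):
        self.session_id = session_id
        self.space = space

    def execute(self, query):
        # A real client would send the query to graphd; here we only echo it.
        return f"[session {self.session_id} @ {self.space}] {query}"


class ToySessionPool:
    """Hypothetical pool: authenticate up front, then borrow and return sessions."""

    def __init__(self, user, password, space, size=2):
        self.auth_calls = 0
        self._idle = queue.Queue()
        for session_id in range(size):
            self._idle.put(self._authenticate(session_id, user, password, space))

    def _authenticate(self, session_id, user, password, space):
        # Authentication cost is paid once per session, not once per query.
        self.auth_calls += 1
        return ToySession(session_id, space)

    def execute(self, query):
        session = self._idle.get()  # borrow an idle session (blocks if none is free)
        try:
            return session.execute(query)
        finally:
            self._idle.put(session)  # always hand the session back


pool = ToySessionPool("root", "nebula", space="basketballplayer", size=2)
results = [pool.execute(f"RETURN {i}") for i in range(5)]
```

Five queries run here with only two authentications. Because each pooled session is tied to one space, work such as creating or switching spaces fits the connection pool better, as the speaker notes.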

A

And finally, this is tagless. This is quite a big change, but frankly, from my observation, most users don't need tagless vertices. When we introduced this concept, a vertex was no longer required to have at least one tag bound, and that meant that if you deleted a tag, the vertices that had only that one tag were not deleted.

A

They became bare vertices, which in most cases is not what users expect and brings a lot more complexity. So we listened to the users' voice and changed it: now, by default, tagless is turned off. That means it works as before, as in 1.0 and 2.0:

A

if you drop a tag, the vertices with only that one tag will be deleted. These are the corresponding configuration flags in graph and storage.

A

Okay, then finally I want to show you something about Studio and Explorer. This is the tool called the schema sketch tool. It's introduced in both Studio and Explorer, and I want to demo it live. Let me show you here.

A

So this is the new Studio, and with this schema drafting you can just drag and drop things here and then apply the schema to any space. It's kind of fancy but also useful for users. On the other hand, you can also view the schema.

A

As you may have noticed, for a long time we didn't bring this quite straightforward feature. The reason is that in NebulaGraph's graph model there is no constraint, for a given edge type, on how the starting node and the ending node are tagged; we don't specify the tag of the starting node or the ending node. So in that case, it isn't easy to

A

draw this picture. But in most cases, people actually keep such logical constraints in their mind: when they define the serve edge type, they won't put the team at the starting node. So we assume this.

A

We assume this is correct, which frankly isn't always the case, but we do our best: we sample a few edges from the data and just assume that every endpoint of an edge in the graph has only one tag and that they are all consistent. So we try our best to draw this schema, and I consider this actually quite useful for most users.
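The best-effort idea described above (sample a few edges, look at the tags on their endpoints, and assume the result holds for the whole edge type) can be sketched as follows. This is an illustration only, not Studio's actual implementation; the function name `infer_schema` and the sample data are hypothetical.

```python
from collections import Counter

# Hypothetical sample: vertex ID -> tags, plus a few sampled edges.
vertex_tags = {
    "player100": ["player"],
    "player101": ["player"],
    "team204": ["team"],
}
sampled_edges = [
    ("player100", "serve", "team204"),
    ("player101", "serve", "team204"),
    ("player100", "follow", "player101"),
]


def infer_schema(edges, tags):
    """Guess (source tag, destination tag) per edge type from sampled edges."""
    counts = Counter()
    for src, edge_type, dst in edges:
        # Assumption from the talk: each endpoint carries exactly one tag.
        counts[(edge_type, tags[src][0], tags[dst][0])] += 1

    best = {}
    for (edge_type, src_tag, dst_tag), seen in counts.items():
        # Keep the most frequently observed tag pair for each edge type.
        if edge_type not in best or seen > best[edge_type][2]:
            best[edge_type] = (src_tag, dst_tag, seen)
    return {etype: (s, d) for etype, (s, d, _) in best.items()}


schema_view = infer_schema(sampled_edges, vertex_tags)
```

With this sample, `serve` is drawn from `player` to `team` and `follow` from `player` to `player`. If the data breaks the one-tag assumption, the drawn schema can be misleading, which matches the speaker's caveat.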

A

Okay, then let's see what's next. Okay, that was the schema view, as I showed. Another thing is the welcome page. In the newer version we put a bunch of useful information on the welcome page, which is a new feature, and we also introduced the capability of loading some starter datasets.

A

This is also open source, so anyone can contribute their datasets here in our Studio repository. For now we have only two datasets here, but personally I'm planning to put some more datasets there to help new users get familiar with different typical use cases. Okay, so this is the data starter.

A

What's left? Okay, I will show you Dashboard. In the newly released Dashboard, we added some more metrics, some new statistical perspectives. For example, this is the context switches total for one process, and this is the CPU seconds.

A

I'm not sure this one is new: the number of aggregate executors. I guess this is new. So be sure to check out the documentation on the metrics and the Dashboard documentation, and set up your own to see what's newly added. This is the page you get to for a service like graph or meta. And for Explorer: Explorer is the Enterprise-only tool. One of the new things is the edge aggregation. So,

A

between two nodes in NebulaGraph, you can have different ranks, meaning different instances of the relationship between a pair of nodes. Now you can aggregate them: these are two edges, and you can merge them into one, which is sometimes helpful when you're looking for insights in Explorer. Another one is the query builder. I'm sure you have already seen... oops.

A

You have already seen things around the... oops.

A

Around the query builder, but I will show you really quickly.

A

So with the query builder, oh, I need to specify a space, and then you can drag and drop to write queries, something like this, and you can see.

A

Okay, this could be a query. Oops, this one relies on an index, so it cannot be queried. Okay, this is the query composed by drag and drop, which is kind of handy. The final one is the iframe: you can embed the whole tool in your internal tooling web page; some of our users required this. And finally, this is node2vec. I

A

think you may be interested in this, so I will show you in the workflow. Workflow is the tool that you can...

A

Yes, there can be; I will definitely talk about this question later. So I will demo the newly introduced algorithm, node2vec. For example, I want to

A

run it on the follow edge and use the degree as the weight for node2vec. So I will save it and try to run it. Oops.

A

Oh okay, I'm using the wrong environment. I'm sorry for that.

A

Sorry. This one is configured properly with the backend Analytics. This is a newly introduced one, so I'm trying to get data from this space, and I'm using, for example, the edge type follow as the relationship on which I will run this algorithm.

A

So I'm trying to save it and run. With this workflow you can pipeline different queries and different algorithms in a DAG; for example, you can feed the output of one into another algorithm. That's why we have this interface. The source for this algorithm is a space, so it will scan the data directly, bypassing graphd; it will scan data from the storage service.

A

On the other hand, you can also run it on top of queries. I didn't do that here, but it's possible, and underneath it calls NebulaGraph Analytics. So I'm going to... oh, something is wrong. [inaudible]

A

Okay, maybe it's because I'm using a test environment and someone else is interfering with me on this algorithm. This is embarrassing. Okay, so this is the newly introduced algorithm.

A

Okay, I think that's all for this topic on 3.3.0. Sorry for the failed demo this time. I will answer the questions first. The first one, on the edge aggregation: can there be more than one edge between two nodes? Yes, this is possible. It's because

A

it's because NebulaGraph has the...

A

In the data model of NebulaGraph, an edge is determined by a four-tuple: the edge type, the source vertex, the destination vertex, and a rank field. If you are not aware of rank, it's because it can be omitted. If you don't provide a value for rank, it will be left as zero.

A

But if you want to, for example, model the case where one person purchases a couple of different times in...

A

Yes, exactly. For example, you can have multiple instances of a transaction between one person and one shop. That's one possibility: you can have multiple edges between two vertices. But as Goran mentioned, you can also just have multiple edges of different types. That's the more straightforward case.

A

I mean for the same edge type. You can do that: you just introduce the rank. For example, you can put your timestamp in as a number, as an int, and use this field as the rank. That's something that makes NebulaGraph unique and different from other graph databases as well.

A

So about node2vec: I failed to run that demo, and it was a pretty small dataset, but you can run pretty large datasets because underneath it relies on NebulaGraph Analytics. This is proprietary: we built it on top of another open source project, but we release it as closed source, and it excels in resource utilization.

A

If it's inserted for the same one, it'll be overwritten.

B

A

As Goran mentioned, back to this topic: if you insert an edge for a pair of source and target vertices with a certain type and you don't specify the rank, the rank will be zero. If you insert it a second time, it will overwrite your first one. That's basically because we don't have a so-called ID for an edge; we assemble this ID from the four-tuple.

A

So if you insert a second time, the ID, the four-tuple, is exactly the same, so the edge will be overwritten instead of a new instance being created.
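The overwrite behavior described above follows from the four-tuple acting as the edge's identity. A minimal sketch, using a plain Python dictionary as a stand-in for edge storage; the helper `insert_edge` and all values are hypothetical, and nothing here talks to NebulaGraph.

```python
# Toy edge store keyed by the four-tuple (source, edge type, rank, destination).
edges = {}


def insert_edge(src, edge_type, dst, props, rank=0):
    """Insert or overwrite an edge; rank defaults to 0 when omitted."""
    key = (src, edge_type, rank, dst)
    edges[key] = props  # same key means same edge, so properties are replaced


insert_edge("alice", "purchased", "shop1", {"amount": 10})
insert_edge("alice", "purchased", "shop1", {"amount": 25})          # same four-tuple: overwrites
insert_edge("alice", "purchased", "shop1", {"amount": 7}, rank=1)   # new rank: second instance
```

After these three inserts the store holds two edges: rank 0 with the second amount, and rank 1 as a separate instance between the same pair of vertices.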

A

A

And of course, if you are interested in node2vec, you can apply for another round of trial to see that feature.

B

In the Studio or Explorer, I am seeing a different view. Is it different from the community version?

B

You go back to your Explorer.

A

Do you mean this?

B

Yeah, so what I'm seeing is a different experience. So it is different from the community

A

version? Yes, exactly. In the community version we don't have this workflow thing. It's a DAG manager: you can create your pipelines with all the algorithms in the workflow, and you can actually also run algorithms in Explorer.

A

Here there are some algorithm tools that are not in the open source version. If you check the documentation, you will see graph computation. Where is graph computation? Oh yeah, so you will see there is NebulaGraph Analytics. This one is closed source; it's proprietary, separate from the other project in our community. It's an Enterprise-only offering, and it has better performance and resource utilization compared to the open source one, but open source...

B

Can you give me an update on how I can buy an Enterprise license?

A

As we talked, we can discuss it in the email thread that

B

we have. Okay, yes.

A

I'm not quite familiar with the business part, but they can work out a price with you. I'm not sure if you can purchase the analytics only, but you can talk about that. But if you...

B

Know, let's take it back.

A

Yes, thank you. And yes, as Goran added, if you have some properties there: exactly, a second insert with exactly the same four-tuple will rewrite everything. Sometimes that's unexpected, so you should make sure you are writing data in the right way in this respect.

B

Can you talk about backup and recovery in 3.3?

A

We didn't have many updates in this cycle yet. We were trying to bring some, but we didn't make it in this cycle. We will have another cycle of the Enterprise-only version later, and there will be some more things included in backup and restore.

B

Okay, and any security improvements in 3.3?

A

Security Improvement.

A

Do you have any specific scenarios? As I recall, there are no PRs related to security in this cycle either.

B

A

B

We recently reported a lot of vulnerabilities in...

A

Okay, the CVEs. As I recall, that was after the release, so it is coming in the next release. Yes, exactly. Oh, thank you! Sorry, I even forgot that. Thank you for reporting those issues. Okay, so I will go to the next topic.

A

So this is the other topic I prepared. It's a small demo on how we can leverage dbt to do some graph modeling and the transformations. It may not be that helpful to you two, because it is more for new users: how to map the data in their mind from a tabular model to a graph model. I made it an end-to-end example to help new users play with it. So, the background of this topic.

A

This is actually a sub-project from when I was writing about another topic: how you can leverage graph tech and a graph DB to create a toy-level recommendation system. I reviewed all the methods we can use, together with GNN, and I needed a dataset in graph form, so I had to create my own. I decided to leverage dbt, which is quite fancy in this domain, so why not make it another small demo?

A

So the task here is that we need to analyze what data we have. We can think of it as a job where, in our infrastructure or in our service, we have some data placed in multiple places, and we want to transform it to leverage the graph. How can we do that?

A

We need to define how we want to place the different information from different sources into a graph, into a knowledge graph, and then we need to define the schema of the knowledge graph. This is required because NebulaGraph is schema-ful; that's quite different from other graph databases, which are schemaless. And then, finally, we need to extract.

A

We define how we want to extract the data and how to do the engineering: we want to extract it from different sources and finally ingest it into NebulaGraph. In the first step we need to see what we have. As for this example, I will release the article on the recommendation system itself later,

A

so those who are interested will get more ideas there. The conclusion was that I need to extract graph information from two data sources. One is called OMDB. It's a movie dataset, a public, kind of open source project. It has all the movie names, the classification of the movies, the cast and crew of the different movies, and some other information like the covers, et cetera. The other dataset is MovieLens.

A

It is a dataset that helps people study recommendation: how users selected, watched and rated different movies. It's a dataset from the real world, with the sensitive information masked. I combined the two datasets on purpose, to demonstrate that in the real world we combine data sources in different formats and on different infrastructure, put them together and make the correlations.

A

So this is a simple example, but kind of a simulation of the real world. Finally, I decided to use four kinds of edges. They are used for content-based filtering, which is a term from recommendation systems, and for user-based collaborative filtering. These are two naive approaches to a recommendation system, and together they require our knowledge graph of users and movies to have these four types of edges.

A

This was actually drawn with the new feature in Studio. You can draw it like this and afterwards just click, and it will generate the DDL of the schema, which you can load into a graph space. Quite handy. And what's next? After we define the schema,

A

we need to transform the data from the different sources into our graph. Let's look into the first dataset, OMDB. In OMDB there are a couple of different tables. Think of them as just tables, though they are formatted as CSV files. There's the all-movies table, with the columns movie ID, name and language, and, oh, this is the official translation, which we don't care about. So basically, you have a unique ID mapped to the names in different languages.

A

There's an all-casts table mapping between movies and persons. The job ID is interesting: the values are just numbers, but if you look into the job-names table, they map to different job roles. For example, in English, job ID 1 means the writing department, and so on, and you are a producer of a movie if your job ID is 2. And finally, we have the all-people table, or CSV file, mapping the person ID, just like here, to the person's name.

A

This is typically how data is persisted in a relational database: it doesn't duplicate information, and that's why we sometimes cannot use it directly to reflect the correlations; there is a cost if you don't duplicate things. Okay, so I showed you four tables. With the information in those four tables you can have one edge type, which is person directed... I'm sorry, the direction here isn't right. So movie directed_by, in this direction:

A

a movie is directed by a person, and you can also have a movie acted by a person, which is similar. So take this directed_by edge: it starts from a movie and goes to a person. You find the starting node from the movie ID, and for the job ID you should filter

A

the all-casts table where the job ID is the number that maps to director in the job names in English. So you can see it's actually a join of different tables. Similarly, you can have the movie vertex.

A

You have the ID in the all-casts table, and the names are in all-movies; maybe you want to filter the language to English. Similarly, you can have the person vertex. That's only for directed_by or acted_by; there is a different one, movie categorized_by, which is similar. So basically I drew the tables that are required to construct this property graph information: we need the tables in this way, and we have those three types of edges.
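The join described above, from the cast table through the job names to the movies, can be sketched with an in-memory SQLite database. The table and column names and every row below are assumptions made up for illustration; they are not taken from the real OMDB files or from the speaker's dbt models.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical, simplified versions of the OMDB tables.
    CREATE TABLE all_movies (movie_id INTEGER, language TEXT, name TEXT);
    CREATE TABLE all_casts  (movie_id INTEGER, person_id INTEGER, job_id INTEGER);
    CREATE TABLE job_names  (job_id INTEGER, language TEXT, name TEXT);

    INSERT INTO all_movies VALUES (10, 'en', 'Toy Movie'), (10, 'de', 'Spielzeugfilm');
    INSERT INTO all_casts  VALUES (10, 1, 3), (10, 2, 2);
    INSERT INTO job_names  VALUES (2, 'en', 'Producer'), (3, 'en', 'Director');
    """
)

# One row per directed_by edge: movie (source) -> person (destination).
directed_by = conn.execute(
    """
    SELECT m.movie_id, m.name, c.person_id
    FROM all_casts AS c
    JOIN job_names  AS j ON j.job_id = c.job_id AND j.language = 'en'
    JOIN all_movies AS m ON m.movie_id = c.movie_id AND m.language = 'en'
    WHERE j.name = 'Director'
    """
).fetchall()
```

Each result row carries what one edge needs: the movie ID as the source and the person ID as the destination. Changing the job filter would give an acted_by style edge from the same tables.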

A

Now we will leverage another data source to help generate the information about a user watching or rating a movie, and that comes from MovieLens.

A

In MovieLens we have ratings.csv. It's a table mapping a user, with a unique ID, to a movie ID with the rating. So this is an edge type, and we can put the rating as a property of this edge and, of course, the timestamp too if we want.

A

So what's the problem here? The problem is that this movie ID is mapped to a specific movie title in movies.csv in this dataset, but we need to correlate it with the movie knowledge graph from the other data source. They have to be correlated, because you can't directly use this movie ID; it has to be mapped in some way. Naturally, you just map them by title.

A

The problem is that in this dataset every title has a suffix with the year of the movie, and the titles are all in English. So if we want to link the two datasets with a join, we need to remove this suffix and search in OMDB, only among the English titles. We need to do that to make everything connected. So previously, with OMDB,

A

we had this mapping relationship, and together with the ratings and the movies, we are doing this: basically we are adding the user from here to this graph. But when it comes to the user watched or rated movie edge, the movie ID should come from all-movies in OMDB. So we translate the MovieLens movie ID into the title, apply some transformations, and search for it among the English names in all-movies, so that we get the movie ID from all-movies in OMDB.

A

That's what we want. It's a simple case, but quite typical in real-world scenarios. After the transformation, we will have some records. They are still tabular, but they will be ingested into the graph model line by line. In every line you will have the user ID from MovieLens, the rating from MovieLens, the title, which should exist in both MovieLens and OMDB, and the movie ID, which we replace with the OMDB movie ID.
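The title-based mapping described above can be sketched in plain Python. The regular expression, the helper `normalize_title` and all IDs, titles and ratings are invented for illustration; the speaker's actual transformation is done in dbt and may differ.

```python
import re

# Matches a trailing year such as " (1995)" at the end of a MovieLens title.
YEAR_SUFFIX = re.compile(r"\s*\(\d{4}\)\s*$")


def normalize_title(title):
    """Drop the year suffix and normalize case so titles can be compared."""
    return YEAR_SUFFIX.sub("", title).strip().lower()


# Hypothetical toy data: (omdb_movie_id, English name).
omdb_movies = [(862, "Toy Story"), (949, "Heat")]
# MovieLens movie ID -> title with year suffix.
movielens_movies = {1: "Toy Story (1995)", 6: "Heat (1995)", 99: "Unknown Film (2001)"}
# (user_id, movielens_movie_id, rating, timestamp)
ratings = [(1, 1, 4.0, 964982703), (1, 6, 4.0, 964982224), (2, 99, 3.0, 964983000)]

omdb_by_title = {normalize_title(name): movie_id for movie_id, name in omdb_movies}

watched = []
for user_id, ml_movie_id, rating, _timestamp in ratings:
    title = normalize_title(movielens_movies[ml_movie_id])
    omdb_id = omdb_by_title.get(title)
    if omdb_id is not None:  # keep only movies that exist in both datasets
        watched.append((user_id, rating, title, omdb_id))
```

Each tuple in `watched` has the shape the speaker lists: user ID, rating, title, and the OMDB movie ID in place of the MovieLens one. Ratings for titles with no OMDB match are dropped in this sketch.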

A

And finally, we have this user watched movie edge with the rating as a property, and it should look something like this: the user watched the movie, and everything is connected with the tabular data by these colored lines. We also have directed, acted, and categorized... with the category.

A

I, don't know how to.

B

pronounce it. Can you hear me? Yes. So I wanted to ask you a question: how do you model a user watching a movie multiple times?

A

Okay, this is a good question. In my current model it's not doable, but if we want to do that, we need to bring the timestamp into the rank of this type of edge. As I showed you... oops, sorry.

A

A

Oops I think we have a I have a better uh related.

B

So essentially, what you're saying is that when the source and target of the edges are the same, we use rank as a...

A

Exactly. You can see this @rank in the INSERT clause, in the insert query. It is optional: if you omit this field, it will be persisted as zero, so rank is zero. If you don't care about this capability, it's not a curse, it's just a gift; you can ignore it. But when you need it, you basically just add, for example, @2.
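To make the optional @rank concrete, here is a small sketch that only builds nGQL `INSERT EDGE` statement strings; it does not connect to NebulaGraph. The edge type `watched`, its property `rate`, the vertex IDs and the helper name are hypothetical, and using the rating timestamp as the rank is the approach the speaker suggests for repeated watches.

```python
def insert_watched_statement(user_id, movie_id, rating, timestamp=None):
    """Build an INSERT EDGE statement, adding @rank only when a timestamp is given."""
    # Omitting @rank leaves the rank at its default of 0.
    rank = f"@{int(timestamp)}" if timestamp is not None else ""
    return (
        f"INSERT EDGE watched(rate) VALUES "
        f'"{user_id}"->"{movie_id}"{rank}:({rating});'
    )


first_watch = insert_watched_statement("u_1", "m_862", 4.0, timestamp=964982703)
second_watch = insert_watched_statement("u_1", "m_862", 5.0, timestamp=975000000)
no_rank = insert_watched_statement("u_1", "m_862", 4.0)
```

The first two statements differ in rank, so they would be stored as two separate edges between the same user and movie, while repeating the third would overwrite the rank 0 edge.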

B

Is specifying rank a Cypher feature?

A

It's a NebulaGraph-native concept, actually. In the Cypher world there is no such thing. It's something unique to NebulaGraph, because of the background of the team, which comes from Ant Financial; they used to work on things around Alipay. In their use cases they had this requirement, so they brought it into the model.

B

We also observed that when the dataset is huge, meaning billions of nodes and billions of edges, [Cypher queries] are

B

not performing as well compared to the native features. Have you observed the same thing?

A

Exactly, and we are continually optimizing that. For example, in this cycle we did a bunch of optimizations there. Writing a Cypher pattern is easier for users, but a lot of constraints are not actually specified: you don't tell precisely how the query should be executed, as you do in native nGQL. So we need to...

A

uh We are trying to make it uh closer in the performance comparing with the native queries, because in Native queries you will actually do every step precisely just like that, but the the drawback is a if you write it in the in this way. It's not that flexible to write some multiple Direction patterns, so we have to do trade-offs uh when we, but generally, if possible.

A

If you can write a query in Native, you should do that instead of the cipher, but in long term the cipher performance will be close closer to the ngql native query, language. uh Good question. Thank.

A

Yeah, thank you. So I will continue. We have the modeling ready, so next we want to do the transformation. But how can we do that?

A

Of course, you can do it with any handy tool you are already using, but this time I just want to try dbt, because it's really interesting to me. So this is just an opinionated way to do it. I use dbt, but you don't have to; you can use anything that fits your requirements. So basically, I'm using dbt to do the transformation in this step.

A

Just like I mentioned in that diagram, I will then output the data into CSV and finally use the Importer to ingest it into NebulaGraph. dbt is an open-source tool that helps you do only the transformation part of the ETL pipeline, and it assumes that you are putting your data in one data warehouse. Of course, you can extract data from different sources, but in the end you will do the transformation in one single data warehouse.

A

So in this demo, I'm using an on-premises warehouse, just one Postgres in a single Docker container, but you can use anything else; dbt supports a lot of different setups and infrastructures, whether open-source or cloud-based. For example, you can do it on GCP BigQuery, etc.

A

Everything is actually already written up in my blog post, so you can find more details there, and all the code is also open-sourced, but I want to quickly walk you through a demo so that you get a better idea in a short time. So first I set up everything on my server. dbt is actually a package written in Python, so I create a virtual environment and install it with pip, the package manager for Python. The package I install is dbt-postgres.

A

So this is a plugin: when dbt is doing the transform, it leverages Postgres as the data warehouse. It also installs the main package as a dependency, so this one package is enough. Then I initialize a dbt project. dbt is a whole mental model of how you want to do your transformation in the modern engineering way, and all of the metadata you use to describe the transformation is file-based.

A

So you can do GitOps for this DataOps thing: you can put everything, collaboratively, in a central Git repository.

A

So when I initialize this project, I get a couple of dummy files in this folder, as you can see. dbt has its own infrastructure to help you present the models as web documentation, so all that information can be browsed by others in quite a modern and fancy way. Of course, you can also describe things in the README.md. The meta, or index, or entry point of the project is the project YAML file.

A

So you can describe things there, and everything else is described and referenced from this file. For example, the model, which is the core concept of a dbt transformation. A model refers to how you want to apply certain transforms to certain data sources and output the result somewhere else.

A

The specific transformation rules are defined by SQL files. Remember when I was saying how we want to map the data from those tabular sources into our graph model: the records, or files, to be ingested into NebulaGraph are actually still tabular, right?

A

How they're mapped can all be described in SQL, right? They are actually just joins and some function manipulations, so later I will show you how. The YAML files under this models folder describe the columns, and the SQL files describe how you want to do the transforms. There are tests and other things too; if you're interested, just try it. I'm not going to delve into all the details. So then we have this dbt project.

A

Then we set up the environment for our experiment. If you don't already have an existing data warehouse, or you just don't want to make things dirty, you can create an ad hoc data warehouse. So I add this --rm (hyphen hyphen rm) flag. This means that when you stop this instance, everything will be wiped, and I didn't map any data volumes into it, so nothing will be left.

A

So if you want a long-running environment, just remove this --rm, so you can start it again and have everything resumed. So I have a Postgres in a single container, and then I download the files from the different sources and put them in this folder. The folder name is something I just defined.

A

It's not an existing name in dbt; I just created it, and later you can see how it's used. I downloaded the files from the official websites of the OMDB and MovieLens datasets, and I did some data wrangling to preprocess some specifics, like escaping characters, while some files were just copied as is. The destination of the files is under seeds. Seeds are something already defined in dbt.

A

If your data sources are CSV, you can put them under the seeds directory. Seed is a concept in dbt that helps you load CSV data sources: afterwards you just run dbt seed, and it will load everything there. I actually have an environment here. This afternoon I put things under seeds and then ran dbt seed.

A

It takes some time to insert everything, because I'm using a single-machine Postgres, and afterwards you can see that all of the raw data is already in our warehouse. It's just a Postgres, and you can see all the tables are already there, for example all_casts. So you can see, you don't have to define any schema, and everything is just loaded, which is quite sweet, right?

A

Then we come to the transformation part. I create this transformation model called movie recommendation, and I will show you the example of the user-watched-movie relationship. This is the most complex one, actually.

A

So these are the models; each one is mapped to one relationship, and this one is user watched movie. What it's doing is, as you remember from the diagram, just selecting and joining things. I select the user ID and movie ID from the ratings CSV. This is the MovieLens ratings CSV.

A

I also select the title from the MovieLens movies, right? The relationship comes from the ratings CSV, and the title comes from movies, via the movie ID, and I remove the year suffix in the brackets at the end, right? Then I have user-watched-movie with the movie name, and I need to translate this movie name into the OMDB movie ID. So I continue with another join, against the all_movies CSV.

A

This comes from OMDB, right? I'm doing this join on the title, against the all_movies name, with a LIKE clause, and I filter for the names in English. So then finally I have user watched movies. If you look here, I'm using this client, so it is mapped to user watched movies. So yeah, this is the final one, and you can see this is the result. And I didn't write this properly in one go.
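The two joins described above can be sketched end to end in Python. This is a minimal sketch, not the SQL from the demo: it uses the standard-library SQLite engine as a stand-in for Postgres, and every table name, column name, ID prefix, and sample row below is an assumption made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movielens_ratings (user_id INTEGER, movie_id INTEGER, rating REAL);
CREATE TABLE movielens_movies  (movie_id INTEGER, title TEXT);
CREATE TABLE all_movies        (id INTEGER, name TEXT, language_iso_639_1 TEXT);
INSERT INTO movielens_ratings VALUES (1, 10, 5.0), (1, 11, 4.0), (2, 10, 3.5);
INSERT INTO movielens_movies  VALUES (10, 'Toy Story (1995)'), (11, 'Heat (1995)');
INSERT INTO all_movies        VALUES (862, 'Toy Story', 'en'), (949, 'Heat', 'en'),
                                     (9999, 'Toy Story', 'de');
""")

QUERY = """
WITH user_watched AS (
    SELECT r.user_id,
           r.rating,
           -- drop the trailing ' (YYYY)' (7 characters) from the MovieLens title
           substr(m.title, 1, length(m.title) - 7) AS movie_name
    FROM movielens_ratings AS r
    JOIN movielens_movies  AS m ON m.movie_id = r.movie_id
)
SELECT 'u_' || w.user_id AS user_id,
       w.rating,
       'm_' || o.id      AS movie_id
FROM user_watched AS w
JOIN all_movies   AS o
  ON o.name LIKE w.movie_name          -- match MovieLens title to OMDB name
 AND o.language_iso_639_1 = 'en'       -- keep only the English names
ORDER BY w.user_id, o.id
"""

# Each row is already in edge shape: source vertex, edge property, target vertex.
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Matching on titles is inherently fuzzy, so a real pipeline would want a check for duplicate or missing matches before exporting the result.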

A

In practice, I'm using a VS Code extension called SQLTools, which can connect to Postgres and other popular databases.

A

After you've connected, and for sure you can disconnect and connect again, you can, for example... I will show you another one, movies with category. You are actually writing the queries here, and you can run them at any time. For example, I can add a LIMIT 8 and click Run on Active Connection.

A

It will run this query for you, so then you know your query is valid, and you can continue writing another one. And after you finish them, you can call them by... Oh, sorry, I didn't mention: there is a schema YAML inside this models folder, right? For example, for this user watched movies, you describe the metadata of the transformation rules. There are three columns, and there are even test fields. These tests are related to another capability of dbt.

A

You can run dbt test to enforce different kinds of constraints, so you can ensure the quality of your data in a controlled way. If you are interested, just dive into the documentation. With this model defined, you can then trigger a transformation with dbt. For example, user watched movies is reflected here, in the SQL part, right? As I mentioned, you can then run it with the name of this SQL file to trigger the run.

A

Afterwards you get everything ready and run the rest accordingly. I can run one of them on the fly for you. For example, we can see we already have this with-category table, and we can drop this table.

A

And now you can see this table is gone, right? Now we can run dbt run.

A

dbt will do the transformation for us in the backend, through this plugin, in Postgres, and then you can see...

A

Yes, it's back again, and the data can be previewed as well. You can see the source ID is the movie ID and the destination ID is the category ID. I also have this one; it shows which category ID corresponds to a certain category, like musical, right? So let's continue. Up to now, you have every transformation ready, and the transformed data is in tabular format in your Postgres.

A

Then you can turn it into NebulaGraph DML queries, right? We can leverage different tools. With NebulaGraph Exchange, for example, you can connect Postgres to NebulaGraph directly. There are a bunch of different ways, but in this example I keep things simple, because we don't have large-scale data: I just export it to CSV and then finally use NebulaGraph Importer. Importer is the most lightweight ingestion tool; it's just a single-file binary written in Golang.

A

With the configuration file, you just let it know how you want to map your CSV files to your graph's vertex types, which are tags, or edge types. For example, people.csv I'm going to map to a vertex tag. We can see what we got... sorry, I have to re-copy the files.

A

We move everything to this NebulaGraph folder. So you can see here is the transformed data, for example people. Here I'm using a header format, so that you don't have to specify in the configuration which column is mapped to which vertex property.

A

So here I'm describing the header as person.name in this column, person.birthday in this column, and the VID, as a string, in this column. So NebulaGraph Importer will know everything directly. With this, you can just call NebulaGraph Importer.
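The header convention described above can be sketched with Python's standard csv module. This is only an illustration under assumptions: the exact header tokens (in particular `:VID(string)`) follow my understanding of the Importer's header convention and should be checked against the Importer documentation for the version in use, and the helper name and sample rows are made up.

```python
import csv
import io


def people_csv(rows) -> str:
    """Render vertex rows as CSV text with a self-describing header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # Assumed header tokens: two properties of the `person` tag,
    # then the vertex ID column typed as a string.
    writer.writerow(["person.name", "person.birthday", ":VID(string)"])
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample rows: (name, birthday, vertex ID).
text = people_csv([
    ("Jane Doe", "1970-01-01", "p_1"),
    ("John Roe", "1980-02-02", "p_2"),
])
print(text)
```

Because the header itself names the tag and properties, the Importer configuration only needs to point at the file rather than list every column mapping.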

A

This is a containerized Importer, and by consuming this Importer configuration file in YAML, it will ingest everything into NebulaGraph. And yeah, this is an example: I want to know why we would recommend this movie to user 124, so we just do a FIND PATH. You will see these are the paths with the most reasoning possibilities, because it could be that most of the cast and crew are from the user's favorites.

A

You know, the user rated these with five stars, or, I don't know, four or 4.5 for some of them, and they also share a lot of different connections with this movie. So, of course, it's another Star Wars, and from this FIND PATH we will know the reason why we want to recommend him or her this new Star Wars.
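The path lookup mentioned above can be sketched as a Python helper that builds the nGQL text. The vertex IDs, the NOLOOP variant, and the four-step limit are illustrative assumptions, not the values used in the demo.

```python
def find_path_query(user_vid: str, movie_vid: str, max_steps: int = 4) -> str:
    """Build an nGQL FIND PATH statement between a user and a movie vertex."""
    # OVER * BIDIRECT walks every edge type in both directions, so paths can
    # go user -> watched movie -> shared cast or crew -> recommended movie.
    return (
        f'FIND NOLOOP PATH FROM "{user_vid}" TO "{movie_vid}" '
        f"OVER * BIDIRECT UPTO {max_steps} STEPS YIELD path AS p;"
    )


# Hypothetical IDs for a user and the movie being recommended.
print(find_path_query("u_1", "m_2"))
```

Each returned path is one candidate explanation for the recommendation, so a wider step limit gives more explanations at the cost of a more expensive query.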

A

So it's just a simple, naive example, but afterwards you will have a whole user and movie knowledge graph from this process. So everything was done in this fashion: with dbt seed, we load the CSV files from different sources into our data warehouse; with dbt, we define our mapping rules in SQL, we test them, and it works; then we use dbt run to run the transformation, and afterwards it generates new tables hosting the data you want to ingest into NebulaGraph.

A

Finally, you output it as CSV and ingest it into NebulaGraph.

A

And I think that's all for today. So thank you, everyone, and feel free to let us know if you have any questions about today's topic, and check out our Slack channel and GitHub repositories. If you want to discuss anything about graphs and NebulaGraph, go to our discussion forums on GitHub or our Slack channel. So thank you.

A

Let's call this again. Bye.