From YouTube: DataOps TUTORIAL: Data engineering / DAGSTER end-to-end example with BigQuery, dbt, Spark, Jupyter
Description
#dataOps #dataengineering #dagster #dbt #bigQuery #SPARK
In this video we explore the principles for designing a modern data engineering platform, and we build a data processing pipeline with dagster (https://dagster.io/), a data pipeline orchestrator, using:
* BigQuery as the DWH
* dbt as the SQL data transformation tool
* Dataproc/PySpark to process data at scale with Spark
* A Jupyter notebook to explore and visualize the result
Code from the video: https://github.com/velascoluis/dagster_gcp
Follow me on:
👉twitter: @luisvelasco
👉medium: https://medium.com/@velascoluis
👉github: https://github.com/velascoluis
As the intro said, this past year we have been focusing on analyzing everything that happens in the final part of a much larger pipeline. We have seen different families of models and how to process audio, text and video; and to run all of these models you need data, and you need to know how to generate, treat and process it.
That is, broadly, the field of data engineering, and with it comes the concept of data warehousing as we know it today. The data warehouse, this centralized place where the data of the entire organization is stored, integrated and normalized, was born in the 60s and 70s, and Inmon popularized the term in the 70s. Data warehouses lived through a golden age in the 80s and 90s, with technologies such as Oracle and Teradata; then, truth be told, they experienced a small decline with the advent of big data, but now they seem to be living a second youth with the rise of cloud data warehouses.
Well, here on screen you can see the classic framework used to logically describe a data warehouse. Data engineering is the discipline that brings together everything related to processing data and getting it ready for subsequent exploitation. Within it we find concepts as diverse as real-time data processing, quality rules, or how to generate a central logical model with its dimensions, its fact tables, etc.
A
Okay,,
and
now
what
I
propose
is
to
analyze
what
major
trends
are
currently
occurring
in
the
market
in
the
industry
that
are
in
some
way
redefining
the
entire
concept.
In
data
engineering.
We
are
going
to
analyze
three
major
trends,,
the
first
of
which
is
what
I
call
the
common
separation
of
responsibilities,
specialization
or
hyper
specialization,.
A
What does this mean from a technological point of view? However you look at it, we are moving from a process-centric world to a data-centric world. In the 80s, 90s and 2000s, the kings of business software were undoubtedly the ERP and CRM applications, such as SAP or Siebel, which digitalized key business processes in the organization. From roughly 2010 onwards, we are seeing that the main source of value is not so much the process itself, but the analytical treatment of the data that these processes generate in order, as you already know, to improve decision making by placing the data at the center of everything. And what is happening is that these platforms, which at the beginning were somewhat more monolithic, are being disaggregated.
They are specializing into different components, each of which adds a lot of value to the complete stack. Before, we had one database that pretty much did it all. Now we are seeing that database break down into a storage layer, SQL compute engines, real-time stream processing, a semantic model, a scheduler separate from the database, different tools to generate and ingest the data and, of course, the different modes of consumption, both on-premises and in the cloud.
This central role is being divided into multiple specializations, and we find titles as specialized as, for example, pipeline orchestration engineer, machine learning engineer or data infrastructure engineer, and on the exploitation side business analysts, BI analysts, etc. Finally, as a summary of the concept of separation of responsibilities: what hyper-specialization does is break monolithic components into distinct, more agile, highly specialized subcomponents with greater autonomy, as you can see.
This approach has multiple advantages, but also some disadvantages, such as the need to integrate all of these components so that they talk to each other naturally and we do not have to make a brutal effort on integration. And I am telling you this today, but tomorrow it can change: it is a very fast-moving ecosystem, where practically every month or every quarter we see new components appear within this framework. Some of them are successful, such as the concept of the feature store; others not so much, and what we see is that they tend to disappear.
It is, therefore, a super dynamic and vibrant ecosystem. Well, the second macro trend that I want to talk about within the world of data is "everything as code". When we talk about data engineering, images like the ones you see on screen come to mind: those famous drag-and-drop applications of little boxes for composing your transformations, applications such as Informatica PowerCenter, Oracle Data Integrator, Talend, IBM DataStage, etc.
Although the learning curve of these tools is relatively gentle, they present a series of problems that I call the sins of ETL. The first is scalability and performance: in general, these integration tools collect data from various data sources, integrate it in an internal engine, and finally deposit the result in the data warehouse. And what we are seeing is that the raw compute power of today's cloud data warehouses is brutal.
Therefore, it is better to flip the paradigm and do everything within the data warehouse, once the data has been loaded into it: this is the concept of ELT versus ETL. The truth is that most of these tools were not born with this concept embedded. Many of them are trying to make the change, but they are tools designed much more for data integration than for working with the new cloud data warehouses.
They therefore present scalability and performance problems. And, as we said before, we are moving from a process-centric world to a data-centric world: data volumes will keep growing and growing, and it is super important that we have tools that scale as our data estates grow. The second sin of ETL brings together a large package of things, such as configurability, reliability, rigidity, collaboration and automation. These tools are, by definition, quite rigid.
They provide a series of out-of-the-box transformations, such as a join, an aggregation, some load modes; but the moment you step slightly off the beaten path and want to do something a little more complex, what you end up with is a little box that is really a script with a lot of code you have developed yourself. We therefore lose the whole point of these tools, which end up relegated to being mere script launchers. On the other hand, there is version control: how do you diff two pipelines of little boxes, how do you do, for example, a pull request or a merge? They also present serious deficiencies for collaborative development and limit the level of automation we can reach.
And finally, the third sin of ETL is vendor lock-in versus standards and open source. In the end, all of these tools have a favorite: for example, if you use Oracle Data Integrator, it is naturally very well thought out and integrated with the Oracle database; the same goes for IBM DataStage with the IBM stack. That is to say, they are solutions with quite significant lock-in, and if you want to migrate from one to another it is practically a nightmare, because in the end they are not based on open standards; they are proprietary. Moving from one to another is therefore super complicated, and you become a bit of a prisoner of the vendor you have chosen. The new data engineering tools break directly with this approach and, as I said, they have made real inroads.
So what we have done at the data level is look at software engineering and apply its concepts: version control, continuous integration and continuous deployment, a basis of standards, extensibility of the frameworks, and portability. We are taking a programmatic approach to the world of data transformation and combining it with the concept I told you about previously, the disaggregation of integration: we now have specific tools for, for example, extraction and loading, and specific tools for transformation within the warehouse using open standards such as SQL. The promise of DataOps, therefore, is to somewhat fill the gap that exists between data engineering and software engineering, bringing these good principles and this maturity to the discipline. That, in short, is the "everything as code" approach. The third great trend in the world of data is the embrace of ecosystems and open standards. And what does this mean?
I remember a few years ago, around 2012 or 2013, when everything was big data: every day a new framework appeared, and the Hadoop ecosystem looked like a zoo, there were so many animals. What we have seen as this discipline has matured is that the years have gone by and the dust has settled; and when the dust settles, what remains standing is what adds value. What added only smoke did not survive, and in fact not that many things have remained. For example, all the promise of Hadoop and everything that appeared around it has practically been distilled into Spark. Python has more or less been chosen as the default language for programming these data pipelines, as we will see later. The concept of containers has also remained, with Kubernetes as the platform where those containers run, and we have seen a resurgence of SQL.
Today there can hardly be anyone in the world of data engineering who does not know SQL; it is a must, and you have to start somewhere. I always recommend that you start with SQL and then look at some other frameworks like Pandas DataFrames, etc. But in the end, notice that the new platforms we are seeing are fleeing from the concept of the one-stop shop, where a single vendor would give you everything packaged and pre-integrated, but you could not move beyond it.
We are opening that up: the appeal now is the concept of the ecosystem, the concept of the platform, where today you can be running a pipeline in Python, but if something new comes out tomorrow that makes sense, it is relatively easy to integrate. You do not have the lock-in of the slightly more monolithic one-stop-shop concept that dominated the previous era. And finally, last but not least...
...cutting across everything is the use of the cloud as the default medium for deploying practically all of our pipelines. So, summarizing data engineering in 2021: one, separation of responsibilities and disaggregation of components; we break with the monolithic one-stop-shop concept and embrace very specialized components, based on open standards, that communicate well with each other. Secondly, everything as code, DataOps: the application of software lifecycle principles, testing, automation, to the world of data. And lastly...
A
We
embrace
ecosystems,,
we
flee
from
approaches
with
veedor,
locking
and
all
of
this
is
deployed
in
the
cloud
today,
so
that
this
introduction
a
bit
and
what
I
propose
to
you
now
is
that
we
are
going
to
build
together
a
modern
pack
line
using
all
these
technologies
well
and
that
we
are
going
to
build.
We
can
start
deploying
a
day
the
warehouse
in
the
cloud.
In this case we are going to use Google Cloud and BigQuery, within which we are going to have some public data, the Stack Overflow dataset, which is what we are going to work on. Specifically, we are going to work with the Stack Overflow data, the posts and the comments, and do some processing with them: we are going to process the data within BigQuery, using SQL, with dbt.
dbt is a framework for building these data transformation pipelines using SQL within the database, everything, as I always say, from a programmatic approach. We are going to generate some tables and some aggregations, and then we are going to export this data in a format that is very standard nowadays, such as, for example, JSON.
We will export them from BigQuery, to show how easy it is to integrate these platforms and how complete the ecosystem is, to Google Cloud Storage, a distributed object storage system in the cloud. And what are we going to do then? We are going to process this data with a Spark cluster: we are going to implement a little PySpark program.
Orchestrating it all will be Dagster, a data pipeline orchestrator. Notice how it attends to the principles we talked about before. First, the principle of hyper-specialization: we go from a generic scheduler, such as Apache Airflow, for example, which is a super cool framework but generic, to an orchestrator hyper-specialized for data pipelines, with all the advantages that brings. Second point: everything we develop with Dagster is going to be code, all of it Python code.
Now we will see in an example how simple it is to develop this pipeline. And third, it is all based on open standards, where we will be able to integrate things like dbt, Spark, SQL or Jupyter notebooks using Papermill, all of it extensible and open. Here is the GitHub repository, where you can see that it is totally free software.
So we are going to develop this end-to-end pipeline using Dagster and all of these components we have been seeing, BigQuery, dbt and Dataproc, with data in JSON and Parquet, and visualization afterwards with Jupyter. Very well: I am here in PyCharm, and what we are going to do is create a new project to show you.
We install Dagster. In Dagster there are two super important abstractions: the concept of the solid and the pipeline. A solid is a piece of code that holds the logic of one step of our pipeline. A solid can run SQL code, a solid can run dbt, it can run Spark code, it can launch a Jupyter notebook, it can run Snowflake, etcetera, etcetera. And what a pipeline does is compose several of these solids.
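To make the two abstractions concrete, here is a minimal sketch in the 0.x Dagster API the video uses; this is illustrative rather than the exact code from the repo, and the names are made up:

```python
from dagster import pipeline, solid

@solid
def say_hello(context):
    # One step of the pipeline: any Python logic can live here.
    context.log.info("Hello from a solid!")

@pipeline
def hello_pipeline():
    # A pipeline composes one or more solids into a graph.
    say_hello()
```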
And what we are going to do within our pipeline is call the solid functions, so that we end up with a pipeline that simply uses this solid you can see on screen. And how do we run it? With dagit, the Dagster tool that executes our pipeline. So we return here to the command line and pass the file to it. Dagster also has, of course, a Python SDK, but it has this command-line interface as well.
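For reference, a run can also be launched from the Python SDK instead of dagit; a minimal sketch against the hello_pipeline from the earlier example, assuming Dagster 0.x:

```python
from dagster import execute_pipeline

# Programmatic equivalent of launching a run from dagit;
# `hello_pipeline` is the sketch pipeline defined earlier.
result = execute_pipeline(hello_pipeline)
assert result.success
```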
We launch it and it opens for us here on a port, because the dagit application I am showing you is this one right here. Here is where the pipeline is graphed; in this case it only has this one step. We can see all the metadata, which is honestly quite rich, and in the Playground we can launch executions.
So what we are going to do is launch this execution, which in principle will only run that solid we said should write to the screen. And there it has run: we can watch it as it executes, with the timings and so on, and below we see the execution steps. We can inspect the output; there is a warning here that is simply a debug-level matter, and in the output, well, we have what we wrote. So this is a super simple Dagster pipeline.
What we are going to do now is complicate this a little more, and the first thing I propose is to play a bit with BigQuery and dbt. Here is some slightly more complicated code that we are going to analyze. First of all, I want to show you a bit of the structure I have: I have a dbt folder, which contains a dbt project.
I have a jupyter folder, where I have a Jupyter notebook, and I have a spark folder where I have some PySpark code. I think the best way is to first run it, look at the resulting graph a bit, and then analyze it step by step. This is our Dagster pipeline.
First we create the cluster, then we execute the Spark job, and then we delete it. Look how cool: we will only pay for the time that this cluster is up in the cloud. Then what we are going to do is download the resulting file locally, in Parquet format as I already said, and we are going to visualize it with a Jupyter notebook that we already have created. Here we get a link to be able to view the notebook; we will see it, okay.
Take a quick look at this: this is my file where I have defined the solids to, for example, create a Dataproc cluster, delete it with this one here, or launch the Spark job. It is interesting to note that Dagster already has native integrations with Dataproc, so we will be able to create a Dataproc cluster simply with this create_cluster function here, and we will be able to launch a Spark job simply with submit_job. Nothing else is necessary; everything is already pre-integrated, based on open standards, in this case Hadoop and Spark. And to delete the cluster, the same applies. That is to say, it is already integrated with both Dataproc and BigQuery.
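The video uses Dagster's built-in Dataproc integration; as a hedged alternative sketch of the same create/run/delete chain with custom solids and the plain google-cloud-dataproc client (all names, paths and the minimal cluster spec are placeholders, and the call shapes assume google-cloud-dataproc 2.x):

```python
from dagster import InputDefinition, Nothing, pipeline, solid
from google.cloud import dataproc_v1

# Placeholders: adjust to your own project, region and cluster name.
PROJECT, REGION, CLUSTER = "my-project", "us-central1", "ephemeral-spark"
ENDPOINT = {"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}

@solid
def create_cluster(context):
    # Minimal cluster spec; in practice, paste the "equivalent JSON" that
    # the Dataproc console generates (shown later in the video).
    op = dataproc_v1.ClusterControllerClient(client_options=ENDPOINT).create_cluster(
        request={
            "project_id": PROJECT,
            "region": REGION,
            "cluster": {"project_id": PROJECT, "cluster_name": CLUSTER, "config": {}},
        }
    )
    op.result()  # block until the cluster is ready

@solid(input_defs=[InputDefinition("start", Nothing)])
def submit_spark_job(context):
    dataproc_v1.JobControllerClient(client_options=ENDPOINT).submit_job(
        request={
            "project_id": PROJECT,
            "region": REGION,
            "job": {
                "placement": {"cluster_name": CLUSTER},
                "pyspark_job": {"main_python_file_uri": "gs://my-bucket/job.py"},
            },
        }
    )

@solid(input_defs=[InputDefinition("start", Nothing)])
def delete_cluster(context):
    dataproc_v1.ClusterControllerClient(client_options=ENDPOINT).delete_cluster(
        request={"project_id": PROJECT, "region": REGION, "cluster_name": CLUSTER}
    )

@pipeline
def spark_pipeline():
    # Nothing-typed inputs express pure ordering: create, then run, then delete.
    delete_cluster(submit_spark_job(create_cluster()))
```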
Well, here we have, for example, another solid; this one is a little longer. What it does is download a file from a GCS bucket. Later we will look in a bit of detail at these input parameters we pass here, but the point I simply want you to see is that they are all solids.
Let's start analyzing how Dagster is integrated with dbt. The only thing we have to tell it is where our dbt project directory is, and it will execute it; we can also pass it arguments and it will execute them through the dbt CLI. So we can pass it all the flags and parameters we want. The first thing our pipeline does, therefore, is call dbt. As I said, in dbt I have already created this project.
We are not going to go into the details of everything dbt is, but stay with this idea: it is an open-source framework for executing transformations within cloud data warehouses such as BigQuery or Snowflake; in this case, as I said, we are connecting it to BigQuery. Within dbt, the main concept is the model, which is nothing more than the table we are going to generate. There will be two tables derived from one another; in this case, notice that I simply have two models here.
The first aggregation I do is this one, called stackoverflow_staging, let's say. What I am going to do is generate a table, a copy of sorts, of this table here, which is the BigQuery public Stack Overflow table. Let's see it in BigQuery. We are now here in BigQuery, and here is the table: one row per question.
It has the number of answers, the number of comments, when it was created, the user who asked it, the tags it has been given, and so on. As I said, we do not bring all of it across exactly; what I am doing here is generating, in this case, a view over that staging data, which I can then refer to by name from this other aggregation.
If we run "dbt run" in the project, this transformation is executed using dbt. Notice that it found two models: it generates the view, it does the aggregation, and the final table remains, which works perfectly.
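Dagster ships a dedicated dbt integration, but as a hedged stand-in, a solid that simply shells out to the dbt CLI could look like this (the project path is a placeholder):

```python
import subprocess

from dagster import solid

@solid
def run_dbt(context):
    # Invoke the dbt CLI against the project folder shown earlier.
    result = subprocess.run(
        ["dbt", "run", "--project-dir", "dbt"],
        capture_output=True,
        text=True,
        check=True,  # fail the solid if dbt returns a non-zero exit code
    )
    context.log.info(result.stdout)
```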
What I want to do now is the same thing I have done from the command line, but from Dagster, chaining all the steps together. How can you do that? It is as simple as I said before: we come to the pipeline and we already have this command here; the only thing it is going to do is execute what we want. We are going to execute it in standalone mode first; for that, I am going to comment out the rest of my pipeline and leave only one solid, this dbt one.
Now we have this run executing and, as you can see, it is practically just calling dbt; it does exactly the same thing we saw before. The nice part, as I keep telling you, is that Dagster already integrates the calls to dbt for us, which is quite cool. The run has ended successfully. Looking below, we can see all the metadata being generated and all the output from the warehouse. Here is the direct output: it generated, as I say, the view and this table. Very well.
So what we are going to do now is keep composing this pipeline, making it more and more complex. How are we going to define the relationships between the different solids, so that one is executed before and another after? What we are going to do is embed the calls here, uncommenting what we had commented out before.
If you look at this solid at the end, I am passing it by parameter to another solid that is here: bq_solid_for_queries. What is this? It is a solid that executes queries inside BigQuery, given an SQL string. What I am practically doing is telling it to first execute the inner solid, this one here, and then the outer one. I have laid it out this way to make it a little clearer, because it has many parameters here.
So it is a little more readable. What I am doing here is taking the output of sql_process and, as you can see here, feeding it into the creation of the Dataproc cluster, to generate, let's say, this dependency between steps.
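A minimal sketch of how such a query solid can be declared, assuming the dagster_gcp 0.x API the video appears to use (the project, dataset and table names are placeholders):

```python
from dagster import ModeDefinition, pipeline
from dagster_gcp import bigquery_resource, bq_solid_for_queries

# Factory that builds a solid running the given queries against BigQuery.
run_queries = bq_solid_for_queries(
    ["SELECT * FROM `my-project.my_dataset.stackoverflow_agg`"]
)

@pipeline(mode_defs=[ModeDefinition(resource_defs={"bigquery": bigquery_resource})])
def bq_pipeline():
    run_queries.alias("sql_process")()
```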
Well, so how does this other solid here, bq_solid_for_queries, work? As I was saying, it is another integration that Dagster brings directly, and it brings a lot of them. We can go to the documentation and see all the integrations it has: for example with BigQuery, to create a dataset, delete it, or run a query string, and the one we are going to use now, Dataproc, with GCS, etcetera, etcetera. It has a lot more, as I said: for example with Jupyter, with Airflow; the truth is that the integrations that have been built are amazing. We can see the entire list here: also Dask, for distributed execution; Great Expectations, very cool for all the testing part; MySQL; PagerDuty, to send alerts, for example; Pandas, for processing data frames; PySpark; Slack. All the integrations that have been generated are very cool. Now let's look at the execution of a query.
So we have bq_solid_for_queries, to which we simply have to pass a SQL query. In my case I have defined it in another configuration file, for simplicity and readability, along with some other cloud configuration generated as JSON. So look at the string I have here: it is this export. I am telling it to export to a GCS bucket, in JSON format, with overwrite, and the query I want to run is practically just a string that collects everything from that final table we generated before with dbt.
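The export itself can be expressed as a single BigQuery EXPORT DATA statement; a hedged sketch of such a query string (bucket, dataset and table names are placeholders):

```python
# EXPORT DATA writes query results straight to GCS from inside BigQuery.
EXPORT_QUERY = """
EXPORT DATA OPTIONS(
  uri = 'gs://my-bucket/stackoverflow/export-*.json',
  format = 'JSON',
  overwrite = true
) AS
SELECT * FROM `my-project.my_dataset.stackoverflow_agg`
"""
```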
Okay then: this solid is quite simple. It executes custom queries, in this case inside BigQuery. And what does it do? It instantiates a BigQuery client, builds a QueryJobConfig object, launches the job, and gets the result.
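In plain google-cloud-bigquery terms, that sequence is roughly the following (the project id is a placeholder, and in the video the export query string goes where the trivial query is):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")         # instantiate the client
job_config = bigquery.QueryJobConfig()                 # default job configuration
job = client.query("SELECT 1", job_config=job_config)  # launch the query job
rows = job.result()                                    # block until it finishes
```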
Well then, connecting this with the pipeline: up to here we have already run some SQL transformations with dbt and exported the table. As I said, the second thing we are going to do now is process the data extracted from BigQuery with PySpark. To process with PySpark we need a Spark cluster, as we have said, and Dagster is also integrated with Dataproc. Dataproc is the service within Google Cloud that allows us to execute Hadoop, Spark, Hive, Pig, etc. on ephemeral clusters. We will therefore have to do three steps: create the Spark cluster, run the Spark job, and delete the Spark cluster.
The good thing is that Dagster, as I said, already has these integrations. Look here at what I am doing. The last thing in this chain of solids is deleting the Spark cluster, which depends on launching the Spark job, which depends on the creation of the cluster, which depends on sql_process, that is, on the previous output.
Therefore, the way to read it is from right to left. So, how do we create the Dataproc cluster? We come here, and notice that we already have a function, as I said, integrated into Dagster, that allows us to create this cluster. What we need is to give it the configuration of the cluster we want to create, and this, as I was saying, is also taken from this other configuration file that is here, which I have generated.
In the Dataproc console we click on create cluster and define everything we want: whether we want to install Conda, for example, and the size of the nodes, say n1-standard-8 machines for the master and six workers. And we can keep customizing it if we want: internal IPs only, properties, etc. Once we have defined the cluster we want to create, the console gives us the equivalent cluster configuration in JSON format, which is exactly what we need to pass here in Dagster. So it is an easier way to generate this configuration, and we make sure we have no errors. It is exactly the same for generating the job.
I point and click, set all the parameters, and at the end it also generates the equivalent REST payload below, which we can just copy. Okay, so notice that here I have already generated this PySpark code; it is here, and I have uploaded it to a Google Cloud Storage bucket, but I also have the code here, and you should take a look at it, because basically what it does, starting from that table we had before, is the typical word count; nothing special about the process. The PySpark job processes this JSON file and then saves the result in Parquet format, right here. And that is the whole of the processing itself. So, nothing special: we use, in that context, this little PySpark program, which you can take a look at, and we keep the pipeline that we had before. So look.
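A hedged sketch of this kind of PySpark word count over the exported JSON (the paths and the column name are placeholders; the real job in the repo may differ):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("so-wordcount").getOrCreate()

# Read the JSON that the BigQuery EXPORT DATA step left in GCS.
posts = spark.read.json("gs://my-bucket/stackoverflow/export-*.json")

# Split titles into lowercase words and count occurrences.
counts = (
    posts.select(F.explode(F.split(F.lower(F.col("title")), r"\s+")).alias("word"))
    .groupBy("word")
    .count()
    .orderBy(F.desc("count"))
)

# Spark writes Parquet with Snappy compression by default.
counts.write.mode("overwrite").parquet("gs://my-bucket/stackoverflow/wordcount")
```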
And finally, what we have is the data we processed with Spark, saved in Parquet format. What has been generated is a small notebook, which we cannot see around here, since the free Community edition of PyCharm, as you know, does not render notebooks well. So what I have done is open it in a local Jupyter notebook; in fact, let's do that, and that way we will see it better.
It is super simple: all it does is read the data in Parquet format and simply plot it with matplotlib, which in the end is a bit crude, because it is a count of all the words, as I said, in the titles of the Stack Overflow posts, and here is the one that is repeated the most, which is fine. The point is simply to show the integration of Dagster with notebooks, which is also very cool. Very good; now that we have analyzed practically all the code, let's make a small summary of the pipeline. As I said, what the pipeline is going to do is: one, call dbt; two, export the data using the SQL statement through the integration with BigQuery; three, generate a Dataproc cluster; four, launch a PySpark job on this cluster; five, delete the cluster. And then, lastly, what we are going to do is download the data (this is the step I specifically want to point out) and, finally, run the notebook, which runs with Papermill. That is the integration that, as we know, runs these notebooks; Papermill comes from the very cool people at Netflix.
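The notebook step uses Dagster's Papermill integration, dagstermill; a minimal sketch of how such a solid can be defined (the notebook path and names are placeholders):

```python
from dagster import InputDefinition, Nothing, file_relative_path
from dagstermill import define_dagstermill_solid

# Wrap a Jupyter notebook as a solid; Papermill executes it with injected
# parameters. The Nothing input only orders it after the download step.
visualize_results = define_dagstermill_solid(
    "visualize_results",
    notebook_path=file_relative_path(__file__, "jupyter/visualize.ipynb"),
    input_defs=[InputDefinition("start", Nothing)],
)
```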
Well, we now have everything ready, and what we are going to do is execute the complete pipeline and take a look at how it goes. So now we simply run dagit with -f and pass it the file.
Here you have all the steps. Okay, we come to the Playground, and what we are going to do is the execution. But careful: look here in red, it is telling me that we need to configure this download_data section, and this is what I want to show you. Notice that this is a solid; we have talked a lot about them. We can give it a series of configuration metadata for many things: for example, a description, which is simply good practice, bringing the documentation closer to the code.
We can define the data that is going to enter it, for example if we were to pass a DataFrame from solid to solid; in our case, as you have seen, everything is written to disk as soon as possible and read back in later steps. And the most important thing is this right here: the config schema, the input parameters used to configure the step, let's say the typical parameters of any process. In this case, what it defines is that this solid admits three parameters. We can therefore have reusable solids, reusable components.
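As a hedged sketch of what such a configurable solid could look like (the three parameter names are illustrative, not necessarily those in the repo):

```python
from dagster import Field, String, solid
from google.cloud import storage

@solid(
    description="Download a blob from a GCS bucket to a local file.",
    config_schema={
        "uri": Field(String),         # bucket to download from
        "file_name": Field(String),   # blob name inside the bucket
        "local_path": Field(String),  # where to leave the file locally
    },
)
def download_data(context):
    client = storage.Client()
    bucket = client.bucket(context.solid_config["uri"])
    blob = bucket.blob(context.solid_config["file_name"])
    blob.download_to_filename(context.solid_config["local_path"])
```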
We are telling it the URI (a URI of the bucket from which to download the file), the name of the file, and where to leave it locally, okay? So that is what Dagster is asking me for here, and it even generates a stub of this configuration in YAML format. We could obviously provide it beforehand or at runtime. I already have it prepared, so I can paste it in; I think it is this one.
I.
A
?
I
put,,
let's
say,,
all
the
parts
of
it
from
the
file
in
parquet
and
until
then
you
already
configured
the
input
data
to
the
input
products.
For
that
step.
This
obviously
is
extensible
to
all
the
solids
that
we
want.
What
happens
that
as
I
say
our
pipeline,
the
It's
true
that
we
have
n't
used
it,
a
lot.
Yes,,
we're
all
giving
it
this
this.
This
entry
description
of
saying
racing,
because
what
we're
simply
chaining
together
is
these
steps,
but
without
passing
anything
between
them,.
So
we
do
have
to
define
this
here.
but
hey,.
A
It's
a
more
technical,
issue,
okay,,
it's
already
launched,
execution,.
It
gives
you
something,
I'll
quote
you
below,.
It
looks
good
here,,
it's
down,
here,,
launch
the
execution,
and
well,.
It's
going
to
start
to
be
done.
Here,.
All
this
movie
that
we've
been
talking,
about,
we're
going
to
analyze
it
a
bit
all
the
steps.
Okay, we can watch it in different ways. The first thing to be executed is the dbt step; then it will export the data for us; then comes the Dataproc part; and at the end we will download the data and view it from the Spark notebook.
As I said, what we have computed is simple: for each post we have taken the titles and generated word counts, ordered, it seems to me, by frequency, and so on. That, in short, is the flow. Let's go back to Dataproc to see if the ephemeral cluster has already been created.
And indeed it has; it was created during the Spark job step that we already see here. It is effectively a PySpark job: we submit it and, as I said, we can see the whole output of the Spark job we executed on the ephemeral cluster. What we have done is simply count the frequency of words, the typical example, and finally, as you can see, we save the result in Parquet format, with Snappy compression. And that is it up to here.
Okay, the end-to-end pipeline has now finished executing. The truth is that this little Jupyter notebook step took quite a while; we can check the timings to confirm that this step is indeed the one that took the longest. Here we can see the notebook output, as I was saying. So that is it: we have taken a quick tour of Dagster, and it is a very interesting tool.
Obviously we have only seen a level-100 view, a broad brushstroke of Dagster; it is much more powerful. I, for example, am researching how to deploy it, for example on a Kubernetes cluster or things like that. And, to finish the video, I think we have illustrated quite well these three principles of data engineering...
...in 2021 that we have been discussing. Separation of responsibilities and specialization: we have specific object storage for the data lake; we have a specific repository, BigQuery, for SQL-based data processing; we have ephemeral Dataproc clusters to run code at Spark scale; and we have notebooks for visualization and data storytelling. And then, obviously, there are many more tools appearing in this space: the ingestion part, the streaming part, etcetera, etcetera.
Second, everything as code: note that we only used the visual interfaces for what they are for, visualizing, not for developing. All the code we wrote in Python, where we are going to have versioning in Git, so that we can apply good software principles such as testing, CI/CD, documentation close to the code, etcetera. And finally, ecosystems, no vendor lock-in, and open standards: we have used Python, we have used SQL, we have used Spark; they are all standards and all open-source projects, precisely to avoid vendor lock-in.
And lastly, we have also seen deployment in the cloud, to take advantage of the cloud; the only thing running locally in this case was the orchestrator itself, but obviously, as I said before, I am now investigating Dagster deployments on GCP, for example on a Kubernetes cluster, which works very well. So that is it for the video. I hope you have found this introduction and this reflection on data engineering in 2021 and its modern principles interesting: stepping back a bit from the machine learning phase to, as I said, this more initial part of the ingestion, processing and transformation of data.