Cloud Native Computing Foundation Kubernetes Community Days Guatemala - Día 2, 27 Nov 2021

From YouTube: Kubernetes & Data Engineering - Salvador Cruz, Bonzzu

Description

Why has Kubernetes become so important in the field of data engineering? It is well known that containers have become popular because of their portability, and that Kubernetes is a robust, scalable tool. Let's discover how K8s can be useful in a data lake or lakehouse.

Contact Salvador at:
- https://www.linkedin.com/in/salvador-cruz-0840aaa3
- https://twitter.com/salva_sgcg

A

Good afternoon, everyone. It is a pleasure that you are still tuned in to the professionals edition of Kubernetes Community Days. I have the pleasure of introducing our guest speaker for this session, Salvador Cruz. Nice to meet you. The time is all yours. Thank you very much.

B

Thanks for the introduction. Today I'm going to talk a little about Kubernetes and data engineering, above all because Kubernetes is a very powerful tool that we can use today. So we are going to see how we can take advantage of it on the big data side. Let's review the agenda.

B

I am going to talk about why to use containers for data processing. Likewise, we are going to review why Kubernetes has grown in popularity lately in this field. We are going to see some big data practices, based above all on the experience I have had working on projects with clients and users. We are also going to see some of the tools that can be quite useful.

B

These are just some of the whole range that we have, but I think they are quite useful. We will conclude with some recommendations about what can be done in Kubernetes and which approach best fits your project, and we will finish with some general points. Now, let's start with why containers. Let's start by mentioning that containers offer us standardization, as you well know.

B

It is, in a way, an isolated environment in which we can have our libraries and packages. Containers also help us migrate in a simple way: if we want, for example, to run a container on a Linux machine or on a Windows machine, it will be practically the same, because the container gives us that standard of use. That is, we do not have to do any additional configuration or anything of the sort. That is why containers are so popular today.

B

They help us with many tasks of this type, especially for the applications we want to develop. Containers also make it easier to support repetitive tasks and jobs. Speaking specifically about data engineering, the focus is on these repetitive tasks, as you know from the world of big data.

B

Big data tools use temporary compute, which means that we will probably be autoscaling incrementally according to the volume of data we are working with. We are also going to see why containers give us better support, especially for a microservices architecture in the case of Kubernetes, since we can have the microservices of our data platform configured in different ways.

B

We can have, say, microservices for streaming, for data processing, for analysis, and for machine learning models, all clearly separated, and I think that is a better way to manage the application as a whole. For example, applications used to be monolithic, so maintaining them was quite complex, because in a way we had to modify all the dependencies and the entire system. Microservices have helped us keep that administration more separated.

B

That has brought us many advantages. Now we are going to talk about why Kubernetes has become so popular in the data field. First of all, there is container orchestration. As you well know, before Kubernetes it was a little more difficult to do that configuration and administration, because many things had to be done manually and the connections between containers had to be ensured. Kubernetes makes this a little easier.

B

For data engineering specifically, this is a very important point, since orchestration will always be needed; we will almost always be talking about temporary compute. We also have the advantage of a declarative definition. This means that, with simple templates, we can configure the whole application and everything we need for our data pipeline, whether we want to allocate memory or register a service. This is a great advantage because we only have to work with the YAML templates.
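
A minimal sketch of what such a declarative template can look like, written here as a Python dictionary and printed as JSON (Kubernetes accepts JSON manifests as well as YAML). The service name `ingest`, the image `example.com/ingest:1.0`, and the memory value are illustrative assumptions, not details from the talk.

```python
import json


def make_deployment(name, image, replicas, memory):
    """Build a Deployment manifest as a plain dictionary.

    The manifest only states WHAT we want (image, replica count, memory);
    it contains no steps describing HOW to get there.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            # Memory is declared, not allocated by hand.
                            "resources": {
                                "requests": {"memory": memory},
                                "limits": {"memory": memory},
                            },
                        }
                    ]
                },
            },
        },
    }


# Hypothetical ingest service for a data pipeline.
manifest = make_deployment("ingest", "example.com/ingest:1.0", 10, "512Mi")
print(json.dumps(manifest, indent=2))
```

Because the whole definition is data, it can live in a repository and be reviewed like any other file.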

B

We can keep everything together, perhaps in a repository, and work collaboratively; that is one of the advantages. Kubernetes also helps us a lot in maintaining the health of the execution layer. This means that we will always have a desired state: if we want a given number of pods, Kubernetes will look for a way to always keep that availability for the service being requested. As you well know, it also has the capacity to take action.

B

This means that if a container is not working properly, Kubernetes kills it and, to keep the service alive, builds a new one. That gives us a minimum capacity that we can offer the user: if they want at least, say, 10 pods running for the ingest service, we can ensure that we will always have those 10 pods running.
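
The desired-state idea can be illustrated with a toy reconciliation step. This is only a sketch under simple assumptions (pods are dictionaries with a hypothetical `id` and `healthy` flag); the real Kubernetes controllers are far more involved.

```python
def reconcile(desired, pods):
    """Return a pod list that matches the desired replica count.

    Unhealthy pods are dropped ("killed") and new ones are created
    until the number of running pods equals `desired`.
    """
    running = [p for p in pods if p["healthy"]]
    next_id = max((p["id"] for p in pods), default=0) + 1
    while len(running) < desired:
        running.append({"id": next_id, "healthy": True})
        next_id += 1
    # If there are more pods than desired, keep only the first ones.
    return running[:desired]


# Ten pods for the ingest service, two of which have stopped working.
pods = [{"id": i, "healthy": i not in (3, 7)} for i in range(1, 11)]
pods = reconcile(10, pods)
print(len(pods), "pods running")
```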

B

As a complement, Kubernetes autoscales in an impressive way: as the incoming information grows, it can scale so that everything we need keeps working correctly, depending on the configuration we provide. We also have parity between environments, because sometimes there are many differences between the development environment and the production environment.
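
For the autoscaling mentioned in this segment, the Kubernetes Horizontal Pod Autoscaler documentation gives the rule `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`. The sketch below applies that rule and clamps the result to configured bounds; it leaves out the tolerance and stabilization behaviour of the real autoscaler, and the numbers are made up for illustration.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas):
    """Simplified HPA rule: scale in proportion to metric / target."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Never go below the configured minimum or above the maximum.
    return max(min_replicas, min(max_replicas, raw))


# 10 pods averaging 90% CPU against a 60% target.
print(desired_replicas(10, 90, 60, min_replicas=10, max_replicas=50))
```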

B

Using templates gives us the parity we need to run the tests we normally require, in an environment that is as close to production as possible. We are also going to have faster iterations, in the sense that we work with the code and with the Kubernetes configuration, and we can work together with other teams.

B

We can also do the CI/CD part; in this case it would be something similar to DevOps, but more focused on data. DataOps, in practice, is not only DevOps applied to data; it has a lot to do with automating the data pipelines on your platform. Another advantage, and surely one reason Kubernetes has become so popular, is that we can use Helm.

B

As you know, Helm is a tool that helps us deploy faster, because it has preconfigured applications that only require us to pass parameters for a minimal configuration, and with that we have applications running in a short time. Now, for the best practices of both big data and Kubernetes, the first one I can recommend is to keep images small, especially when we are building Docker images for a container.

B

Sometimes we install libraries that we perhaps never use, so we should try to build containers with only the libraries we need. Keep things isolated if possible: as everyone knows, a pod in Kubernetes can have more than one container, but that can sometimes make administration more difficult because of configuration issues or other reasons.

B

So the recommendation is one container per pod, but only if possible. There are services that require more than one container to function, so it depends on the use case. We must also verify the base images, since they are the ones we use as the root, the pillars of our project, and we have to make sure that they are reliable and sufficiently secure, among other things.

B

So it is also very important to review what the pillars of our project are going to be.

B

Now, we also have to use namespaces and labels. This is a well-established practice in Kubernetes: we can use them to separate services and to quickly identify applications, so it is something I strongly recommend. As for the containers themselves, it is advisable not to use the root user, above all for security reasons. Give the user only the permissions it needs on the folders it must access and on the executables it has to run.

B

Beyond that, I would not recommend using root, because it is a very open permission. You also have to use Services to expose the containers running inside Kubernetes. They help us identify them quickly, say through a port or a URL, depending on how we want to call the service. It is very important to always configure the Service; it can be public or private.
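
The practices above (namespaces, labels, a non-root user, and a Service in front of the pods) can be seen together in a small sketch. The manifests are plain Python dictionaries using standard Kubernetes field names; the namespace `data-platform`, the labels, the image, and the ports are hypothetical. The `selects` helper is a simplified stand-in for how a Service finds its pods by label.

```python
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "ingest-0",
        "namespace": "data-platform",  # separates this platform from others
        "labels": {"app": "ingest", "tier": "streaming"},
    },
    "spec": {
        # Do not run as root; use an unprivileged user id instead.
        "securityContext": {"runAsNonRoot": True, "runAsUser": 1000},
        "containers": [{"name": "ingest", "image": "example.com/ingest:1.0"}],
    },
}

service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "ingest", "namespace": "data-platform"},
    "spec": {
        "type": "ClusterIP",  # private; a LoadBalancer type would be public
        "selector": {"app": "ingest"},
        "ports": [{"port": 80, "targetPort": 8080}],
    },
}


def selects(service, pod):
    """True if the pod is in the Service's namespace and matches its selector."""
    same_namespace = (
        service["metadata"]["namespace"] == pod["metadata"]["namespace"]
    )
    labels = pod["metadata"].get("labels", {})
    selector = service["spec"]["selector"]
    return same_namespace and all(
        labels.get(key) == value for key, value in selector.items()
    )


print(selects(service, pod))
```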

B

That depends on your design, but remember that you always have to identify the services in that way. The next point is focused on the big data side: it is recommended to run all of HDFS on a separate node.

B

HDFS is the distributed file system, and as you know, all the MapReduce tasks are done there. If it has to communicate with other nodes, that adds latency, so it is advisable to run everything in a single place to get somewhat better performance.

B

Now we are going to mention some of the tools, some that I have found and some that I have used, which are quite useful for these cases. For example, here we have Kubeflow. This is a tool that helps with the entire machine learning life cycle.

B

Behind the scenes it uses TensorFlow, so it facilitates the training of models, whether that falls to the data science team or to the data engineering team, depending on the use they want to give it. We also have Airflow. Airflow is one of the tools I have seen the most; it is used a lot for job scheduling, so I think it is very important in a data platform.
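
To make the job-scheduling idea concrete, here is a small sketch of ordering jobs by their dependencies. It deliberately uses only the Python standard library (`graphlib`, available from Python 3.9) and is not Airflow's API; the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs that must finish before it starts.
jobs = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "train_model": {"clean"},
    "report": {"aggregate", "train_model"},
}

# static_order() yields the jobs in an order that respects every dependency.
order = list(TopologicalSorter(jobs).static_order())
print(order)
```

A scheduler adds the time dimension on top of this: when each run starts, and what happens when a job fails.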

B

I have had to configure it, for example, with Kubernetes and on isolated instances, and it really is very important for keeping all the jobs in order: when they run and how they run. This tool will help you a lot. On the other hand, we also have something that is a suite of projects that can help us.

B

It is quite focused on Kubernetes, so you can get a lot of use out of it, and it helps us above all to automate processes, so that iterations are faster. We also have [unclear], which is also dedicated to machine learning. Basically, we have the raw data, we prepare the data, then we do the transformations and the training.

B

[unclear], and this can work even for proofs of concept, when you want to do a very quick test. You can check out all those tools. From my experience, I have used the operators a lot, for example for Kafka and Spark. They are quite useful; I used them with Helm, which helped me make configurations very quickly. So Kubernetes really offers us many benefits.

B

To conclude, I think the most important thing of all is that Kubernetes is agnostic, but that does not always mean it is the best solution, so use it only when you need to. That goes hand in hand with an in-depth study of the use case of your client or user, and more things will follow from there. For example, the amount of data that you are going to ingest, process, or analyze is very important, because Kubernetes can obviously be very powerful.

B

But if we are working with little data, it may not be necessary. We also have to look at the availability these tools need. Take Kafka, for example: if I want to configure Kafka to run streams 24/7, Kubernetes can be a good solution, but if it is a streaming service that works only a few hours a day, then maybe I should analyze or reformulate what I am dealing with. In this case, we have to look at how frequently the jobs run.

B

That tells us whether Kubernetes is really the best solution. Above all, calculate the costs and look at autoscaling to see whether Kubernetes supports all the data load we need. I am sure it will, but as autoscaling kicks in, the cost increases.
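
A back-of-the-envelope calculation shows why run frequency and autoscaling matter for cost. Everything here is an assumption for illustration: a flat price of 10 cents per node-hour, a 30-day month, and made-up node counts. Amounts are kept in whole cents to avoid rounding surprises.

```python
def monthly_cost_cents(nodes_per_hour, price_cents_per_node_hour, days=30):
    """Cost of one month given the node count for each hour of a typical day."""
    node_hours_per_day = sum(nodes_per_hour)
    return node_hours_per_day * days * price_cents_per_node_hour


PRICE = 10  # hypothetical price, in cents per node-hour

always_on = monthly_cost_cents([3] * 24, PRICE)             # streams 24/7
part_time = monthly_cost_cents([3] * 6 + [0] * 18, PRICE)   # 6 hours a day
autoscaled = monthly_cost_cents([3] * 18 + [10] * 6, PRICE) # daily peak of 10 nodes

for label, cents in [("24/7", always_on), ("6 h/day", part_time),
                     ("autoscaled", autoscaled)]:
    print(f"{label}: ${cents / 100:.2f} per month")
```

Under these assumptions, the six-hour peak at 10 nodes costs more per month than running the three base nodes around the clock.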

B

So you have to do a very detailed cost analysis and then monitor everything. When you can see which services are healthy, you get a better idea of how your entire configuration is working and whether all your services are working well. Also run a stress test, simply to check that autoscaling is enough for us to offer the services according to all the data ingestion and data processing we want to do.

B

For my part, that is all. It was a quick talk to review how Kubernetes combines with data engineering, but it is definitely a very powerful tool that can be used for everything related to big data. Thank you.

A

Thank you very much, Salvador, for the talk. I think it lays the foundations and clarifies many issues; it is very interesting and very enriching. We really thank you for your time. We have a couple of questions, if you could answer them.

B

Yes.

A

OK, the first question says: do you think that with Kubernetes, Docker could become obsolete?

B

I think it depends a lot on how much they have implemented, but I think Kubernetes was born precisely from that need. There were configurations that were too manual and had to be done by those configuring and maintaining the system, so Kubernetes came to solve those needs in an automatic way. So it may be that yes, it could easily replace it.

A

Yes, definitely maybe in the future.

A

Well, here we have another one, which goes more or less like this: Kubernetes, do you recommend it in the cloud, on AWS or Azure, on Windows or Linux? What is your opinion on this?

B

I think it depends a lot on where they have all the infrastructure configured. The advantage of Kubernetes is that it is agnostic; that is, I can go from AWS to GCP, or from Azure to GCP, because we have a clear, declarative definition. The only thing we need in order to have our cluster is the templates, and it works the same way on any of them, so I think it makes no difference.

B

It depends more on where all your infrastructure lives and on what suits you. Maybe there is not much difference in costs, but it could be that one option is more convenient for your infrastructure in the long term; that would call for a study.

A

Even for convenience, right?

B

Yes, even for convenience.

A

Fine. Another question we have says: you touched on the DataOps topic. Could you expand on it a bit?

B

Basically, as I already mentioned, DataOps is not only DevOps applied to data; I think there is a bit of a misconception about that. What we want to ensure in a DataOps process is the automation of the pipeline while, at the same time, being sure that the quality of our data is not lost. So we have to take care of two things, instead of just automating the whole process.

B

Here we have to concentrate on the code and, apart from that, on the data. That is where the data part enters the cycle. It is not just a single iteration loop like the one we know from DevOps; let's see it more as [unclear]. So we start in this part where the programmer, or the data engineer, only cares about the code.

B

Then we make sure of the quality of the data, that the code is working with the appropriate data. At the end, where the automation ends, we deliver the data to the data scientists for a subsequent analysis, machine learning or something like that. A different flow would start there, because different teams are the ones who work in a big data environment like that.
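
The two concerns described in this answer, automating the pipeline and protecting data quality, can be sketched as a pipeline step with a quality gate. The field names, the sample rows, and the 50% rejection threshold are hypothetical choices for the example.

```python
def quality_gate(rows, required_fields):
    """Split rows into those with every required field filled and the rest."""
    good, rejected = [], []
    for row in rows:
        if all(row.get(field) not in (None, "") for field in required_fields):
            good.append(row)
        else:
            rejected.append(row)
    return good, rejected


def run_pipeline(rows, required_fields, max_reject_ratio=0.5):
    """Automated step that stops the run if too much data fails the checks."""
    good, rejected = quality_gate(rows, required_fields)
    if rows and len(rejected) / len(rows) > max_reject_ratio:
        raise ValueError(f"{len(rejected)} of {len(rows)} rows failed validation")
    # Only validated rows are handed on to the data scientists.
    return good


rows = [
    {"user_id": 1, "amount": 20},
    {"user_id": 2, "amount": None},  # missing value, will be rejected
    {"user_id": 3, "amount": 35},
]
clean_rows = run_pipeline(rows, ["user_id", "amount"])
print(len(clean_rows), "rows delivered")
```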

A

Definitely. The last question we have says, more or less: what advice would you give to someone who wants to start in this world of Kubernetes?

B

I think the main thing is to start playing with Docker. Containers are the base of Kubernetes, so I think we have to start by looking at microservices and how containers work, and understand that a container is immutable, which sometimes requires designing them to be ephemeral.

B

Also look at what you want to do, because we can have microservices for fairly standard applications or microservices for data applications, and I think those are different things. So I would go to the basics: microservices, containers, Kubernetes configuration, and the templates. That has helped me a lot, so those are my general recommendations.

A

Very well, Salvador. Thank you very much for your time and for the talk. As I was saying, I think it is quite enriching for all of us who are watching. I don't know if you want to say something at the end, a final reflection or something.

B

I just want to thank you for the time you gave me to share these ideas. I am very passionate about data engineering, although I am currently working in development. These are disciplines that it is time to combine. We can do very cool things; we just have to see how to configure them, how to automate as much as possible, and then ensure the quality of the product being delivered.

A

Definitely. Many thanks to you, and also to our viewers. Just a reminder that the next talk is on another track, which you can find in the agenda, and that we have a raffle for a book: if you post on Twitter with the event hashtags, you can take part. Without further ado, thank you very much, Salvador, and thank you very much to our viewers. Until next time. Thank you.
