Description
gRPC in .NET, Python and Go machine learning systems
How gRPC is used to minimize latency in a distributed machine learning system. Different languages and operating systems communicate quickly without the overhead of REST communication: when avoiding even a 0.1 ms delay matters.
In our case, at GhostWriter AI, which is a product of my company, we use advanced natural language processing algorithms to understand text, and now images, just to understand what people are talking about: the contacts in the CRM, on the social networks, on the calls with the company.
The idea is to collect data, to analyze this data and to provide hints and suggestions, arriving in the last year at automatically creating text and automatically creating advertising. That is something that now looks quite normal, not so crazy, but it was crazy when GhostWriter and the guide product were created and the idea was born. Just imagine: we are talking about nine years ago.
So if I said "I want to visit Rome", the guide would scrape the voices about Rome, the events, the museums, the tickets, and, knowing my preferences, add my preferences to the preferences of my friends and create a real-time guide, with text, opening hours, tickets to buy. So we are talking about something from nine years ago, something really old in the age of technology.
That means there was no TensorFlow and no PyTorch. There were no labeling support services. So if we wanted to create a dataset, we had to collect the dataset ourselves. If we needed data, we had to create data. If we wanted to cluster something, we had to build by ourselves a system to cluster data for analysis.
So everything inside the machine learning system communicated through REST APIs, including the high-level APIs that aggregate machine learning algorithms inside our platform, and the integration with the other services, where our customers are, was also REST API.
This is not the real architecture, just a very simple abstraction. Imagine that the laptop is our customers. We know what each customer uses, like when you create tokens on Google Cloud: we know the APIs, we know everything about the services the customers use. These customers arrive and, to have access to the public services, the first thing is authentication and authorization; then load balancing of the REST APIs, and an application firewall, just to understand how to mix this information and to authorize certain kinds of operations.
Then there are the analytical services. These high-level services can communicate among themselves and use the low-level exposed machine learning models to collect the data, gather the information and aggregate a response. It means that a single call that arrives from the laptop is completely different from what arrives at the end of the flow; it is not one-to-one, because a high-level service can aggregate a lot of other services.
So a customer said: "I want an SDK to do all the calls in Go, because I see your REST APIs. Your REST APIs are a lot of calls, and it's very complicated, because I have to mix them together, and sometimes it takes a lot of time to respond, because I asked for something very complete."
Okay, so we started to do everything in Go, and then our second customer asked: okay, I want the same for Java. So within a week we started to create that SDK too, but this is a lot of work; there were only five of us at the time. And many times, as you know, these are problems that we have without seeing them: inside the data centers of customers, they started to suffer from a lack of data.
So, okay, is this an area that is not possible to optimize? It was already optimized, but there is a 10 percent, from the arrival of the first call, through the machine learning area, and back to the response, that we thought was possible to optimize. And just imagine: with 200 thousand calls every hour, a ten percent optimization is not nothing.
I mean, that is not a small amount of data. So, okay, we have to optimize the data center area. If we optimize the data center area and optimize the machine learning area, we can do something: we can remove some API services. What did I remove? If you go to the slide, you can see the highlighted services.
We included ONNX on the machine learning side, so we no longer have a high-level service that calls a machine-learning exposed model; instead the model is included in the relevant service. But it is then not possible for this model to be shared, and we have some other kinds of problems, because we have to put this inside our applications and we have 400 instances.
The second problem we had was the load balancer. We moved the load balancing directly onto the client with gRPC. It was also easier to manage, and to reach the other systems. Just imagine: before, everything was routed through the load balancer to do REST API calls.
Our customers, obviously, have networks, as you can imagine, with COBOL and so on, and they asked: okay, but we don't have the firewall authorization and the people to immediately switch some of our services to call gRPC. So the switch was not on the client side, just internal on our side. So we also used KrakenD, just to say: okay, we can create a gateway for gRPC, so our customers can just continue to use REST API, and we can move the inside to gRPC.
Now we can see just a few calls from the guide: the C# calls to create a channel with SSL, and the same with Python. With gRPC, if you already have something in place to authenticate people, you can just use it, and there is a way to automatically attach a token, just like what you do on a cloud platform.
Four milliseconds: that seems like not so much, but for a few hundred thousand calls per hour, it means that we save 13 minutes per hour. So just by saying "we are rewriting something in gRPC", okay, we gained around 20 percent just on the channel, on the movement of the data inside the platform. gRPC was new for us; we didn't know gRPC so well. So we configured all the services using unary calls, and some streaming channels also, just to start to have a look.
We didn't worry too much about latency and throughput at first, because that was not what we were looking for. So okay, there is the REST API, and this is the counterpart in gRPC; we had the possibility to move something and to say: okay, what do we have to change, because we need one on one side and one on the other. And this is interesting, because we have 108 projects, and of these 108 projects, a lot are completely different from the others.
A
Just
have
a
look.
If
we
receive
a
text,
a
text,
we
have
to
call
machine
learning
model
to
be
aggregate
data
to
create
sentiment,
analysis,
hashtags
entity,
recognition
links
and
so
on.
With
thousands
of
data,
the
image
is
different.
I
want
a
semantic
classification,
so
I
want
to
know
that
this
is
inside.
There
is
a
kids.
There
is
a
car.
There
is
a
knight
within
the
city
the
colors
are
parker
vivid
and
so
on.
There
is
my
brand
inside.
So when we had to resolve this issue, you could say: okay, we have resolved an issue, because we are using less computational resource since we switched to gRPC; but we had no idea whether we could scale something that isn't predictable.
What does that mean? It means that we are able to change how all the old system exposes its calls and channels. The clients and the servers are on gRPC, and the systems can close one of these channels and reopen the connection in a completely different way; they can change the number of channels to use.
The way the services are exposed changed too: instead of adding one more machine for a service that is slow, we accept that it is not fast, because it doesn't matter; it has to wait for another system anyway, so it is not possible to make it more responsive. Obviously, that was the easy part; the difficult part was that we now had more complexity.
We have to consider the switch time: okay, I can be faster, but the time to restart the services with another configuration takes time. There are problems with the programming languages' policies, for example: if you look at the optimization system and Python, the streaming connections are not so good on Python, so for some pieces you don't gain the same performance as you do switching on C# or on Go.
There is a downtime policy: if the system goes down because you're restarting too many machines, you have to understand whether your message is lost or is in a queue, whether you have a default to start from, and give the customer the possibility to stop something and say: okay, you don't have to get more machines or create more channels, because you are saturating the network connection, or your machines are taking all the bandwidth that we have.
The impact of just leaving the gRPC machine learning side to optimize the gRPC connections by itself was a gain of 0.003 seconds per call, which again is about 10 minutes per hour, on top of the roughly 15 minutes we gained switching from REST API to gRPC, without knowing anything about how to do the optimization. Because we had to configure 108 projects, we just told the machine: okay, optimize by yourself. I just give a language, a description, to say: this is important, this is connected.
This is not important, you don't have to use too many resources like CPU, RAM, network. And it optimizes itself and predicts, because, you know, after two months, three months, you start to understand that it works, the consumption, the prediction. So the big picture of just leaving gRPC to optimize by itself is a saving of 22 minutes per hour in total. In general, every day, we started to save eight or nine hours.
So what we do is just switch to gRPC and say: okay, manage by yourself what is important for you. We can adjust it, we have some limits, and this map shows what happens in our cloud. You can see the most used configurations from our customers and our systems: the stream is most used for analyzing long texts and, obviously, for images; streaming is never used on .NET Core on Windows, so there the machine learning algorithms, as tested, just continue to use unary calls.
So now we have the possibility to do things in another way. Something that is very important: all of this is very constant. Once it settles, we dismiss the optimization and just use the resulting configuration. So if we see that it is stable over one month, two months, it is obviously constant, and now, automatically, we know it is the best configuration for the customers.