From YouTube: Billions of NLP Inferences on the JVM using ONNX and DJL
Description
This session outlines Hypefactors' recently rolled out MLOps infrastructure, designed for billions of NLP inferences a day. The workload serves media intelligence and OSINT use cases. The infrastructure takes a Java Virtual Machine-first approach, enabled by ONNX interop and AWS's Deep Java Library (DJL). On top of that, we show how quantization drives further performance optimizations.
Hello everyone, my name is Viet, I'm the CTO of Hypefactors, and hello from Copenhagen. We are a small media intelligence company, and media intelligence means that we are continuously mining the media landscape for various use cases, like product launch tracking. A long time ago, I chose to base our infrastructure around the JVM.
I did that for the reasons that Adam mentioned in his talk. We also find the developer experience very good: the tooling and strong typing make refactoring really nice at large scale; there's a big ecosystem of reusable components; and if you want to build your own, you can interoperate with C through the foreign function interface. We've built a whole web crawler around it, covering 8 million sites, and consequently all our data pipelines are built around it too.
How that looks in our infrastructure today is that we take all these different data sources into our system: websites, printed newspapers and magazines, television and radio broadcasts, and social media posts. We turn them into business solutions like product launch tracking. We also track trust and reputation, or share of voice, to see how you are faring compared to your competitors. And there's much more you can mine out of the media landscape, because it's like an ongoing information generation engine.
Under the hood, this is powered by the JVM for the majority of it, and we've been doing that for quite a while. Back when we were much smaller, it only made sense to enrich the data we get, the websites and all the articles, selectively.
We did that enrichment with external APIs at first, because it was the faster way to do it, but then we started building our own models, initially because our models had to be customized to be more spot-on with these enrichments. Now we've grown to the point where we're enriching practically 50% of our data intake, so we are migrating to a system that in principle enriches all incoming data. That leads us to a big new engineering problem, because we calculated it: it comes to a few billion GPU inferences per day.
Another thing is that our product features now run on top of it. So suddenly the machine learning and all the models around it are not just nice-to-haves; they have become essential to keeping everything up and running, and therefore their criticality has increased.
How does it look technically under the hood? We get data in all sorts of formats, HTML, PDFs, and it then goes into a pipeline where we use DJL, which was mentioned by Jin from Amazon. DJL wraps the ONNX Runtime, and it also wraps Hugging Face's tokenizers for NLP tokenization. We use ZIO on top for cooperative multitasking. Combined, that enables us to build a high-performance machine learning pipeline where data is streamed in, enriched on the fly, and then written out to a database. That gives us enrichments like readership, named entity recognition, salience, sentiment, and such, and we run this at full scale.
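As a rough illustration of what this combination looks like in code, here is a minimal sketch of a DJL pipeline running an ONNX model through the ONNX Runtime engine with a Hugging Face tokenizer. This is not the speaker's production code; the model file, tokenizer file, and tensor names are placeholder assumptions:

```java
import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
import ai.djl.inference.Predictor;
import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDList;
import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ZooModel;
import ai.djl.translate.Batchifier;
import ai.djl.translate.Translator;
import ai.djl.translate.TranslatorContext;

import java.nio.file.Paths;
import java.util.Arrays;

public class EnrichmentPipelineSketch {

    /** Tokenizes with Hugging Face tokenizers and feeds ids/mask to the model. */
    static Translator<String, float[]> translator(HuggingFaceTokenizer tokenizer) {
        return new Translator<String, float[]>() {
            @Override
            public NDList processInput(TranslatorContext ctx, String text) {
                Encoding enc = tokenizer.encode(text);
                NDArray ids = ctx.getNDManager().create(enc.getIds());
                ids.setName("input_ids");        // placeholder tensor names
                NDArray mask = ctx.getNDManager().create(enc.getAttentionMask());
                mask.setName("attention_mask");
                return new NDList(ids, mask);
            }

            @Override
            public float[] processOutput(TranslatorContext ctx, NDList out) {
                return out.get(0).toFloatArray(); // raw logits
            }

            @Override
            public Batchifier getBatchifier() {
                return Batchifier.STACK; // lets DJL batch concurrent requests
            }
        };
    }

    public static void main(String[] args) throws Exception {
        try (HuggingFaceTokenizer tokenizer =
                     HuggingFaceTokenizer.newInstance(Paths.get("tokenizer.json"))) {
            Criteria<String, float[]> criteria = Criteria.builder()
                    .setTypes(String.class, float[].class)
                    .optModelPath(Paths.get("sentiment.onnx")) // hypothetical model
                    .optEngine("OnnxRuntime")                  // run on ONNX Runtime
                    .optTranslator(translator(tokenizer))
                    .build();
            try (ZooModel<String, float[]> model = criteria.loadModel();
                 Predictor<String, float[]> predictor = model.newPredictor()) {
                System.out.println(Arrays.toString(
                        predictor.predict("Hypefactors launches a new product")));
            }
        }
    }
}
```

With Batchifier.STACK, DJL can stack concurrent requests into a single batched GPU call, which is the kind of detail that starts to matter at this request volume.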
But now we've reached a scale where that didn't suffice anymore, and to scale it horizontally we looked at a Kubernetes-based system. That system was launched last week, so it's running and humming right now. At peak loads it's now yielding nearly a billion inferences a day, and it was quite a challenge along the way to get it running.
First of all, we needed to make it economical, so we looked a lot at quantization, which is usually our go-to approach. Initially we quantized to eight bits, but in this case, for this model, we noticed we lost too much of the model's effectiveness. Sixteen bits seems to be the sweet spot, where we got roughly a threefold gain over not quantizing at all.
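The natural way to decide whether a quantized model is still effective enough is to run the same inputs through the original and the quantized model and compare outputs. Below is a minimal sketch of such a harness using the ONNX Runtime Java API; the model files, input names, and the assumption that the FP16 export keeps float32 inputs and outputs are all illustrative:

```java
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtSession;

import java.util.Map;

public class QuantizationCheck {
    public static void main(String[] args) throws Exception {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        // Assumes the FP16 conversion kept float32 graph inputs/outputs,
        // so both sessions accept and return the same Java types.
        try (OrtSession fp32 = env.createSession("model-fp32.onnx");
             OrtSession fp16 = env.createSession("model-fp16.onnx")) {

            long[][] ids = {{101, 2023, 2003, 1037, 3231, 102}}; // toy token ids
            long[][] mask = {{1, 1, 1, 1, 1, 1}};

            float maxDelta = 0f;
            try (OnnxTensor idsT = OnnxTensor.createTensor(env, ids);
                 OnnxTensor maskT = OnnxTensor.createTensor(env, mask);
                 OrtSession.Result a =
                         fp32.run(Map.of("input_ids", idsT, "attention_mask", maskT));
                 OrtSession.Result b =
                         fp16.run(Map.of("input_ids", idsT, "attention_mask", maskT))) {
                float[][] la = (float[][]) a.get(0).getValue();
                float[][] lb = (float[][]) b.get(0).getValue();
                for (int i = 0; i < la[0].length; i++) {
                    maxDelta = Math.max(maxDelta, Math.abs(la[0][i] - lb[0][i]));
                }
            }
            // In practice this loop would run over a held-out evaluation set.
            System.out.println("max logit delta fp32 vs fp16: " + maxDelta);
        }
    }
}
```

The FP16 conversion itself is typically done offline with ONNX's Python tooling before the model ever reaches the JVM; the JVM side only loads the resulting file.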
We also ran into PyTorch-to-ONNX conversion errors. Suddenly the ONNX model would yield NaN (not a number) for an input where the PyTorch model wouldn't, and we saw that happening only with specific CUDA drivers as well. So it took a bit of figuring out what was going on. In the end it was because one layer was not being converted correctly, so we decided not to quantize that layer, and that fixed it all for us.
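A sketch of the kind of NaN sweep that helps here: replay a corpus of tokenized documents through the converted model and flag any input that produces a non-finite output, then compare those inputs against the PyTorch model. The model path and input name are placeholders:

```java
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtSession;

import java.util.List;
import java.util.Map;

public class NanSweep {
    /** Reports every document whose ONNX output contains NaN or Infinity. */
    static void sweep(String modelPath, List<long[]> documents) throws Exception {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        try (OrtSession session = env.createSession(modelPath)) {
            for (int d = 0; d < documents.size(); d++) {
                long[][] ids = {documents.get(d)};
                try (OnnxTensor idsT = OnnxTensor.createTensor(env, ids);
                     OrtSession.Result r = session.run(Map.of("input_ids", idsT))) {
                    float[][] logits = (float[][]) r.get(0).getValue();
                    for (float v : logits[0]) {
                        if (!Float.isFinite(v)) {
                            // Candidate for a diff against the PyTorch model.
                            System.out.println("non-finite output for document " + d);
                            break;
                        }
                    }
                }
            }
        }
    }
}
```

On the quantization side, excluding a single problematic layer maps to options like `nodes_to_exclude` in ONNX Runtime's Python quantization tooling, though the exact mechanism depends on which converter produced the model.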
Another thing was memory leaks. As was mentioned, DJL is indeed very robust, and I can also speak from our own experience there. Yet we were unfortunate enough to hit a very rare memory leak in DJL, and it took us a while to hunt it down. We swapped in different malloc implementations to profile it, and after a while we figured it out. It really showed up because we do quite a lot of pre- and post-processing, so it was leaking quite fast and basically made our production environment unstable.
But it's running, we're happy with it, and it's now also serving our clients. To make sure it keeps running, we set up a whole Prometheus-based monitoring stack that tracks metrics like the number of inferences, inference latencies, and tokenization latency.
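A minimal sketch of what that instrumentation can look like with the Prometheus Java client; the metric names, labels, and port are made up for illustration:

```java
import io.prometheus.client.Counter;
import io.prometheus.client.Histogram;
import io.prometheus.client.exporter.HTTPServer;

public class InferenceMetrics {
    // Hypothetical metrics; Prometheus scrapes them from the /metrics endpoint.
    static final Counter INFERENCES = Counter.build()
            .name("nlp_inferences_total")
            .help("Number of model inferences performed.")
            .labelNames("model")
            .register();

    static final Histogram INFERENCE_LATENCY = Histogram.build()
            .name("nlp_inference_latency_seconds")
            .help("End-to-end inference latency.")
            .labelNames("model")
            .register();

    public static void main(String[] args) throws Exception {
        HTTPServer server = new HTTPServer(9400); // exposes /metrics for scraping

        // Example instrumentation around a (stubbed) inference call.
        Histogram.Timer timer = INFERENCE_LATENCY.labels("sentiment").startTimer();
        try {
            // predictor.predict(document) would go here
        } finally {
            timer.observeDuration();
            INFERENCES.labels("sentiment").inc();
        }
        server.close();
    }
}
```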
On top of that, we get alerted based on those metrics. What we're looking at next is increasing the GPU efficiency of our system. Right now we seem to be using around 10 to 20 percent of the GPU. I'm not sure exactly how that is measured, so we need to dig into that, but we have managed to push it to 50 percent at times. So maybe it's a matter of, say, using a TensorRT engine and loading the ONNX models directly into it. We're also looking to add more models.
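One possible route, sketched here purely as an assumption rather than the speaker's plan: GPU builds of ONNX Runtime can register the TensorRT execution provider in the session options, so the ONNX model is compiled into a TensorRT engine at load time, with plain CUDA as a fallback for unsupported subgraphs:

```java
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtSession;

public class TensorRtSession {
    public static void main(String[] args) throws Exception {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        try (OrtSession.SessionOptions opts = new OrtSession.SessionOptions()) {
            // Requires an onnxruntime GPU build with TensorRT support; device 0.
            opts.addTensorrt(0); // prefer TensorRT for supported subgraphs
            opts.addCUDA(0);     // fall back to CUDA for the rest
            try (OrtSession session = env.createSession("model.onnx", opts)) {
                System.out.println("inputs: " + session.getInputNames());
            }
        }
    }
}
```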
If you have any questions about this particular use case, about us being users of the ONNX ecosystem, let me know and reach out: find me on LinkedIn or via Hypefactors. I'd be happy to take them. Thank you, everyone.