From YouTube: TVM Conf 2020 - Day 1 - TVM at AWS
Description
0:00 - TVM at AWS, Yida Wang, AWS
10:57 - Dynamic Model Support Work at AWS
23:23 - TVM and Machine Learning Framework Integration, Wei Xiao, AWS
29:45 - Q&A
A
Hello friends, my name is Yida Wang and I'm from AWS. Today, together with some of my colleagues, we are going to update you on the latest progress of our work using TVM at AWS. In the first ten minutes, I will share with you some of our ongoing work, as well as our thoughts and plans for the near future. After that, my colleagues will introduce our recent efforts on dynamic model support in TVM and on framework integration of TVM.
A
Okay. First, let me start with some interesting questions that we are thinking about at AWS. The first one is about the feasibility of model inference, and the second is about extending from model inference to model training.
A
We have a mixture of good news and bad news at hand. On one hand, this is a typical use case that TVM can handle, which is great. However, life is not that perfect: the model will normally contain a few unfamiliar or unseen operators that may not run well, or may not run at all, on a given platform.
A
The good news is that we can tune these operators for better performance. We have AutoTVM, and recently we also have Ansor, and we may have other, more advanced tuning mechanisms in the future. That's good. However, tuning is often time consuming, especially on edge devices that have very low compute power.
A
Here is our proposed solution. Basically, we believe in data. Let me explain: we would like to have a gigantic database that stores the schedules of the operators, or in some cases sub-graphs, of all the models that we have ever seen, on all the platforms that we have ever tuned. With such a database, when we get a new request, say executing a model on a platform, we can parse the model and query the database to see whether we get a hit.
A
Ten years ago, some of my labmates at Princeton worked on supporting the ImageNet infrastructure, so at that time we were close to the ImageNet team, and I have heard plenty of stories from them about the big-data magic: big data can lead to amazing results with the same algorithm where small data cannot. The same thing happens here in our preliminary experiments.
A
We have observed this: with a large amount of collected data, our cost model, without any fancy algorithms, can produce schedules with near-optimal performance.
A
Later today, my colleague Cody will talk about the construction of the database and the cost model in more detail. We call this ongoing project Lorien.
Okay, let me continue. In addition to the immediate solution produced by the cost model, in the background we will kick off a tuner to tune the computation on the given hardware. Again, the tuning technique is orthogonal here; we can use any tuner. The tuned results are then stored back into the database.
A
More advanced, we can also consider auto-scaling and auto-quantization upon receiving a request, especially for models running on edge devices, which prefer lighter computations.
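As a rough illustration of the overall query-then-tune flow described above, the sketch below uses an AutoTVM-style tuning log as a stand-in for the schedule database; the function and file names are illustrative, not the actual Lorien code. Previously tuned records, if any, are applied at build time, while operators without a match fall back to TVM's default schedules and can be tuned in the background later.

    import tvm
    from tvm import relay, autotvm

    def compile_with_cached_schedules(mod, params, target, log_file="schedules.log"):
        # "Query the database": apply the best previously tuned records, if any.
        # Operators without a matching record fall back to default schedules,
        # and a background tuning job can fill in the gaps later.
        with autotvm.apply_history_best(log_file):
            with tvm.transform.PassContext(opt_level=3):
                lib = relay.build(mod, target=target, params=params)
        return lib

If nothing in the log matches, the build still succeeds with default schedules, which mirrors the miss-then-tune-in-the-background path described above.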
Next, let me switch the focus from inference to training. We have been talking about model inference in the TVM community for three years; how about compiler-based model training?
A
We know that XLA is such a solution, especially for TPUs, and I think people in the TVM community must also have been thinking about this for quite a while. I don't think the extension is trivial, and I'm sure you would agree here.
I would like to share with you some known unknowns that we have summarized. First, in training, all of a sudden we have many more operators to worry about, so how to write them, or rather how to auto-generate them, is a question to answer.
A
Lastly, large-model training requires a number of techniques we have barely visited in TVM yet, including distributed training, that is, how to partition the model and the data to parallelize across a number of devices, and also memory optimization, something like the trade-off between storage, communication bandwidth, and compute power. We need to consider this because the model may be too large to fit into the host memory of the device. Another interesting aspect to consider is sparsity.
A
In this case, sparsity is used to manipulate and regularize the training of gigantic models like GPT-3. These things are not new; people have been thinking about them from one aspect or another. We are working on bringing them to the TVM domain. The work is still pretty preliminary, and we are looking for collaborations.
A
We are also looking for talent to join the team at AWS. As you can imagine, we have a large infrastructure and many use cases to support us in exploring along this line. So if you are interested, drop me a line. Thanks.
A
Lastly, I want to talk a bit about human training. Here I mean training beginners to learn how to use and develop TVM. I talked about the same thing last year, and I think it is worthwhile to bring it back again. We have been frustrated by the difficulty of getting people on board for years, so we would like to provide a systematic tutorial to make the learning curve less steep. In our team, we have a successful story already: the D2L book.
A
Written
by
my
colleagues,
they
received
positive
feedback
from
readers
and
have
been
adopted
by
many
universities.
So
a
straightforward
idea
is
to
extend
this
work
from
dive
into
deep
learning
to
dive
into
deep
learning
compiler
so
and
we
are
doing
so
using
tbm
as
a
compiler.
We
brought
the
entire
pipeline
and
infrastructure
from
d2l
to
here.
We aim to put together a systematic tutorial for beginners who want to use TVM. We started this effort last year but are still about halfway; the major part on operators is done.
A
I mean that operators like matrix multiplication, convolution, and pooling are defined and optimized on both CPUs and GPUs in this tutorial. However, we are now short of hands for extending it to graph-level compilation, the Relay part: things like how to use Relay to represent a neural network, and how to run Relay passes such as constant folding, operator fusion, data layout transformation, and so forth. In addition, there may be some interesting recent work to be added into this tutorial.
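As a minimal sketch of what those graph-level passes look like in code (the particular pass list and layout choice below are illustrative, not taken from the tutorial):

    import tvm
    from tvm import relay

    def run_graph_passes(mod):
        # Graph-level optimizations mentioned above: constant folding,
        # data layout transformation, and operator fusion.
        seq = tvm.transform.Sequential([
            relay.transform.FoldConstant(),
            relay.transform.ConvertLayout({"nn.conv2d": ["NHWC", "default"]}),
            relay.transform.FuseOps(fuse_opt_level=2),
        ])
        with tvm.transform.PassContext(opt_level=3):
            return seq(mod)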
B
However, when we talk about object detection models from TensorFlow and PyTorch, they both have dynamic structures such as control flow, tensor arrays, and dynamic-shape operators. These require different handling in three major parts. The first is that the front-end parser needs to handle these dynamic structures. The second is that we need to implement the dynamic operators in Relay and TOPI.
B
This is the major work for front-end loop handling.
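The conversion details are not captured in this transcript segment, but as a minimal illustration, framework branches and loops are lowered into Relay's functional control-flow constructs (relay.If and recursive functions). The toy example below only shows the shape of the IR the parser targets; the values are arbitrary:

    import tvm
    from tvm import relay

    # A toy conditional: i < 10 ? i + 1 : i
    i = relay.var("i", shape=(), dtype="int32")
    cond = relay.less(i, relay.const(10, "int32"))
    body = relay.If(cond, i + relay.const(1, "int32"), i)
    mod = tvm.IRModule.from_expr(relay.Function([i], body))
    print(mod)  # shows the relay If node; framework loops become recursive Relay functions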
The next part for the front-end is the tensor array optimization. The tensor array is a widely used data structure in these dynamic models; basically, it is a list of tensors. Theoretically, the tensors in a tensor array can have arbitrary data type, data shape, and even dynamic tensor rank. However, dynamic tensor rank is very rare in deep learning models, so this gives us an opportunity for further optimization.
B
These are the two major parts of the front-end enhancement to support dynamic models.
Next, I'll take non-maximum suppression as an example to see how we support such a complicated dynamic operator in Relay and TOPI. Non-maximum suppression has the number of bounding boxes as a variable, and computing that number ahead of time would require almost the whole NMS computation, which is not feasible.
B
At this point, we have already obtained the actual number of bounding boxes, so we can remove, or prune, the unselected boxes with the strided_slice operation.
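A heavily simplified sketch of that idea follows; the shapes and arguments are illustrative, not the actual NMS kernels. Once the valid box count is known at runtime, the padded box tensor is sliced down with a dynamic strided_slice so only the selected boxes flow into the rest of the graph:

    import tvm
    from tvm import relay

    boxes = relay.var("boxes", shape=(relay.Any(), 6), dtype="float32")  # [num_boxes, 6]
    valid = relay.var("valid_count", shape=(1,), dtype="int64")          # known only at runtime

    # Keep only the first `valid_count` rows; the output shape stays dynamic.
    pruned = relay.strided_slice(
        boxes,
        begin=relay.const([0, 0], "int64"),
        end=relay.concatenate([valid, relay.const([6], "int64")], axis=0),
        strides=relay.const([1, 1], "int64"),
    )
    mod = tvm.IRModule.from_expr(relay.Function([boxes, valid], pruned))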
This is how we handle the dynamic operator part of our work. Next, Haichen will present the backend runtime system that supports dynamic models.
C
You can check out last year's presentation for more background. As previously described by Yao, we can now convert the object detection models from TensorFlow to Relay. Next, we enhanced the TVM compiler by adding a dynamic type system, supporting shape functions that compute the output shapes at runtime, and adding a set of optimization passes, including memory planning and device placement, that are aware of dynamic shapes and control flow. Lastly, we also enhanced the symbolic code generation to improve performance.
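As a rough sketch of what one of those shape functions looks like (this follows the general TVM pattern rather than the exact code from the talk, and the operator name in the comment is hypothetical): a small hybrid-script routine computes the output shape from the input shape at runtime, and is registered for the operator so the runtime can plan memory for dynamic outputs.

    from tvm.te.hybrid import script
    from tvm.relay.op import op as _op

    @script
    def _same_shape_func(data_shape):
        # Output shape equals the (possibly dynamic) input shape.
        out = output_tensor((data_shape.shape[0],), "int64")
        for i in const_range(data_shape.shape[0]):
            out[i] = data_shape[i]
        return out

    def same_shape_func(attrs, inputs, _):
        return [_same_shape_func(inputs[0])]

    # Registration for a (hypothetical) operator; False means the function only
    # needs the input shapes, not the input values:
    #   _op.register_shape_func("my_custom_op", False, same_shape_func)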
C
The kernel library is hardware dependent and is highly optimized for the specific hardware. Later, these two objects are loaded into the lightweight virtual machine, which is the runtime that interprets the instructions, executes the model, and invokes the kernels. To give a little more detail about this project: we introduced the Any dimension, which represents a dimension that is unknown at compilation time.
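A minimal sketch of the Any dimension (the shapes and the ReLU body are just for illustration): the first dimension is left unknown at compile time, and the Relay VM resolves the actual size at runtime.

    import numpy as np
    import tvm
    from tvm import relay

    # First dimension (e.g. the number of boxes) is unknown at compile time.
    x = relay.var("x", shape=(relay.Any(), 4), dtype="float32")
    mod = tvm.IRModule.from_expr(relay.Function([x], relay.nn.relu(x)))

    # The VM runtime handles dynamic shapes; the static graph executor does not.
    run = relay.create_executor("vm", mod=mod, target="llvm").evaluate()
    out = run(np.random.rand(7, 4).astype("float32"))
    print(out.shape)  # (7, 4): the concrete size is only known at runtime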
C
Next, I will talk about a new feature called the operator strategy. This focuses on how an operator defined in Relay is lowered to a kernel implementation. We now have more and more kernel implementations in TVM, as well as in third-party libraries.
C
The operator strategy provides a mechanism, or an interface, that allows developers to program the kernel-implementation selection process, and it also helps the compiler select the best kernel implementation wherever possible. Take the conv2d operator as an example. In the first step, based on the target you compile to, it invokes the corresponding strategy function. In this case, if CUDA is our target, we invoke the CUDA strategy function and then proceed based on the operator's attributes.
For
example,
if
the
operator
has
the
nc,
the
continuity
has
nchw
layout,
it
will
include
the
nchw
top
implementation
and
if
the
kernel
size
is
no
more
than
seven,
then
it
will
also
include
the
window,
gray,
topi,
kernel,
implementation
and
third,
if
the
cooldn
library
is
enabled
in
the
target,
it
will
also
include
the
coding
implementation
kernel
information
in
the
strategy
and
then,
after
that,
it
will
then
query
the
other
tvm
log.
C
If the log exists, it checks the latency of each implementation and then uses the one that gives the lowest latency to compile this kernel.
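A condensed sketch of what such a strategy function looks like follows; it is modeled on, but not identical to, the strategy code in TVM's source, and the kernel-size threshold and priority levels are illustrative.

    from tvm import topi
    from tvm.relay.op import op as _op
    from tvm.relay.op.strategy.generic import wrap_compute_conv2d, wrap_topi_schedule

    def conv2d_strategy_cuda_sketch(attrs, inputs, out_type, target):
        strategy = _op.OpStrategy()
        _, kernel = inputs
        if attrs.data_layout == "NCHW":
            # 1) Default direct NCHW TOPI implementation.
            strategy.add_implementation(
                wrap_compute_conv2d(topi.cuda.conv2d_nchw),
                wrap_topi_schedule(topi.cuda.schedule_conv2d_nchw),
                name="conv2d_nchw.cuda",
            )
            # 2) Winograd, only for small kernels (the talk mentions <= 7).
            kh, kw = int(kernel.shape[2]), int(kernel.shape[3])  # OIHW weight layout
            if kh <= 7 and kw <= 7:
                strategy.add_implementation(
                    wrap_compute_conv2d(topi.cuda.conv2d_nchw_winograd),
                    wrap_topi_schedule(topi.cuda.schedule_conv2d_nchw_winograd),
                    name="conv2d_nchw_winograd.cuda",
                    plevel=5,
                )
        # 3) If cuDNN is enabled in the target, add it as another candidate.
        if "cudnn" in target.libs:
            strategy.add_implementation(
                wrap_compute_conv2d(topi.cuda.conv2d_cudnn,
                                    need_data_layout=True, has_groups=True),
                wrap_topi_schedule(topi.cuda.schedule_conv2d_cudnn),
                name="conv2d_cudnn.cuda",
                plevel=25,
            )
        return strategy

After the candidates are collected, the compiler consults the AutoTVM log as described and picks the implementation with the lowest measured latency.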
Lastly, let's look at the evaluation results of the object detection models. We evaluate them on an EC2 m6g.8xlarge instance, which is an Arm-based instance with 32 Arm cores.
C
We use the Docker image provided by Arm that comes with TensorFlow and PyTorch pre-installed, and we use TVM to compile and optimize the models from TensorFlow and PyTorch, comparing the performance against the native frameworks. We can see that TVM is slightly faster than TensorFlow on the SSD MobileNet model, and it is also about 1.4x faster than TensorFlow on the Faster R-CNN ResNet-50 model.
C
For PyTorch, the version in this Docker container is too low, so it cannot execute the Faster R-CNN ResNet model natively. TVM can still compile and run it, but because there are so many dynamic-shape operators in the Faster R-CNN model from PyTorch, the latency is still higher than what we get from TensorFlow. We will continue working on improving the performance of TVM's support for PyTorch models.
D
These numbers were obtained in October 2020, and they keep increasing all the time. So how do we solve this problem? If we try to add support for each operator, it is going to take a long time, so the solution we came up with was to do a partial compilation of a model. I'm going to use an example here: the AlphaPose model from the GluonCV model zoo.
D
In this model we figured out that we can have these subgraphs, numbered 0 to 4; these are the blue circles in the picture and are completely supported by TVM. Then we have the orange circles, which are operators not supported by TVM. By having a runtime that is a combination of the framework runtime and the TVM runtime, we are able to do inference on such models. This work has been released in the Amazon SageMaker Neo compilation service, as well as the SageMaker hosting service.
D
In this slide, I will show you a little more detail about the partially compiled model. With a simple cat and grep command, you can see the JSON file at the bottom. You will see that there are five of these operators, which are called TVM subgraph ops; these are the blue circles.
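In Python, that inspection looks roughly like the sketch below; the file name and the exact JSON keys are assumptions for illustration, so check the actual compiled artifact for the real field names.

    import json

    # Hypothetical file name for the partially compiled model's graph JSON.
    with open("compiled_model-symbol.json") as f:
        graph = json.load(f)

    # Count the nodes that were replaced by TVM subgraph ops (the blue circles).
    tvm_nodes = [n for n in graph.get("nodes", []) if "tvm" in n.get("op", "").lower()]
    print(len(tvm_nodes), "TVM subgraph ops")
    for n in tvm_nodes:
        print(" ", n.get("name"))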
D
On the GPU instance, it has a speedup of 1.23x. Now I'm going to show you a few more models in the TensorFlow and PyTorch frameworks, as well as the MXNet framework. We have a ResNet model here, Inception, and YOLO, with different speedup numbers on both CPU and GPU instances.
E
And given that there are no questions right now, maybe I'll just ask one question, so we can finish off with some thoughts about the future of TVM.
A
On this question, this is Yida from AWS; maybe I can speak a bit about that. The interesting thing to report is that this video was recorded before re:Invent, but after that, as you know, AWS announced a new piece of hardware called AWS Trainium.
It is special-purpose hardware for large-scale training, and we are actually working on its compilation chain. As with AWS Inferentia, we will continue to utilize whatever features TVM provides.
A
We would definitely reuse things instead of reinventing them. Since it is a special-purpose accelerator, there are a lot of interesting challenges inside, and that is something we are interested in working on. We will keep the community updated, to find support and maybe also find collaborations whenever possible. The other thing I would like to point out in terms of new hardware is something that I think many speakers have already mentioned.
A
It's
about
potentialization
so,
broadly
speaking,
is
to
try
to
utilize
those
customized
compute
units
in
an
efficient
way,
so
in
general
recorded
tensorization
right.
So,
whether
it's
it's
this
historical
array
or
just
like
that
in
so-called
nvidia
gpu,
these
kind
of
things
that
would
be
interesting
for
the
entire
community
to
think
about
a
generic
way
to
utilize
those
spatial
purpose,
computer
units
yep.
That's
it
thanks.
E
Great, thank you very much, Yida. I think the rest of the community is extremely excited about further developments on training and on supporting tensorization in the next year, and I'm sure that at the next conference you'll hear a lot more about some early achievements on both fronts.