Chips Alliance Workshops, 12 Jan 2023

From YouTube: Data Sets for ML Chip design


A

Sorry event, coordinators, my side job so much.

B

Aman, right? Yes.

B

There you go: oh yeah.

B

Can you guys see my screen?

A

Yep we can see your screen.

B

Okay, perfect, thank you. My name is Aman Arora. I'm a PhD candidate and a graduate fellow at UT Austin. A little bit of background on why I'm presenting this talk here today: I was a leader in the AI FPGA committee in the OSFPGA Foundation, which recently joined hands with CHIPS Alliance.

B

So this project that I'm presenting here is actually something that we have been working on at UT in my research lab, and I was trying to get some traction on it in the committee at OSFPGA. Now that everything is a part of CHIPS Alliance, through this presentation I'm going to give you an overview of this project and what we have done so far, and also do a call for contributions towards the end, to see

B

if people are interested in this kind of work and if there is scope for it. We will be submitting this project as a sandbox project under CHIPS Alliance and seeing which working group it fits under, or maybe starting a new working group, or something like that. So that's the background. The title of the talk is "Data Set for ML-Guided Chip Design". I'm going to get started now, but before I actually go into the details of this talk,

B

I want to thank Zhigang Wei, who is a PhD student in our lab. He's the person who's doing the hands-on work on this project. I also want to thank my advisor, Lizy John, who has been funding all of us so far. All right, so ML, or machine learning, has been used for chip design in a lot of work. I'm showing here a few of the papers that I've come across in the recent past. The works in this area

B

range from applying ML to do placement or floorplanning better, which is the famous paper from Google, to predicting some metric, like predicting resource usage for an FPGA design using an ML model, or trying to predict the power consumption of running some code on an FPGA or a GPU, given the performance counters when that application is run on a CPU. And specifically, because I come from the FPGA field,

B

that's my main area of research. There are a lot of papers applying ML to predict or improve the estimates of area, frequency, power consumption, etc. that are generated by high-level synthesis tools, for example. So all of these experimental projects that use ML for chip design need data sets, and generating a data set for these projects is actually pretty time consuming.

B

It requires tools and licenses that may be proprietary. It requires a person to write scripts to run the tools and parse the results. Of course, it needs a lot of machines

B

to run those tools, and somebody has to curate the whole thing to make sure that the data set is actually usable. I shouldn't say all, but we looked through many of the projects that I flashed on the previous slide, and we saw that they had proprietary data sets that are not available in open source, and for every project

B

a new data set is created on an ad hoc basis, and therefore we have a lot of custom data sets that exist right now. And if you look at the contents of the data sets used by the plethora of studies out there, it's not a large variety. There is a small subset of types of data that keeps getting used in most of these projects, and some of them I'm listing here:

B

graphs of netlists of HDL designs; signal activity along with the graph of the netlist; performance counters of a C application running on hardware; power consumption, either measured from a tool or measured on a board; FPGA resource usage and timing for a particular HDL design; and 2D images of floorplanned and placed-and-routed circuits.

B

So these are the five or six types of data that we came across, and it turns out that many of these projects can actually reuse this data. They don't need to recreate the data set, but for that, data sets need to be present in open source so they can be used by multiple people.

B

So we believe that open-source data sets can be very useful to the research community, and we actually started this project quite a long time ago. We only have one person working on this, and not even full-time, and there are a lot of other things competing for focus while we are developing this data set. Until very recently, we believed that the data set we will present right now was the first one

B

that would be available, but we recently saw, just in October 2022, I think it was at ICCAD, I don't remember which conference, that one was released by Peking University. It's a data set called CircuitNet, and

B

it's a very small data set that covers mostly floorplanning, placement and routing type of information for ASIC designs. What we are working on is this thing that we are calling the Chip Design Data Set, CD²S. It has a set of HDL designs and a set of C applications. Rather, I should say it has data collected from a set of HDL designs and a set of C applications. The HDL designs we are sourcing from OpenCores.

B

There is a bunch of designs in VTR, Verilog-to-Routing; there's a benchmark suite called Koios inside VTR; and there is NVDLA and some other sources of open designs. So we are taking these designs and taking them through different flows to generate the kind of data in the data set. Similarly, we are taking C applications from PolyBench, CHStone, MachSuite, etc. and generating similar data. For C applications,

B

we first take them through an HLS flow and then generate the data, because eventually we want to implement these C applications on hardware. Now let me give you a quick overview of what kind of data we are collecting. There are features.

B

Every model that you train needs features and some metrics that are typically the target metrics for training the model. The kinds of features that we are collecting are: the number and size of primary inputs and outputs, the number of operators, the number of memory bits, the size of the design, what application the design is from, the number of registers, signals or FSMs in the HDL designs, and the number of basic blocks,

B

conditionals, etc. in the C applications. And then we have some metrics that we are collecting. For each design, we are collecting how much area it consumes. The metric for that is in terms of resource usage for FPGAs, but just an area number for ASICs. And then we are collecting power

B

consumption numbers, wirelength numbers, operating frequency, etc. We want to do this for multiple FPGA devices from multiple FPGA vendors, because we want this data set to be usable by many people; just collecting data for one device or one vendor is not enough. Similarly, on the ASIC side, we want to be collecting data for multiple ASIC libraries or multiple PDKs, and also for multiple implementation settings, which in the case of C designs refers to different settings of HLS pragmas, and in the ASIC world refers to

B

different gate-level synthesis options, and multiple process corners also. So we are trying to make this data set exhaustive enough that it covers a lot of the studies being done by researchers, and so that people don't have to redo a lot of the work involved in generating a data set. Now, before we actually publish this data set,

B

we also want to run some case studies to make sure that this data set is valuable enough. This chart here is showing the space of case studies that we are thinking of undertaking. We will not be doing all of them. We have some in mind that we are working on right now, but the idea is that, let's say, the user wants to train a model,

B

an ML model, and the user can give either C code or Verilog code as its input. That is why we are collecting data for C applications as well as HDL designs. The kind of input required for training on these data sets can be

B

features generated from the RTL, or features generated from synthesis outputs, or features of the FPGA that the design is being implemented on, or features of an ASIC library. All of that can be training input, and the metric that we might want to predict can be

B

the power consumption of a given C code running on a specific FPGA, or the resource usage, or the operating frequency, for example. We want to target both FPGAs and ASICs, and I'm showing here a stack of FPGAs and a stack of ASICs because we want these case studies to do cross prediction also, predicting from one FPGA to another.
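
As a rough illustration of the feature and metric split described above, one row of such a data set could be modelled as below. This is only a sketch: the class name, field names, and the `fpga_a` label are hypothetical and are not taken from the project's repository, and the values are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DesignRecord:
    """One row of a hypothetical chip-design data set (names are illustrative)."""
    design: str                  # design or benchmark name
    source: str                  # "hdl" or "c" (C designs go through HLS first)
    target: str                  # FPGA device or ASIC library label
    # Features of the kind listed in the talk
    num_primary_inputs: int
    num_primary_outputs: int
    num_operators: int
    memory_bits: int
    # Target metrics; None means "not collected for this run"
    resource_usage: Optional[int] = None
    power_mw: Optional[float] = None
    fmax_mhz: Optional[float] = None


FEATURE_NAMES = ("num_primary_inputs", "num_primary_outputs",
                 "num_operators", "memory_bits")


def to_training_pair(rec: DesignRecord, metric: str):
    """Return (feature_vector, target) for one metric, or None if it is missing."""
    target = getattr(rec, metric)
    if target is None:
        return None
    return [getattr(rec, name) for name in FEATURE_NAMES], target


example = DesignRecord(
    design="example_design", source="c", target="fpga_a",
    num_primary_inputs=4, num_primary_outputs=2,
    num_operators=37, memory_bits=2048,
    power_mw=120.5,  # made-up value for illustration
)
print(to_training_pair(example, "power_mw"))
print(to_training_pair(example, "fmax_mhz"))
```

Keeping the metrics optional lets the same record type hold runs where only some of the tools were available.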

B

So right now the case study that we are working on follows this path in this chart, the part that just got highlighted in yellow. Let me define the problem for which we have designed the case study. We are training a model that will take in a piece of C code, or a C application of a user. It will be trained on RTL features and HLS outputs.

B

It will predict the power consumption for that C code running on a particular FPGA, and actually train on one FPGA and predict on another FPGA. That's the case study we are working on right now. This case study is just the first one. We hope to do more case studies in this space and establish that this data set is useful enough.
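
The train-on-one-FPGA, predict-on-another setup can be sketched in a few lines. The talk does not say which model the case study uses, so the single-feature least-squares line below is a stand-in chosen for brevity, and every row is synthetic; the device labels `fpga_a` and `fpga_b` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mean_abs_pct_error(model, xs, ys):
    """Average of |prediction - truth| / |truth| over the given rows."""
    a, b = model
    return sum(abs((a * x + b) - y) / abs(y) for x, y in zip(xs, ys)) / len(xs)


# Synthetic rows: (device, feature value, measured power). Not real data.
rows = [
    ("fpga_a", 1.0, 3.0), ("fpga_a", 2.0, 5.0),
    ("fpga_a", 3.0, 7.0), ("fpga_a", 4.0, 9.0),
    ("fpga_b", 5.0, 12.0), ("fpga_b", 6.0, 13.0),
]
train = [(x, y) for dev, x, y in rows if dev == "fpga_a"]
test = [(x, y) for dev, x, y in rows if dev == "fpga_b"]

model = fit_line(*zip(*train))                  # fit only on the first device
error = mean_abs_pct_error(model, *zip(*test))  # evaluate only on the second
print("slope, intercept:", model)
print("cross-device error:", error)
```

The point of the split is that no row from the second device is seen during fitting, so the reported error reflects transfer between devices rather than fit quality on one device.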

B

So where are we right now? This is the link, and I'll quickly flash this GitHub link. There is the QR code, by the way, but this is what the GitHub repository looks like. Right now it's under our lab's GitHub project, and there is some documentation.

B

It definitely needs to be improved, but at the top level there are two directories in it, ASIC and FPGA, and if you, for example, go into ASIC, there is some documentation of where the data is, how it can be used,

B

etc. Right now we have two types of data. One is data in CSV files, and one is data in tarballs, because there was some data that we wanted to make a part of this data set, but it was huge, multiple gigabytes of data that we didn't want to put on GitHub. So we have created tarballs for that kind of data. But some simple data, like information about resource usage or information about timing, etc., we have parsed ourselves and put in CSV files.
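
For the CSV part, consuming such a file needs nothing beyond the standard library. The sketch below uses an in-memory stand-in for a file; the column names and all numbers are assumptions made for illustration and do not come from the actual repository.

```python
import csv
import io

# Stand-in for one of the data set's CSV files; column names are assumptions
# and the numbers are made up.
csv_text = """design,device,luts,fmax_mhz
gemm,fpga_a,1200,210.5
gemm,fpga_b,1350,180.0
atax,fpga_a,800,250.0
"""


def load_rows(handle):
    """Read CSV rows and convert the numeric columns from strings."""
    rows = []
    for row in csv.DictReader(handle):
        row["luts"] = int(row["luts"])
        row["fmax_mhz"] = float(row["fmax_mhz"])
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(csv_text))
fpga_a = [r for r in rows if r["device"] == "fpga_a"]
print(len(rows), "rows,", len(fpga_a), "for fpga_a")
```

With a real file, `load_rows(open(path, newline=""))` would take the place of the `io.StringIO` wrapper.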

B

So that's what the data set looks like; that's where we are. The current focus is collecting data for FPGA; we're not focusing on ASIC right now. We have a sufficiently large number of HDL designs and generated designs from C applications, and we have some funding for this project as well. We recently applied for funding with Meta for this project, and

B

this is planned to be an open-source project. The next steps for this project are: on the FPGA flow side, we want to continue collecting data for MachSuite and CHStone. Right now we only have data for PolyBench benchmarks. On the Verilog side,

B

we are currently parsing contents from Yosys reports, but we also want to run VTR and Vivado, and maybe Quartus, and parse reports from those and generate data for that. The ASIC flow is something we haven't started yet. So, as I said earlier, we want to bring this project to CHIPS Alliance. We are going to submit this as a sandbox project soon, and the call to contribute is to help us build this data set, make it bigger and more valuable, and

B

use this data set, find issues, find bugs, etc. in the data set while you contribute. The kind of work involved will be writing scripts, running these tools, and parsing data and collecting it.
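
The script-writing work mentioned here is mostly of the following shape: pull numbers out of a tool report and append them to a CSV row. The excerpt below is written in the style of a Yosys `stat` report, but report layouts differ between tool versions, so the format, the regular expression, and the design name `top` should all be read as assumptions rather than as the project's actual parser.

```python
import csv
import io
import re

# Excerpt written in the style of a Yosys "stat" report. The exact layout
# differs between tool versions, so treat this format as an assumption.
report = """
=== top ===

   Number of wires:                 10
   Number of wire bits:             40
   Number of cells:                 12
"""

# Matches lines such as "   Number of wire bits:   40".
LINE = re.compile(r"^\s*Number of ([\w ]+?):\s+(\d+)\s*$", re.MULTILINE)


def parse_report(text):
    """Map each 'Number of X: N' line to {'X_with_underscores': N}."""
    return {key.replace(" ", "_"): int(value) for key, value in LINE.findall(text)}


stats = parse_report(report)

# Write one CSV row per design; columns are sorted for a stable header.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["design"] + sorted(stats))
writer.writeheader()
writer.writerow({"design": "top", **stats})
print(out.getvalue())
```

A parser per tool and version, each emitting the same column names, is what lets reports from different flows land in one table.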

B

So in summary, ML is being used in chip design processes by so many researchers and so many companies out there, but the data sets used by those projects are not open source. We want to build an open-source data set, and that's why we at UT Austin are working on CD²S. We hope that people will be interested in this kind of work and contribute.

B

All right. That is the last slide, so I will stop sharing. Please ask questions if you have any.

A

All right. I don't know if that was mentioned in the presentation, but I think I didn't see it: the repo does not contain a license. I assume it's just a temporary thing, but just make sure that you put an Apache license there so that we can smoothly onboard it into CHIPS when the right time comes. Perfect.

B

Thank you, thanks for noting that. Yeah, we're missing the license, and

A

I asked this question not just to be smart, but more because very often, when there's no license, it implies that there is a problem with the license, or someone doesn't have crystallized plans for what the license is going to be. I think that's not the case here, right? You're actually trying to get it into CHIPS Alliance, which requires Apache.

A

So it's kind of an obvious thing to add, which raises the confidence of people when they look at the repo. They're like, oh yeah, it's Apache, it's fine, I can use that. And of course the follow-up question is, and I don't want to make things hard for you, but when you generate this data, and I'm not an expert, I'm not a lawyer, just make sure that you can actually license it under Apache, because AI is complicated in that way, where the source material, the kind of data you parse, can influence the end results of this whole endeavor,

A

So to say, the output.

B

Okay, gotcha. Yeah, I think we have been having some discussions in general about sourcing data from commercial tools, but

B

we were not specifically thinking about whether we are allowed to publish it under the Apache license.

A

And of course CHIPS Alliance has a legal committee that potentially could help in figuring that out. I'm not saying that we have a very strong track record of figuring out AI-for-chip-design data sets as such, right?

A

It's a fairly new field, I would say, but certainly there are lawyers involved, right? So it doesn't have to be just developers talking to developers and trying to figure out if we have a good understanding of the law; we can actually get professional help.

B

That's a good idea. I'll reach out to Rob. I hope you can connect me to the legal people right now.

D

My favorites, don't quote me on that, please. I'm being recorded, so now I'm in trouble. I did want to ask you one question, and I enjoyed your talk. Have you been working at all with Si2? I know there's some initiative; I was meeting with Professor Andrew Kahng here about a month back in San Diego, and Tom Spyrou. I know there's some initiative to create a standardized API for collecting the metrics that you would need relative to chip design, rather than having to endlessly parse reports,

D

never my favorite topic, I might add. But I'm just curious if you guys are working on that at all.

D

This is under Si2, which is, I don't want to say it's a proprietary organization, that's not quite right, but it is heavily participated in by the EDA industry. It may help resolve some of the questions or concerns, assuming that they make this API publicly available under the Apache 2.0 license that Michael was correctly sharing, right? And of course that is a concern relative to using data that is generated by any of the proprietary solutions. So.

C

So I had this question, and it ties into this as well. I saw you were talking about the FPGA flows that you were using, and you had thought about the ASIC flow. I was wondering if you have looked at using Edalize to target a lot of different FPGA and ASIC targets. Related to that, it could also be a good place to put this kind of report parsing in a centralized place, two birds with one stone.

B

Yeah, we haven't looked at Edalize. I am actually slightly familiar with it. I remember questions about it in my VTR group, but we haven't looked into it at all for this purpose.

D

Great. Any more questions, or one more question?

D

Okay, thank you so much for your talk, Aman. Really appreciate it. Thank you.

Okay, so our next presenter is Michael Gielda from Antmicro, co-founder of Antmicro and VP of outreach and development. Michael helps me quite a bit on CHIPS Alliance, so I really do appreciate that. I think this will be a great talk.

A

Let me call in and see if it all works.

A