Cloud Native Computing Foundation KubeCon + CloudNativeCon North America 2022, 11 Nov 2022

From YouTube: Training AI To Code Using the Largest Code Dataset - Tommy Li & Animesh Singh, IBM

Description

Speakers: Animesh Singh, Tommy Li
Project CodeNet is a large dataset of 14 million code samples totaling 500 million lines of code in 55 programming languages. It enables machine learning for code, like finding code similarity, extracting semantic context, and even translating between different programming languages. Using the Machine Learning Exchange (MLX), a Linux Foundation for AI & Data Sandbox Project, we demonstrate how Project CodeNet can be leveraged to classify code and analyze code complexity in three steps. Using DataShim we turn domain specific subsets of the data into Kubernetes Custom Resources. Running Jupyter notebooks on Kubernetes we use the datasets to train deep learning models. The models are then served for inferencing as Kubernetes Custom Resources using KServe. For each of these steps, MLX generates Kubeflow Pipelines on Tekton so data scientists are not required to write Kubernetes specific code.

A

Thanks, everyone, for coming to the last session of the day. My name is Tommy, and I'm a senior software developer at IBM. Today we are going to talk about how to train AI to code using the largest code dataset, called CodeNet, which we have released. In the past few decades we have focused a lot on training AI on natural languages, but there are also other forms of language that we use on a day-to-day basis.

A

We have different kinds of representations for molecules, stories, imagery, and especially for code. Today we want to focus on building AI for code. As part of our initiative in IBM Research, we have a new program called AI for Code, which focuses on teaching the machine the language of machines. It is a very new field in computer science, and it leverages a lot of core concepts such as NLP, document understanding, code analysis, and compilation techniques.

A

The goal of this whole initiative is to build new models that help automate the software engineering process and improve developer productivity. In addition to that, we want to create tools that help software developers with practical tasks such as code search, summarization, completion, and code translation.

A

With all these tools, we are hoping to accelerate the effort to modernize legacy software and to migrate monolithic software into new microservice architectures. If you look at the past, when ImageNet was first announced it had about three million images and 5,000 classes.

A

Within a span of just three years, they were able to expand it to 14 million images and 22,000 classes, so the dataset grew very fast in the span of a few years.

A

With a big dataset like ImageNet, developers and scientists were able, within just five years of development and research, to go from an error rate of more than 30 percent at the beginning to human-level accuracy. We want to do the same thing with AI for Code, and we see a lot of business use cases that need it.

A

We see that a lot of government, banking, and automotive software still uses legacy code like COBOL. There are more than 220 billion lines of legacy COBOL code, and we don't have that many developers who are able to translate it into modern technologies.

A

So this is why we need to develop tools to accelerate this effort, so we can migrate legacy technology into new architectures. When we look back at what happened in natural language processing, deep learning efforts were introduced in the early 2010s.

A

With deep learning, GPU acceleration, and all the research, we were able to achieve human-level accuracy within just five or six years, and at IBM we were able to produce a lot of NLP-based services and document understanding using these new AI technologies. We want to focus the same technologies on code as well, so they help us do NLP and ML for software artifacts, automated reasoning, and decision making. Furthermore, we want to be able to explain how the code has been generated.

A

So we create explainable AI as well. For all this, we need to create a new open source dataset, similar to ImageNet, that developers can use to build code language translation, code search, code similarity, code performance improvement, code memory improvement, and code classification for examples and algorithms, and then apply them to more specific use cases. This is why we want to announce Project CodeNet.

A

We released the Project CodeNet dataset to open source last year. It is a high-quality code dataset for algorithmic innovation and benchmarking. It is very large scale: there are 14 million code samples across more than 4,000 coding problems, and it covers 55 different languages, from legacy languages like COBOL to modern languages such as C++, Java, and Python. All of the code is well tested, and the dataset also provides various test cases.

A

So you can choose to train on different classes, and you can understand how the code ran and how it failed. From some of the examples we have seen, one potential use case is modernizing legacy code. One of our clients is an automotive company with a very old stack of legacy Java code.

A

They have multi-version monolithic applications with more than 3,500 Java files and more than a million lines of code, and we wanted to migrate them to a modern architecture. In our initial effort, when we estimated how many people and how much effort we would need, it would have taken us at least a year to migrate all this code to modern Java versions.

A

That's why we initiated the AI for Code project and built models to help us accelerate the re-architecting and find the code that needs to be refactored.

A

These models helped our developers reduce the time from a year down to about four weeks to modernize all this code into 25 different microservices and 450 different Java classes running on the latest Java versions.

A

Furthermore, the AI is able to work out how many runtime and data dependencies we have in this code, and it exposes any dead code that is no longer being used or is no longer suitable for the new applications.

A

When we open sourced Project CodeNet, we wanted to put this data out in the open so that open source developers could create new algorithms and new methods to train on these datasets, and new ways to compute over this data using data parallelism and model parallelism. Once there is enough research, anyone can plug it into their AI system stack, fit it to their own specific datasets with transfer learning, build their own models, and put them into their own business pipelines.

A

Once all this effort has been done, we are able to use these newly created models to enhance business value, for example by modernizing legacy code and boosting developer productivity. So now I want to dive into what this code dataset is and what it actually contains. As we described, the CodeNet dataset has a lot of languages: 55 different languages.

A

Most of the code is written in modern languages like C++, Python, Java, C, and Ruby, but we also have legacy languages like COBOL.

A

You can use those to do code translation and to understand how legacy code solves the same coding problems as modern code.

A

If we break down what kind of problems each language contains, 80 percent of the problems have more than 100 solutions in each of these 55 languages. More than half of the submissions are accepted, so they are working examples, but the other half gives you examples with mistakes such as runtime errors or memory errors. You could use those to figure out, when a developer writes one kind of solution, how to optimize it and point them to the right way to recode it.

A

This is how we collected all the data. We talked to two online judge websites, Aizu Online Judge and AtCoder, and asked them to help collect the data for us. It contains more than 4,000 problems and about 14 million submissions.

A

As we described before, more than half are accepted solutions, about 30 percent are wrong answers, and about 16 percent were rejected for other reasons, such as memory errors, runtime errors, and so on.

A

Within these 55 languages, the main six are C++, Python, Java, C, Ruby, and C#, and they come in different versions as well, so there are solutions for different versions of C++ and Java. For C++ there are more than eight million submissions, more than four million of them accepted, so C++ is the biggest part of CodeNet.

A

Let me break down how this data is structured. Each sample is a complete program in a particular programming language.

A

Each program consists of a single file and attempts to solve a particular programming task or problem. Many problems have multiple solutions, with multiple submissions, different runtimes, and different approaches to tackling the problem in all the different languages. Once we had collected all this data, we made sure it is all licensed under CDLA Permissive v2, which is defined by the Linux Foundation.

A

So it is good for every open source developer to use for development and research on their own models.

A

So let's dive a little bit into what kind of metadata is provided. For each problem, we define an ID, a name, a time limit, a memory limit, and the complexity required for that problem. At the submission level, each submission records how much CPU time and memory it used.

A

It also records the accuracy, because we have tons of test cases available to calculate how many of them a solution satisfies. Once a submission has been judged, whether or not it passed, a status is output. This is how you can filter and categorize the submissions: a submission does not just pass or fail, you can also determine why it failed.

A

For example, it may have exceeded the time limit or the memory limit, or it may have had a runtime error or wrong output.
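
To make the status-based filtering above concrete, here is a minimal sketch in Python. It assumes the submission metadata is available as CSV; the column names, status strings, and sample rows below are illustrative assumptions, not taken from the talk.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of a per-problem submission metadata file.
# Column names and status values are assumptions for illustration.
SAMPLE_METADATA = """submission_id,problem_id,language,status,cpu_time,memory,accuracy
s001,p00001,C++,Accepted,10,3200,4/4
s002,p00001,Python,Wrong Answer,20,5600,1/4
s003,p00001,Java,Time Limit Exceeded,2000,24000,0/4
s004,p00001,Python,Accepted,30,5700,4/4
s005,p00001,C++,Runtime Error,10,3100,0/4
"""


def load_submissions(text):
    """Parse CSV metadata into a list of dictionaries, one per submission."""
    return list(csv.DictReader(io.StringIO(text)))


def filter_submissions(rows, status=None, language=None):
    """Keep rows matching the given status and/or language (None means any)."""
    return [
        row for row in rows
        if (status is None or row["status"] == status)
        and (language is None or row["language"] == language)
    ]


rows = load_submissions(SAMPLE_METADATA)
status_counts = Counter(row["status"] for row in rows)
accepted_python = filter_submissions(rows, status="Accepted", language="Python")
print(status_counts)
print([row["submission_id"] for row in accepted_python])
```

The same filter could select only failing submissions, for example to pair a wrong answer with an accepted solution to the same problem.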

A

Furthermore, with all this information, we also provide different tools and examples to help you get started with the dataset. One tool produces statistics from the dataset, which is what you have seen before.

A

You can also get a subset of the dataset using our tools, and you can convert the dataset into different data formats. By default it is just code files, but you can also convert them into a stack of files or put them into a text file. We also provide different kinds of pre-processing from source files to data.

A

Depending on what kind of model you want to train, you might want different pre-processing. We provide some simple tools: a tokenizer to generate a stream of tokens; AST generation if you want to build a traditional abstract syntax tree; and, if you want control flow and data flow graphs, code analysis for that.

A

We have also put out a few initial experiments that anyone in open source can try. There are some simple GNN experiments that people can train in their deep learning frameworks.
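
As a rough illustration of the two simplest pre-processing steps mentioned above, tokenization and AST generation, the sketch below uses Python's standard `tokenize` and `ast` modules on a Python source string. This is not the CodeNet tooling itself; it only shows what those two representations look like for one language.

```python
import ast
import io
import tokenize

SOURCE = "def add(a, b):\n    return a + b\n"

# Token types that carry layout only and are dropped from the stream.
_LAYOUT = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.ENCODING,
}


def token_stream(source):
    """Return (token type name, token text) pairs for a Python source string."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokens
        if tok.type not in _LAYOUT
    ]


def ast_node_types(source):
    """Return the node type names found while walking the parsed syntax tree."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


print(token_stream(SOURCE))
print(ast_node_types(SOURCE))
```

A token stream like this is the usual input for sequence models, while the tree form is a starting point for graph-based models.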

A

We also created a simple masked language model, so you can get started with masked language modeling and build on top of it with different models, and we created token-based similarity classification as well.

A

For two of these experiments, the masked language model and the token-based similarity classification, we also have a very simple notebook that helps you train a very small model step by step, just to see how it works.

A

The notebooks guide you from pre-processing to producing your model and testing it, for the masked language model and the language classification that we have built. With all this information and tooling, we are aiming to expose the potential use cases of these datasets. AI for Code could help developers do code classification, to know what language a piece of code is written in, and code summarization and search.

A

When you want to identify a particular problem, you can search based on topics. Then there is source-to-source code translation: when you migrate from legacy code or older versions of the code, it can automate some of the process, so developers spend less time on migration. And lastly, and more importantly, we want to help developers write better and faster code.

A

One of the techniques is using natural language to generate code. You type something like "I want to iterate over an array from two to five", and it generates a function for you to do that. It can also improve the performance and memory footprint of existing code.

A

It will analyze the code, tell you the runtime complexity, and suggest better solutions. And, more importantly, it can help us find errors, debug existing code, and generate code tests, which makes the code more robust in the long term. We already see multiple existing applications out there.

A

They are using this open source dataset. In our IBM AI for Code stack in IBM Research, we use it for our code research, but we also see that DeepMind's AlphaCode uses CodeNet as one of its training datasets, and with it they were able to reach something like 50 to 60 percent accuracy relative to human programmers.

A

So let's dive into one of the use cases that we have worked out in IBM Research.

A

Last week, IBM Research announced a collaboration with Red Hat to create a new project called Project Wisdom. The core concept of Project Wisdom is to generate Red Hat Ansible pipelines from plain English.

A

It is a use case where you provide simple English as context and it creates the automation and infrastructure as code. Our main focus right now is to build the foundation models. While keeping the accuracy high, we want to figure out how to reduce the number of parameters, so that the new model can run and be retrained in a decent amount of time.

A

As you can see on this page, when someone puts down a simple text about installing the nginx and Node.js 12 packages on Red Hat, it is able to generate a script, infrastructure as code in Ansible, that runs the package installation. This is a very useful tool for anyone who just wants to do a simple task but does not have any knowledge of infrastructure-as-code scripts.

A

This could play a pretty big role when someone wants to do automation without relying too much on automation teams.

A

But with all this information, once we have done all this research and put the tools out in open source, how do we actually share our findings within our teams or within the community?

A

For this, last year we announced a project called the Machine Learning Exchange. We worked with LF AI & Data and proposed it as an LF AI & Data sandbox project, and it is now hosted on LF AI infrastructure as well. The key concept is to provide a data and AI asset catalog.

A

Furthermore, we also integrated some execution engines, so those who want to try the assets out and see how they execute can do that in their own clusters.

A

Here is a high-level view of what the Machine Learning Exchange contains. Its main purpose is to show all these catalogs. We have different kinds of data and AI assets: pipelines, components, models, datasets, and notebooks. Those are the core assets that data scientists need to build and share across teams.

A

The next step we see after the catalog is that data scientists want to try things out and run a simple experiment. For that we leverage some of the execution engines available in open source, so they can do a simple run and determine which tools they want to use.

A

For the pipeline engine, we leverage Kubeflow Pipelines, and because we run on OpenShift, we use the Tekton version of Kubeflow Pipelines, so we can run on the OpenShift-approved version of the Tekton runtime. For the serving engine we chose KServe, and we also have an option to deploy as plain Kubernetes deployments.

A

KServe is a very popular project for deploying serverless models on top of Kubernetes. For datasets, we use a project called Datashim, which helps you host your datasets in your local cluster and use them within your cluster nodes. Lastly, with all these execution engines, we are able to fine-tune our data and AI catalog metadata and review that metadata based on an open source spec called MLSpec.

A

With this, let me show you the default catalog we have on the Machine Learning Exchange in open source. We have some sample pipelines and components that you can get started with.

A

More importantly, we also have different models, datasets, and notebooks that show what kind of data and models people have trained and how to get started with a model using a notebook. You can see that some datasets, like Project CodeNet and the IBM Debater dataset, are on the Machine Learning Exchange as well.

A

With this, let me go to the Machine Learning Exchange demo. This is hosted on a public website at LF AI, so you can go to ml-exchange.org.

A

There you can see the list of catalogs hosted on the Machine Learning Exchange. This is only a catalog page, because on the public server we do not enable the runtimes, but you can see what kind of datasets you can download and what kind of models you can use. Let's look at the list of datasets.

A

You can see Project CodeNet and different subsets of it, such as Project CodeNet Language Classifier. If we click on, say, Project CodeNet, there are different selections. The full dataset contains several gigabytes of files, so you might not want to download it all in one go. We also have different data selections for when you just want to view the metadata, or only need a benchmark for Python, Java, or C++.

A

We have those selections available for preview here already. If you want to see how this metadata is built, and want to upload and publish your own dataset in open source, this is a preview of how the metadata is stored in YAML.

A

It is similar with the models. We have different kinds of models that you can run in containers or just upload as model files. For example, we have this CodeNet language classifier. Let's click into it. Here we specify which dataset was used. This model uses CodeNet, so the data is under CDLA Permissive v2, and for the model weights and the model code we made sure they are under the Apache 2.0 license.

A

So when you use this model to test or play around, you can be sure that it is covered by this open source license as well.

A

Similarly, with the pipelines, we leverage Kubeflow Pipelines behind the scenes, and you can see how to build a simple pipeline by taking multiple tasks and joining them together. When data scientists build a full-blown pipeline that does different kinds of data pre-processing and training, including distributed training, it can also be built into a single pipeline, uploaded here, and shared with different data science teams.

A

Each of those pipeline components can also be shared in the components category. This category shows how each component is built, and you can connect them into a whole pipeline. Lastly, the notebooks are where you go if you want to see some examples and how they were executed. For example, you could pick the Project CodeNet masked language model.

A

You can see where this notebook is hosted on GitHub, but you can also leverage our internal backend engine, which renders the notebook so you can preview it and stay on this page.

A

For this masked language model example, you can see how to take a subset of the data from CodeNet, prepare the data, do some tokenization, and create a model. At the very end, it is able to train using just CPUs within an hour. For this example we made it very small, so we train for just five epochs and then show some of the evaluation.

A

Let's say you want to predict the masked words: the notebook shows how to compute the top-five accuracy using a simple masked language model. This is an example of how you could start learning and using the dataset.
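
The two ideas in this notebook walkthrough, masking tokens and scoring predictions by top-five accuracy, can be sketched without any model at all. The code below is an illustrative assumption of how such an evaluation might be wired up; the `[MASK]` string, the 15 percent masking rate, and the prediction format are not taken from the notebook.

```python
import random

MASK = "[MASK]"  # placeholder token; the actual notebook may use another


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with MASK.

    Returns the masked sequence and a dict mapping each masked
    position to the original token at that position.
    """
    rng = random.Random(seed)
    masked, labels = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels[position] = token
        else:
            masked.append(token)
    return masked, labels


def top_k_accuracy(labels, predictions, k=5):
    """Fraction of masked positions whose true token is in the top k guesses.

    `predictions` maps a position to a list of candidate tokens,
    ordered from most to least likely.
    """
    if not labels:
        return 0.0
    hits = sum(
        1 for position, token in labels.items()
        if token in predictions.get(position, [])[:k]
    )
    return hits / len(labels)


tokens = ["for", "i", "in", "range", "(", "n", ")", ":"]
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked, labels)
```

In a real run the ranked candidates would come from the trained model's output distribution at each masked position.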

A

With this, if you have found the kind of models or datasets you want to work on, and you want to try them out on the Machine Learning Exchange, you can also deploy it on your own cluster and run them there using our integrated runtimes. For example, I have one instance deployed on my development cluster, and I have imported the same data catalog here.

A

Let's say I go to the CodeNet language classification model. In my own cluster I have enabled the execution runtimes, so I am able to launch this model right on my Kubernetes cluster and try it out. Let's say I just want to launch this model as a plain Kubernetes deployment container and test it out: I simply click launch. What happens behind the scenes depends on how you train or serve the model.

A

You can have a preset pipeline that gets the data and trains the model, but for this example we just have a simple pipeline that gets the model information and deploys it on our cluster using the latest image that we have registered.

A

Once the pipeline is finished, it shows us whether the deployment is available. I configured this pipeline to deploy with a NodePort on Kubernetes, so with the node IP and the open port we are able to try it out. The NodePort we have is 31174, so I go there. Behind the scenes, this pipeline exposes the model as a REST API server.

A

So you can use this Swagger UI to call it. This model is extremely simple: the API only has two methods, GET and POST, and you can try the model out just by sending it some files. For example, let's say we submit a Python file here and try it out. It produces the top three predictions for the language classification.

A

You can see that Python has the highest score, at almost 70 percent. So you can see how this model has been built, and hopefully it gives you some ideas on how you could leverage CodeNet and this model to enhance your user experience and build tools that improve your developer cycles for migrating code, building code, and completing code.
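
The top-three output shown in the demo can be thought of as a ranking over per-language probabilities. The sketch below is only an illustration of that final step: the raw scores are made up, and the shape of the model's actual REST response is not shown in the talk, so nothing here reflects the real API.

```python
import math


def softmax(scores):
    """Turn raw per-label scores into probabilities that sum to 1."""
    highest = max(scores.values())
    exps = {label: math.exp(value - highest) for label, value in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def top_k(probabilities, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical raw scores for one submitted file.
raw_scores = {"Python": 2.0, "Ruby": 1.0, "Java": 0.5, "C": -1.0}
probabilities = softmax(raw_scores)
for label, probability in top_k(probabilities):
    print(f"{label}: {probability:.2f}")
```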

B

A

With this, I am going to summarize what we have discussed today. The main point is that we have open sourced Project CodeNet, a high-quality and very big dataset that we created, and we have posted it in open source.

A

We want to see the open source community leverage it, build on it, and give us feedback. Hopefully, in the next five years, we will have very good tools to help us migrate different kinds of legacy code and to write code in new ways, instead of the current way where we have to go to, say, Stack Overflow to find solutions. That is the goal we aim for. Once we see that progress, we will be able to take some of those open source models and put them into production AI system stacks, and with production AI systems we are able to develop business value.

A

For example, we can use those future models to do automatic code translation and modernize legacy code with very minimal effort. With this, thank you very much for attending. Does anyone have any questions?

A

Yes, please.

A

Yes. In terms of public announcements, I think one of the latest announcements is Project Wisdom. We worked with the Red Hat team to build new tools that do code generation for automation and infrastructure as code. We hope to help developers do automation just by writing plain English text that describes their workload, and it builds an Ansible pipeline for them.

A

That is one of the use cases. Another customer use case we have done is modernizing old Java code into new Java code. Those are the use cases we currently have, but we also envision more use cases in open source. Hopefully people will give feedback, and we will be able to use that feedback to enhance and improve this dataset over time.

A

Sure, yeah, go ahead.

C

A

Yeah, let me repeat the question. I think the question is about how we approach different institutions, like banks, to migrate old legacy code like COBOL to modern technologies. At this moment, for the research and open source team, we just want to explore different kinds of use cases.

A

A lot of the work we have done is more on the research and development side. When it comes to clients, we do have a client engineering team that works with different clients, but we do not have any particular data that we can share at this moment.

A

Maybe someone in the back, yeah.

A

Yeah, I think the question is how we envision plugging this dataset in to deliver a proposition and value to real customers. I think we are still at an initial stage. As we see in different AI roadmaps, when a new dataset or a new method is created, it usually takes several years to reach more real-world, more sustainable use cases.

A

So right now, I think our main commitment is to show what kind of data we have and to open source this dataset to the community.

A

That way anyone can use it without any risk, and from there, as a research and open source team, we want to gather feedback from open source. As for particular customer requirements, at this moment I think that depends on whichever company you work with and their internal feedback.

A

On the open source side, we want the community to show us the different research papers and open source use cases that get posted, for example on developer blogs, and we will use that feedback to improve and build new kinds of models that leverage the dataset.

A

Yes. Are there any other questions? If not, then, yeah.

C

C

C

A

Right. I think right now we do have 55 different languages in our initial dataset. As you can see, even for COBOL, as we described here, we have at least 100 solutions for 80 percent of the problems.

A

In our initial investigation, we can see that a lot of the problems are very simple. But as we discussed, this dataset is more for people to explore different kinds of use cases.

A

I think COBOL is something we are still working on, to see what kind of performance we can get. We have already seen a lot of good feedback from the research team, and they are already producing those kinds of examples. So I would say that, at least for code generation, we can see some good results at this point, and we want to see the community take it to the next level, to more advanced code translation and code completion.

A

Are there any questions?

A

Yes, please.

D

A

I would say it is more or less similar. With OpenAI, as you mentioned, they leverage similar concepts, but because OpenAI is proprietary, we do not know what kind of models they use behind the scenes, so we cannot really say. But the concept is similar.

A

What we envision here is this: there has been no good open source dataset that is tested and has verified test cases. You could scrape code from GitHub, but then you have to have someone do the labeling and test everything for you. So this dataset is aimed at filling that gap.

A

Just like ImageNet, you can use it to create a benchmark for the models you build. When you create a new use case, you can use this data as a benchmark to test whether your generated code matches the code we have in our dataset, and use that to improve your model's accuracy.
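
One way to read the benchmarking idea above is to run a generated program against known input and output pairs and report the fraction that pass. The sketch below assumes the candidate is Python source and that test cases are plain stdin/stdout text; the sample problem and test cases are invented for illustration, and running untrusted generated code like this would need a sandbox in practice.

```python
import subprocess
import sys


def run_candidate(source, stdin_text, timeout=5.0):
    """Run Python source in a fresh interpreter and capture its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout


def pass_rate(source, test_cases):
    """Fraction of (stdin, expected stdout) pairs the candidate satisfies."""
    passed = 0
    for stdin_text, expected in test_cases:
        try:
            code, output = run_candidate(source, stdin_text)
        except subprocess.TimeoutExpired:
            continue  # treat a timeout as a failed test case
        if code == 0 and output.strip() == expected.strip():
            passed += 1
    return passed / len(test_cases)


# Hypothetical problem: read two integers and print their sum.
TEST_CASES = [("1 2\n", "3\n"), ("10 5\n", "15\n"), ("0 0\n", "0\n")]
GOOD = "a, b = map(int, input().split())\nprint(a + b)\n"
BAD = "a, b = map(int, input().split())\nprint(a - b)\n"

print(pass_rate(GOOD, TEST_CASES), pass_rate(BAD, TEST_CASES))
```

Comparing outputs rather than source text lets a generated solution count as correct even when it differs from every stored submission.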

A

Are there any other questions?

A

If not, thank you very much for attending.