From YouTube: Q&A with Brendan Dolan-Gavitt on AI Code Suggestions
Description
This is a Q&A with Brendan Dolan-Gavitt, the author of FauxPilot, in which GitLab's Product, Engineering, and Incubation team asks questions about AI code suggestions, the future of smart code suggestions, and large-scale models.
A
All right, hi everyone. We are at a Q&A with Brendan Dolan-Gavitt — I hope I'm not butchering the name — in reference to our work with GitLab AI Assist, as well as FauxPilot and anything to do with smart, secure code suggestions. I'm going to start off by saying what Brendan would probably say in his own words about who he is: Brendan is an assistant professor in the computer science and engineering department at NYU. He can ultimately be found posting pictures of cats and making very bad jokes on Twitter at — actually, how do you say it?
B
Sure. This is just "moyix" — this is one of those things where you chose a username in high school and it's been with you ever since. So yeah, thank you very much for the introduction.
B
Sure, yeah. So I guess FauxPilot kind of came about because I had been using GitHub Copilot and thought it was very helpful in my own programming. But as researchers, my lab wanted a way — if we train our own model or fine-tune our own model for different purposes — to actually use it in something that looks like a GitHub Copilot kind of use case. At the same time, whenever I saw people discussing things like Copilot online, one of the things they were concerned about is the fact that you have to send your code up to a remote server hosted by GitHub to actually get suggestions: it sends the code that you're currently working on, and maybe code that is in other files open in your editor, so that it can get code suggestions.
B
There was a lot of interest in being able to run your own locally hosted version of it, and around — I think this was sometime in maybe January or February — Salesforce released these fairly high-quality code models that were open in the sense that you could actually download them and run them locally yourself. They had reported, at least in their own evaluation, that some of the models, like the 16-billion-parameter Python model, had comparable or better performance on standard code benchmarks like HumanEval or APPS for solving programming tasks. So yeah, that was essentially it. A lot of the research I'm doing these days focuses on: okay, here are these code models, and it seems like a lot of people are going to be using them over the next few years.
B
I think I saw a stat the other day that GitHub Copilot got 400,000 new users in its first month, which is a lot of people suddenly using AI code suggestions. So we want to do research on: how can we make sure that the suggestions it produces don't make your code worse, and ideally, can it suggest things that are more secure than what you might be inclined to write by hand — questions like that. So I guess I see it as kind of a combination: this is our research platform for running future studies.
B
We've already done one user study where we gave half of the users access to the code model and half the users wrote things by hand. We didn't use FauxPilot for that one, because we ran it before FauxPilot existed, but we're planning on doing more of those in the future, and we're going to try to use FauxPilot for them, because it gives us very fine-grained control: we know the exact model that they used, and we know what it was trained on. We can use models that were fine-tuned on particular code bases, or only trained on code that we've maybe tried to scan for security vulnerabilities, or things like that. So yeah, that was kind of the main thing. I also don't want to oversell it: it is a project that I put together over the course of maybe a month or two over the summer, when I didn't have to teach, and the main components come from other, more mature projects — NVIDIA's Triton inference server and the FasterTransformer library are what we're using to get this very nice low-latency inference. I guess that's the main thing, so yeah — hopefully that clarifies how this thing came to be.
B
Yeah, cool. So if we want, we can keep going through the questions.
A
Yeah,
we
can
come
back
into
that
as
well.
We
can
skip
to
probably
buy,
which
would
be
my
question
and
we've
spoken
about
it,
but
I
think
we
are
looking
into
building
out
of
a
POC
using
full
palette.
So
what
would
you
think
we
would
need
to
consider
in
actually
using
full
pilot
and
testing
it
with,
let's
say
even
internally,
with
our
gitlab
audience
of
developers
with
writing
in
different
languages
with
RoR,
BJs
Python
and
all
of
that
and
also
keep
in
mind
compute
storing
what?
B
Yeah, I mean, I think some of the things you've already mentioned in your question, like load balancing and batching, will probably be pretty important. Batching is a little bit tricky, because you have to have enough people using it that it makes sense to combine multiple requests: if you have to wait, say, 30 seconds for the next request to come in before you try to batch them together, then all of a sudden your latency is 30 seconds for that first user. But assuming you have a lot of requests coming in simultaneously, batching them together makes a lot of sense.
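
As a concrete sketch of that tradeoff, here is roughly what a dynamic batcher has to do: collect requests up to a batch size, but cap how long the first request waits. This is an illustrative sketch, not FauxPilot code — Triton's dynamic batcher exposes the same idea as a configurable maximum queue delay.

```python
import queue
import time

def collect_batch(requests: "queue.Queue", max_batch: int = 8,
                  max_wait_s: float = 0.01) -> list:
    """Gather up to max_batch requests, but never hold the first request
    longer than max_wait_s, bounding the latency cost of batching."""
    batch = [requests.get()]  # block until at least one request arrives
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```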
B
As long as the batch fits in GPU memory alongside the model, that is. Let's see — other things that it would make sense to do: load balancing is definitely useful, and I think that's something you can do pretty trivially, just by running multiple servers and having either a front-end proxy that redirects to one of them, or by changing the little Flask app to redirect to one of the inference servers. Either of those should work pretty well. The models don't have to store any state — they're all kind of one-shot: you give them the context, and then they produce a suggestion. So there's nothing where, if a user was making multiple requests, they would always have to go to the same server. It's really a very nice use case for load balancing. Going beyond that, you can start to think about what can be done to make things faster and more performant, and hopefully have lower resource requirements.
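
A minimal sketch of that front-end-proxy idea, assuming two identical FauxPilot inference servers behind it — the hostnames and the OpenAI-style endpoint path here are assumptions for illustration:

```python
import itertools
import requests
from flask import Flask, Response, request

app = Flask(__name__)

# Round-robin over a pool of stateless, identical inference servers
# (hypothetical hostnames).
backends = itertools.cycle([
    "http://inference-1:5000",
    "http://inference-2:5000",
])

@app.route("/v1/engines/codegen/completions", methods=["POST"])
def completions() -> Response:
    # Every completion request is one-shot, so any backend can serve it;
    # no session affinity is needed.
    upstream = requests.post(
        next(backends) + "/v1/engines/codegen/completions",
        json=request.get_json(), timeout=30)
    return Response(upstream.content, status=upstream.status_code,
                    content_type=upstream.headers.get("Content-Type"))
```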
B
With quantization, you can actually convert the weights to 8-bit integers and do the same inference — you don't lose any quality in the output, but it runs faster and uses less memory. And very, very recently — like three days ago — one group even managed to get it down to four bits, so 4-bit quantization. At that point you can run the 16-billion-parameter model in just eight gigs of GPU memory, because four bits means you're only storing half a byte per parameter, and that could potentially really speed things up. So I think there are lots of ways to make this even faster and use fewer resources.
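
The arithmetic behind those memory figures is simple — note that he corrects the fp16 number for the 16-billion-parameter model to 32 GB a little later in the conversation:

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """GPU memory needed just to store the model weights, in GB."""
    return n_params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB

n = 16e9  # 16-billion-parameter model
print(weight_memory_gb(n, 16))  # fp16:  ~32 GB
print(weight_memory_gb(n, 8))   # int8:  ~16 GB
print(weight_memory_gb(n, 4))   # 4-bit:  ~8 GB
```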
B
For a given model, if it's a 16-billion-parameter model, you need basically 16 gigabytes of GPU memory just to store the model.
A
Yeah, thanks for that, Brendan. I think Fred is back — we just skipped your questions, right? If you wanted to take up number four.
C
Yeah, sorry about that. I was wondering if you have some metrics and insights into usage.
B
Yeah — I don't know if I can... yeah, I think I can do screen sharing, and I can just show you the traffic page, both for the GitHub repository and for the Hugging Face page. ...That is the wrong one, sorry — I think it's this one. There we go. So this is the page on Hugging Face where the actual models that it downloads are stored. I think people who are just trying it out might be using the really small models, so we get some downloads there for the Python-only version and the multilingual version, and that's a pretty decent proxy for whether people are actually using it. Almost no one is using the natural-language versions of these — I think they're not even exposed in the setup script, and there aren't very many people trying to use FauxPilot to help them write English. Over on the traffic page — it's a little bit annoying that GitHub only gives you the last month of usage stats here — it looks like it's settled into around 100-ish clones per month, and then more people than that at least sort of checking it out and having a look at it.
E
A quick question while you're on Hugging Face. My job as a product manager is speaking with our customers, and I've not run into any customer that uses Hugging Face today. I suspect that's largely due to enterprise use cases, but what's your take on Hugging Face?
B
They host the models for free, so that was kind of very attractive. As far as what I think of them as a company — I like a lot of the things they do. They have made it a lot easier to train models and to host a bunch of fairly different models for inference. That said, I guess I'm still a little bit confused, if they're a company, about how they are planning to make money. But that's, I guess, their problem rather than mine. I think plausibly what they're doing is something like: they have a bunch of people who know a lot about AI and about building and training models, and they may be doing a sort of contracting thing, where they say, oh, you can hire us to help you train or deploy models that you have or are interested in having.
A
Oh, we'll move on — I think it would then be back to me and Alexander on the models. We would love to know a little bit more about the models themselves. Obviously we have four sizes, all the way up to the 32-gigabyte one, and we'd like some more insight as to where the model works well — in what areas — and in what areas it wouldn't. Then I think Alex and I have a few more questions on how we would go about taking a model and optimizing it — what would that look like? So yeah, sure.
B
For the smallest one, the code suggestions it gives are pretty bad, and then it goes all the way up through two billion, six billion, and 16 billion parameters. I think I misspoke earlier when I said the 16-billion-parameter model would take just 16 gigs of RAM — it would actually be 32 gigs, because the weights are 16-bit floats. The two flavors are the multilingual one, which I believe was trained on C, Python, Java, and some other languages, and the Python-only one.
B
But that one was trained, I think, on a larger set of languages, and it uses GPT-NeoX — so that might be another one that could easily be added, because, again, it uses FasterTransformer: you can convert their model into FasterTransformer and use it. So that one might be a good one to add in the very near future.
B
That would let you then run any of those models as well. Those would include things like — Facebook has this one called InCoder; Hugging Face itself has one called CodeParrot, which is focused on Python; and they are currently — what's that? Yeah, PolyCoder is the one, yeah — PolyCoder is the GPT-NeoX one. I think that could be — that would be like half a day or something like that to add to FauxPilot as it is. Hugging Face is also gearing up right now to train another large open code model that they're calling BigCode, and I have been working a bit with them on that.
B
Just on questions like what kind of training data, what sort of model, and things like that. That one, I think, is probably not going to be released for a couple more months, but when it is, it would be a very nice one to be able to add as well — in particular because it's going to support some things that are really useful for actually using this in an IDE. It's going to be able to do things like this fill-in-the-middle task, where you give the model not just the code up to the cursor, but the code before and after the cursor, and then ask it what the best way to fill in the code in between is. That's really helpful, just because a lot of the time, if you're editing a file, you're not writing it from scratch top-down — you're making changes to what's already there, and the changes you're adding have to be consistent with what comes before and what comes after. So I guess — yeah, are there other things that you wanted to know about the languages and models?
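
To illustrate what a fill-in-the-middle prompt looks like: the sentinel-token names below follow the convention used by FIM-trained code models and are an assumption here, not something the current FauxPilot models support.

```python
# Code on either side of the cursor is wrapped in sentinel tokens, and the
# model is asked to generate the missing middle (hypothetical token names).
prefix = "def mean(xs):\n    total = "
suffix = "\n    return total / len(xs)\n"

prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
# A FIM-trained model would ideally complete this with: "sum(xs)"
```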
A
No, I think that's good — I'm good for now. I think, Alexander, you're next with the questions on the model.
D
So I saw that, to get recommendations, we need to provide several — let's say — hyperparameters. One of them is temperature; another one is max tokens. So I guess the question was: is there a way, maybe, to tune these hyperparameters automatically for each project, or how should we set them?
B
Yeah, I mean, right now this is very much more art than science. People have mostly been going by rules of thumb — you know, we want sort of low temperatures for code generation — so I don't know that there are exact ways to derive this. I have mostly not found it to be something that I've wanted to change on a per-project basis, but I think you might want to, say, take some kind of standard benchmark and do a bunch of generations at different temperatures to decide at least what the default is. There are benchmarks available, certainly for Python and I think a few other languages — the sort of smallish little programming problems that can be used for evaluation — and you just basically generate code and then run a bunch of tests to see if the code worked. As far as the number of tokens —
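
A minimal sketch of that kind of temperature sweep, assuming an OpenAI-compatible client pointed at a local FauxPilot server — the endpoint, engine name, and toy benchmark here are placeholders:

```python
import openai

openai.api_base = "http://localhost:5000/v1"  # assumed FauxPilot endpoint
openai.api_key = "dummy"                      # no real key needed locally

def pass_rate(temperature: float, benchmark: list) -> float:
    """Fraction of benchmark problems whose generation passes its tests."""
    passed = 0
    for prompt, run_tests in benchmark:  # run_tests: callable(str) -> bool
        completion = openai.Completion.create(
            engine="codegen", prompt=prompt,
            max_tokens=256, temperature=temperature)
        passed += run_tests(completion.choices[0].text)
    return passed / len(benchmark)

BENCHMARK = [
    ("def add(a, b):\n    return ",
     lambda code: "a + b" in code),  # toy stand-in for running real tests
]

# Pick the default temperature that scores best on the benchmark.
best = max([0.1, 0.2, 0.4, 0.8], key=lambda t: pass_rate(t, BENCHMARK))
```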
B
The way that GitHub Copilot does it is, they always ask for a fixed 512 tokens, and the reason this doesn't take ages to generate is that they change the stop sequence. A stop sequence just says: as you're generating, if you see this in the output, then you're done generating and you can return immediately. The way they do it is, most of the time it operates in a kind of one-line-at-a-time mode, where the stop sequence is just a newline. So as you're typing an individual line, maybe it only has to generate something like four tokens to complete that line. Then, I think, if you sort of stop and wait a little bit, it kicks into this "oh, maybe we should try to generate a larger amount of code" mode, where it sets the stop sequence to something that looks like the end of a function or the end of an if statement or something like that — which is more language-specific — and then it will try to generate that whole block of code. I think that certainly makes a lot of sense, particularly in terms of performance, because much of the benefit that I get from Copilot is the one-line completions as I'm typing, and you want those to be very, very fast — so fast that you can even see them change as you type one character at a time, as it goes through different completions. So I think that does tend to work quite well if you are just trying to generate one line.
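
For example, a one-line-at-a-time completion request of the shape he describes could look like this against FauxPilot's OpenAI-compatible API — the endpoint and engine name are assumptions:

```python
import openai

openai.api_base = "http://localhost:5000/v1"  # assumed FauxPilot endpoint
openai.api_key = "dummy"

resp = openai.Completion.create(
    engine="codegen",
    prompt="import os\n\n# list all files in a directory\nfiles = ",
    max_tokens=512,   # ask for a generous budget...
    stop="\n",        # ...but stop at the first newline, so single-line
                      # completions return after only a handful of tokens
    temperature=0.1,
)
print(resp.choices[0].text)
```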
D
So what are the requirements for the inputs to get recommendations? Should it be, I don't know, a good block of English, Chinese, or another language? Or can you just pass it, let's say, a Python code block and it will be completed — or something else?
B
Yeah — so, generally speaking, you just pass it the code you already have, and if users want to instruct it in English, the way you do it is by just writing a comment. I have actually found — and this may be a little bit surprising — that my code is much better commented now, because I write a comment saying what I want the model to fill in, and then it fills in based on that comment.
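
For instance, a prompt that ends with a descriptive comment — a made-up example, not one from the talk — steers the model toward the matching implementation:

```python
# The prompt sent to the model ends with a comment describing the intent...
prompt = (
    "import csv\n"
    "\n"
    "# read points.csv and return a list of (x, y) float tuples\n"
    "def read_points(path):\n"
)
# ...and a good completion fills in the body the comment asks for, e.g.:
#     with open(path) as f:
#         return [(float(x), float(y)) for x, y in csv.reader(f)]
```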
B
So yeah, that's the usual scenario. I think it is not very good with languages other than English in terms of comments, just because there's not nearly as much non-English comment data in the data set.
A
You also said, basically, that it fills in based on what you've written at the beginning — and not necessarily if the cursor is in the middle?
B
Right. And I think the main limitation here is that the model supports 2048 tokens at a time, and that includes the input plus whatever it needs to generate. That means if you are asking it to generate up to 512 tokens, you're left with around 1500 tokens of input, and so often — particularly in a larger source code file — you're not going to be able to fit the whole thing into the context. So the simplest strategy is to just give it the most recent 1500 tokens. You can get more sophisticated about this, because sometimes there might be extra context that you want to include. Assuming you know a little bit about what language they're using, you can go up to the top of the file and look for import statements, and then have the prompt be: here are my import statements, and then here's as much of the code as I can fit — so that it has things like library definitions or structs or things like that. You can even extend this by saying: okay, I'm writing C and they include foo.h, so I'll go to foo.h and try to pull in some declarations.
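
A rough sketch of that strategy for a Python file — the 4-characters-per-token estimate stands in for a real tokenizer, and the budget mirrors the ~1500-token figure above:

```python
def build_prompt(source: str, budget_tokens: int = 1500) -> str:
    """Keep import lines from the top of the file, then spend the rest of
    the budget on the most recent code before the cursor."""
    imports = [line for line in source.splitlines(keepends=True)
               if line.startswith(("import ", "from "))]
    header = "".join(imports)

    # Crude stand-in for a real tokenizer: assume ~4 characters per token.
    char_budget = budget_tokens * 4 - len(header)
    tail = source[-char_budget:] if char_budget > 0 else ""
    return header + tail
```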
B
Right — I mean, I think this is something that, for now, I have not done very much on: the client side. This would all be done within, say, a Visual Studio Code plugin or an IDE plugin, and I have not done very much with that. I know that the GitHub Copilot plugin does do a lot of this, because I see it pulling in pieces of other libraries and other files that I have open. There's a decent bit of implementation work there, particularly if you want to support a bunch of languages, because you have to know that, okay, in PHP it's require, and in Python it's import, and in C it's include — and then you have to know where to find those dependencies.
D
Let me ask you: how was the model tested? I mean, maybe several use cases were collected, just to understand in which cases the model performs better or worse — let's say, I don't know, unit test generation, or a web service, or something else?
B
Right — so right now, tests are a wish-list item. But as far as evaluating the quality of the model output, there's the paper that Salesforce released — it's called "A Conversational Paradigm for Program Synthesis" — where they did a full evaluation of all of these models on a bunch of benchmarks, and that's what I've been relying on for my general sense of which model to use. Somewhat unsurprisingly, the answer is that you should use the biggest model that you have, or that you can feasibly run, and if there is a version of it trained on your specific language, you should use that one in preference to one that tries to support many languages at once. Aside from that, just using it interactively, I've found that it's pretty decent — it's much better with Python than it is with, say, writing C code, and I think that makes sense just given how much Python code there is out there. But I think it would be really nice to have some automated harnesses for checking, even just: does the model still work to generate code at all? Can we run it on a small benchmark and make sure we haven't regressed on the quality of the recommendations?
B
Right, yeah — it's absolutely possible for it to produce suggestions that don't compile, because, for example, maybe it refers to a variable that doesn't actually exist in your program. Generally speaking, it's not going to make mistakes like forgetting a semicolon or doing something obviously syntactically wrong. The mistakes are really more of the form: well, it doesn't have any context, so it doesn't know what the field name of this data structure is, so it'll just make one up — and sometimes it guesses right, and sometimes it guesses wrong. But that's where this kind of prompt engineering comes in, where you try to figure out: okay, what do I need to show it so that it can give me reasonable suggestions?
A
Cool
I
think
on
that
also
I'm,
also
conscious
of
time,
so
we're
gonna
go
through
the
questions
of
people
who
are
also
not
in
call
Dinesh
similar
to
the
testing.
How
would
you
compare
the
usefulness
of
full
Pilot's
suggestion
with
respect
to
co-pilot
ones,.
B
Yep. So I think Copilot is certainly currently better at producing code. Some of that comes from the fact that I think they are using a larger model — it might be a larger model, but it's certainly trained on more data. I think they have trained it on basically all the code they could get their hands on, and as a result it is pretty good at generating suggestions. That said, FauxPilot has been perfectly fine for writing the kind of code that I usually write, which is like: okay, I'm in Python, I want to read in a bunch of data from somewhere, do some analysis on it, and create some visualizations and graphs. That's the kind of stuff I do most often, and it works very well for that. Unfortunately, I don't do a lot of stuff like writing web apps in JavaScript or things like that, so I have less direct experience there.
B
Yeah — so there are sort of two ways you could think about this. One is: what do you put into the prompt to generate the output? And maybe this is covered in a different question, but I think the other strategy is that you could take a model and try to do what's called fine-tuning, where you train it on additional data for a much shorter amount of time and with a lower "learning rate" — which just controls how dramatic the changes to the model are at each training step. Fine-tuning these models is possible, which is something that is not possible with Copilot, because you need to have the actual weights of the model available to do fine-tuning, and the weights for Codex and Copilot are not available. That said, fine-tuning is not totally trivial. Data-wise, you really just make a data set of your code in JSON form — it's a file where each line is a new JSON dictionary whose "text" key is the contents of one of your source code files — and then you can pass that to a standard script they have, and it will fine-tune the model on the code you gave it.
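
In other words, a JSON-lines file along these lines — a minimal sketch; the exact key name expected by the fine-tuning script may differ:

```python
import json
import pathlib

# One JSON object per line, each holding the contents of one source file.
with open("train.jsonl", "w") as out:
    for path in pathlib.Path("my_repo").rglob("*.py"):
        record = {"text": path.read_text(errors="ignore")}
        out.write(json.dumps(record) + "\n")
```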
B
As a data point, we fine-tuned the 16-billion-parameter model on a data set of Verilog code, which is used for CPU design and hardware design. That was a 400-megabyte code data set, and it took three A100 GPUs, each with 80 gigabytes of VRAM, about six days to fine-tune on it. So yeah — the computational requirements are definitely not at the level of "I can take my souped-up gaming desktop and fine-tune a big model on it."
A
Yeah, I think that actually even answers the next question, on how feasible it is to fine-tune — so thank you for that. And now to Taylor.
E
Kite kind of seems to be falling off too, so I'm just curious about your thoughts on the industry in general.
B
Yeah — I mean, I think it is only going to get bigger. The other one that is available now is Amazon CodeWhisperer, which is yet another one — I have not even had a chance to play with it at all yet; I've just read a blog post about it. And we do actually know the Tabnine folks pretty well; we've been talking a lot with them about collaborating on some things, like user studies with their users, and figuring out whether, if we've trained models that try to produce more secure code, we can have them deployed with Tabnine to see if that helps their users as well. That said, I have not personally used Tabnine, so I don't know as much about how it compares in terms of code quality. I do think their approach has been to use much smaller models, and then try to make up for that by training on your own code and by using language-specific models — so I think they have a smaller model, but one that's been trained specifically on Java, and another that's been trained on Python, and another that's been trained on C, and that's sort of how they get around it. But, anecdotally, I have heard that the quality of the output it gives you is not up to what Copilot currently does, and I can believe that, because the Codex model is very, very good. So yeah — I'm definitely not the person to ask about business questions, because I know nothing about that. All I will say is that it seems like this is going to be very, very popular over the next few years, because when it works, it works really, really well.
E
Awesome, thanks for that. On to the next question — and this is actually kind of funny: as I was looking at the articles I've been researching, you were actually referenced in a number of them. There's been a lot of research done on Copilot and the detection of, you know, bad, insecure, malicious, or verbatim code.
B
Yeah, so this is a great question, and I've kind of gone back and forth on this. Initially, we did this big evaluation of Copilot on a whole bunch of different classes of vulnerabilities, where we made these little toy scenarios like: I'm about to insert some data into a database — Copilot, how would you do that? And it would say, oh, I will use string concatenation to put your SQL into this database. That was obviously a very bad idea, because it was a SQL injection vulnerability, and when we measured the rate at which it does that in that study, it was something like 40% of the time, which was quite bad — quite alarming. But then we thought about it a little bit more and said: well, okay, but (a) we don't know how often human users would make that kind of mistake, and (b) if it's being used by a human, then presumably they also have a chance to look at the output and say, oh, that doesn't make any sense, or, oh, that's obviously vulnerable — and fix it.
B
So we actually did do a user study — and this paper we just submitted two days ago, so I'm happy to share a copy of it. We did this user study comparing the functionality and security of code produced either by hand or with the assistance of Codex, and, somewhat surprisingly to us, the two groups came out very close. These were undergraduate and master's students in computer science, and they were writing in C, where it's very hard to write good code in the first place, but the rate of vulnerabilities between the two groups was basically the same, as far as we could determine. So that's maybe a little bit encouraging on the one hand, but also discouraging on the other — encouraging because it means it's not making things much worse.
B
But our working hypothesis at the moment is that, because the models were trained not to write good code but to very accurately predict what would come next in a source code file, if the code that you write is not very secure or not very high quality, it will very faithfully write insecure and low-quality code for you — and so it did tend to match the quality of the code written by the user. So that's kind of my main ongoing research area, and we have a few fun ideas for how to address it. One is that we can try to actually patch these models, and there's a bunch of different ways you can think of doing that.
B
For example, you could collect a new training data set and annotate it based on some judgment of its code quality, and then train with a quality tag followed by the training data. That would let the model have some idea of "this is how I produce high-quality code," and then, when you are generating actual suggestions for a user, you would always include the high-quality tag, so the model is strongly influenced to produce higher-quality code, even if what's already there is not as good. Something like that could work. We've also been looking at ways of doing this without having to do any additional training.
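
A toy illustration of that tag-conditioning idea — the tag strings and code snippets here are hypothetical, not from a released system:

```python
# Training examples are prefixed with a judgment of their quality...
secure = 'cur.execute("INSERT INTO t VALUES (?)", (val,))'      # parameterized
insecure = 'cur.execute("INSERT INTO t VALUES (" + val + ")")'  # concatenation
train_examples = [
    "<|quality:high|>\n" + secure,
    "<|quality:low|>\n" + insecure,
]

# ...and at suggestion time the prompt always leads with the high tag,
# steering generation toward the high-quality region the model learned.
prompt = "<|quality:high|>\n" + "def insert(cur, val):\n    "
```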
B
There are some very, very recent approaches that try to directly edit things the model knows by just updating the floating-point weights, and this works in natural-language models for things like changing facts. The example they had in this paper, called ROME, was: you say, "The Eiffel Tower is located in," and the model says "Paris," and you want to update it so that it says the tower is in Rome — and they actually figured out a way to do that, and it then affects other things the model knows: you can ask what landmarks are near the Eiffel Tower, and it will say things like the Colosseum. So I'm currently working on trying to make that work in the context of code, where I'm hoping to be able to say things like: "the function you use for hashing passwords is" — the model currently wants to say SHA-256, and that's not really correct. I want it to say bcrypt, which is what you should use for hashing passwords, as opposed to other kinds of data. But again, that is definitely research — it is not a product that can be deployed today.
B
Unfortunately, the recommendations in the meantime are mostly the same ones you would give to people writing code by hand: you should test this, you should use security scanning tools, you should maybe try fuzzing your software. And I guess, pessimistically, I expect that advice to be taken up about as well as it is for human-written code.
B
Yeah — I mean, I think it definitely works very poorly for languages that are not well represented on GitHub. We found it's awful at writing Verilog, which is why we ended up fine-tuning our model on Verilog, and I would expect this to extend to other languages that are more niche or less popular — probably including things like Haskell, or OCaml, or other things in the Lisp family, which have just never become all that popular — and even newer languages like Rust, which is becoming very popular, but there's still just not nearly as much Rust code out there as there is Python code. So I would definitely expect it to do pretty badly on languages like those. I'm trying to think of other cases where it will definitely do very poorly. Without additional prompt engineering, it's going to do worse and worse the further down you get in your code, because these long-range dependencies get further and further away and fall out of the context. It's just not going to be able to know anything about them, right, because it can only remember up to the last 2048 tokens — beyond that, it's just not going to know anything about anything outside that window.
A
If you can just quickly — I know we've talked about this, but if you can share with the audience: what do you think, where are we heading with large-scale models and quantization? Where are we going with this?
B
Yeah — I'm actually very excited about where this has been going so far, because it seems like people keep finding ways to make these models use less and less memory and do inference faster and faster, without seemingly harming the quality of the output at all. It was sort of shocking to me that you could take a language model and compress it down to just four bits per parameter without really hurting its performance, and I think that trend — I mean, it has to stop somewhere, because you can't use zero bits per parameter — but I think we will still be able to go down even a little bit further. So I think that's going to be really nice: quantization is going to be a big area, and right now only a few models have been quantized, but I think many more could be. And then, as far as just plain speeding up inference, this is something we actually have a couple of students looking at right now.
B
They're looking at what optimizations we can do on the existing code to make it run faster on existing GPU architectures. That's going to be a little bit tough, just because, for example, FasterTransformer is written by someone at NVIDIA who presumably knows the NVIDIA graphics architecture really, really well — but we do think there are some opportunities for speeding things up even there. We're also looking at how you speed up inference in the case where, say, you've got a data center with some fixed capacity, you've got lots of requests coming in, and at some point you start getting overloaded — so, is there a small reduction in quality that you would be willing to sacrifice in favor of keeping inference time very low?
B
There are some more advanced strategies you can start to deploy at that point. For example, much of the time when you're doing inference, you don't even have to go all the way through the model before it's pretty clear what the next prediction is going to be, so you can bail out halfway through that inference step and say: yep, the next token is, with 90% probability, going to be this one — just return that one instead. That seems to work pretty well. So I think things are only going to get faster and cheaper, and pretty quickly.
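
A conceptual sketch of that early-exit idea — a simplification of the research direction he is describing, not an existing API; in practice each layer would need its own calibrated prediction head:

```python
import torch

def next_token_with_early_exit(layers, lm_head, hidden, threshold=0.9):
    """Run decoder layers one at a time; stop as soon as some token already
    has probability >= threshold under the language-model head."""
    top_id = None
    for layer in layers:
        hidden = layer(hidden)
        probs = torch.softmax(lm_head(hidden[:, -1]), dim=-1)
        top_prob, top_id = probs.max(dim=-1)
        if top_prob.item() >= threshold:
            break  # confident enough: skip the remaining layers
    return top_id
```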
A
Yeah, agreed — thank you for that. I think, from both me and Taylor and everyone: thank you so much for the time. But we also want to know if you have questions for us first.
B
Yeah, I'm trying to think. So the main thing is really just: are there things that you've found about the way things are currently implemented where you'd say, "oh gosh, it would be much better if it were done this way," or "this is clearly not something we could deploy, because something is fundamentally broken here," or something like that? I know that Fred has been very kind in providing pull requests to help improve it in many ways so far, and I've been trying to make sure I give some attention to his code suggestions. But yeah — basically, how can I — and I will caveat this by saying that I don't have tons of time — but are there ways that I can help make FauxPilot better for the use cases that you have in mind?
C
I think you are right on the money with prompt engineering — I think that's going to be the major differentiator, because I've been playing around with it, and whether what you get back is actually meaningful really greatly depends on what you give it in the prompt. I'm currently working on authentication, because we don't want just anyone using the compute we're going to host. But I've also been working on the VS Code extension for the official GitLab workflow; I think that will be pretty awesome. If we can also promote that within the project, we can get contributions there, because I think there are a lot of people with really clever ideas about prompt engineering — right now it's just the most simplistic thing there is: I think it just takes the last however many tokens and passes them along, and yeah, it's not optimal.
B
I've definitely wanted to be able to point users to an extension that works well, particularly with FauxPilot, because right now you can hack up the GitHub Copilot plugin in various ways to make it talk to FauxPilot, but a lot of things might be a little bit broken. It'd be really, really nice to be able to say: hey, look, here's an actual Visual Studio Code extension that you can install that will work with FauxPilot specifically. Then I'd definitely be very happy to start asking for — and hopefully contributing — some ideas for how to make the prompt engineering side better.
B
Wonderful, okay. I would absolutely love to tell, you know, my 17,000 followers on Twitter about that, and promote it on the project page. That's really cool.
C
That
will
be
awesome,
yeah,
yeah
and
I.
Think
we
all
also
discussed
is
like
cicd.
That's
probably
something
that's
going
to
be
quite
crucial,
so
right
now
I'm
hosting
it
for
on
gitlab
that
does
have
cicd
but
yeah.
Maybe
we
could
have
a
follow-up
conversation
on
that
on
how
to
make
that
more
publicly
available.
B
Yeah, yeah — I would be very happy to do that, and I may actually even be able to run some kind of CI/CD server, just because I think a GitHub Action still doesn't support things like actually having a GPU available.
A
Yeah, we can definitely do that with GitLab, with our GPU-enabled runners. So yes, that's something we do support.
A
As said, we already use GPU-enabled runners inside our team, so we could definitely help you with that as well, for sure. But other than that — anything else, Brendan, we can support? And really, thank you so much for this.
B
Sure, yeah — and I guess we're a little bit over time, so I don't want to keep everyone, but thanks very much. It's great to talk with folks, and I'm very excited to hopefully be able to make FauxPilot a lot better and have a lot more people actually being able to use it to build things publicly.
A
Thank you, thank you. Before we actually wrap up — there's a feature wish list for FauxPilot, which is basically everything we can use if anyone wants to contribute: a whole list that Brendan actually put together for us. So thank you for that. And then, yeah — well, thank you everyone for the time, and thank you, Brendan.