From YouTube: How to launch product experiments at GitLab
Description
Growth team talks about how to do product experimentation
Training deck
https://docs.google.com/presentation/d/1nmStWChWkYad9K-dced9wS4jS7XLIrHB-WKafc7jrMU/edit#slide=id.gca4c496ea4_0_0
Hila: Awesome, so welcome, everyone. Thank you so much for joining this experimentation workshop. Many team members from the Growth team contributed to these training slides: Sam, Emily, Phil, Caroline, and myself. We will walk you through how we launch experiments here at GitLab. Sam, do you want to move to the next one?

For anyone who hasn't done many experiments before, you may ask: why experimentation? Why does it matter? I want to share an example.
The Growth team actually read about this in a book we read together in a book club, called Trustworthy Online Controlled Experiments. A while back, an engineer on the Microsoft Bing search team had an idea about how to tweak the search results. The top of the slide shows the previous search result format; his idea was to move some of the detailed copy into the title, but he never thought this would change anything. He felt it was just a small change, and the idea sat in the backlog for six months. One day, probably just bored with the regular work the PM had assigned him, he decided to spend a few hours coding up the experiment. Within a couple of hours, the entire company was getting alerts basically saying revenue was too high. That sounds like an awesome alert, but the entire company didn't know what had happened. He went to investigate the experiment and realized this small change had actually brought a 12 percent increase in total Bing revenue, which translates into a hundred million dollars a year for the US alone, and they could implement the same change for other regions as well.
So why experimentation? It really comes down to this: sometimes you expect something to work and it doesn't; sometimes you expect something won't work and it works like a miracle, as in this case. Human beings are very confident in our judgment of how things should work, but a lot of the time that's not how our customers and users will actually behave.

With digital products such as GitLab, we have a unique advantage: we can test different product changes, flows, and new features, and observe the real change in business metrics rather than guessing. It's a unique advantage that all of us should leverage. Sam, do you want to move to the next one? So yeah, this is the high-level experimentation process we use here at GitLab.
It starts with coming up with ideas, ideally generating as many as possible, so you have an idea backlog. Then you select the best ideas; we will talk about the framework we use, which basically weighs how much impact a change could bring against how much work it is, to pick the best ideas to test first. Then comes the phase where we turn an idea into an actual experiment. That requires a very clear and concise experiment design; we will share the template we use to write experiment designs here at GitLab. We need our UX team to help develop the different versions we want to test. For example, in the Bing example, you have a version with the copy at the bottom and a version with the copy in the title. That's a very small change, but some experiments require quite different UX designs. Then we need our awesome engineering team to help us implement those versions: there is the control, which is basically the original version, and we may have a version one or version two with a different experience or design for the users. After the engineering implementation, we move on to a testing phase: we do staging tests, and we also roll out gradually rather than launching the experiment to all users at once. After going through that launch phase, we collect results and analyze them to see whether the experiment worked or not. We add the experiment to our knowledge base, which is basically the place we document our past and currently live experiments, and we collaborate with the data team to analyze the results and find out whether it worked. Finally, we need to clean up the code: experimentation usually requires some extra code to wrap the experiment, and we want to remove it afterwards to keep the code base clean and set the foundation for future experiments.
Sam: Sure, thanks, Hila. To kick off the process Hila just highlighted, it's important to collect as many ideas as possible and build up a bit of a backlog of experiments you might want to run. You can go about that in a lot of ways: reviewing the qualitative and quantitative feedback and data you have is really valuable for understanding that, but it's also important to chat with your team and work through workshops.
In particular, look at your data and see what it is saying, and what customers are saying in feedback. Do you see a large drop-off in a funnel, or people getting hung up on a particular area, where we're not totally sure what the solution to that experience is? That's an opportune moment to start thinking: is this a place where we could run an experiment to learn and potentially improve the user experience?
A first consideration: if it's a poor experience for the user today and we believe we have a fix that simply improves it, we should ship that fix and improve the customer experience. We don't need an experiment to pat ourselves on the back and say that we improved it by X if we know the current experience is broken or close to broken. The second consideration is to ensure you have the data you need to understand the experiment.
Do enough people interact with the area you're exploring for a test to be able to reach significance? And lastly, do we collect data on that particular area of the product at all? If not, you might want to start by adding event tracking or back-end tracking to that area, to understand the volume first.
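For a sense of what that first step can look like in the GitLab code base, here is a hypothetical fragment; the category and action names are invented for illustration, and Gitlab::Tracking is GitLab's wrapper around Snowplow:

```ruby
# Hypothetical fragment for a controller action: record a back-end event
# each time users reach this area, so we can gauge traffic volume before
# designing an experiment around it.
Gitlab::Tracking.event(self.class.name, 'render_invite_banner', user: current_user)
```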
The next step in the process is to write up an actual experiment idea. We have a template in the GitLab project, which is linked at the bottom here in tip one, and it's also in the growth process page in our handbook. It really starts with defining your hypothesis, and it's important to note that your hypothesis isn't tied strictly to a single experiment.
It can be a broad statement about the area you want to explore through testing, because one experiment could be invalidated (your treatment simply didn't win out over the control), but that doesn't mean your hypothesis is; you may have a follow-up test to run to try to prove the same theory.
Emily: Awesome. Now, from the UX design perspective, as a designer: the UX process in experimentation isn't too different from other design work at GitLab, but the main thing is we want to ensure the proposed solution is small, the smallest thing we can do to get a relative change, and that success is defined and measurable.
Some of the big dos we follow as designers in this area: explore multiple options early on and share them with the team; use existing patterns, unless the experiment calls for a different or new pattern that we want to test; and get cross-functional feedback, as you normally would, to check whether this is reasonable to do and whether it will make the biggest change. We've also created a Figma template for experiments that designers can use. And if you're getting different data than you expect,
consider a follow-up usability test to understand why what you're getting back is different from what you thought you would get from the experiment. Then some small don'ts: don't spend time refining details that won't impact the experiment results; that can be done during cleanup. An example would be making a very tiny change to an icon that isn't relevant to the experiment and won't impact results. And don't make UI changes outside the scope of the experiment.
Phil: Thanks, Emily. On to the engineering implementation phase. Now that we've got a well-defined experiment and design, we need to implement it in GitLab. Growth engineers at GitLab use GLEX, the GitLab Experiment gem, a project that lives under the gitlab-org group.
I call this a second-generation experiment framework. Initially, our growth engineers implemented experiments within the GitLab code base using a module and standard GitLab development feature flags. We've iterated on that process and have now settled on GLEX.
GLEX uses a custom experiment feature flag type that still supports the approach used with development feature flags, but our experiment feature flags tend to live a little longer in the code base than development feature flags do. This will be familiar to anyone who's used to developing features behind a feature flag in GitLab, including supporting tools such as ChatOps for enabling and disabling the flag itself.
GLEX supports A/B testing and multivariate experiments. For a basic A/B test, we need to define both the control and the candidate experience in the code base; for a UI test, this could be defined in a controller action. This builds on the approach used in the open-source Scientist gem. The GLEX project README covers the different types of implementation, and there are really good examples in there.
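For illustration, here is a minimal sketch of a basic A/B test defined in a controller action with GLEX. The use/try block style comes from GLEX's Scientist lineage; the experiment name, actor, and view names are hypothetical, and the GLEX README remains the authoritative reference:

```ruby
class GroupsController < ApplicationController
  def show
    # `use` defines the control experience, `try` the candidate.
    # GLEX assigns a variant per actor, so a given user keeps seeing
    # the same experience for the life of the experiment.
    experiment(:invite_members_cta, actor: current_user) do |e|
      e.use { render 'show' }           # control: the original page
      e.try { render 'show_new_copy' }  # candidate: the revised copy
    end
  end
end
```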
The README would be my recommendation for where to start if you're looking to implement an experiment, but the GitLab code base also includes many examples: there are currently around 20 experiments defined in it, and that's another really useful resource for anyone looking to implement an experiment in GitLab. A good place to start there is searching for uses of the experiment method in both controllers and helpers.
Both front-end and back-end tracking are supported, including Snowplow for the front end and tracking data to the database. Tracking the right data to be able to make informed decisions is not trivial, and respecting data privacy is another important consideration. Please reach out to a PM, analyst, or engineer in Growth if you want to discuss further. Thanks.
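As a sketch of what event tracking can look like (the experiment and event names here are hypothetical; check the GLEX README and the Growth team's conventions for the real patterns):

```ruby
# Track a conversion event against the variant the actor was assigned to.
# On GitLab.com this is emitted as a Snowplow event.
def accept
  experiment(:invite_members_cta, actor: current_user).track(:invite_accepted)

  # ... the rest of the action ...
end
```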
Sam: Thanks, Phil. After the experiment is implemented by engineering, the next step is to think about bringing it to staging and out to production.
Staging is a nice way, before the experiment reaches production, to check that you're collecting the front-end events you anticipated and decided on with your team. Once you're comfortable with the staging experience, you can roll it out to production. When you're doing an A/B test in Growth, we generally start with 20 percent of users or less and leave it there for at least a day, if not a few days.
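Rollout happens through the feature flag that backs the experiment. A sketch of what that can look like from a Rails console, assuming a flag named invite_members_cta (ChatOps wraps equivalent commands):

```ruby
# Start with roughly 20% of actors (users)...
Feature.enable_percentage_of_actors(:invite_members_cta, 20)

# ...then, after at least a day of verifying events and listening for
# feedback, raise it to the full 50% A/B cohort.
Feature.enable_percentage_of_actors(:invite_members_cta, 50)
```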
The reason for starting small is twofold. First, our data is loaded in nightly, so you have to wait at least 24 hours to see whether your events are actually coming into the database. Second, we want to ensure we provide users with a good experience, so depending on how big the experiment is, we listen for anything from support or anybody else internally about the experience before we roll it out to the full cohort size, which for an A/B experiment would be 50 percent. From a tracking perspective, you also want to ensure you create an experiment tracking issue related to your experiment.
The point of that issue is documentation: when was the experiment rolled out on production, and at what percent? If it started at 20 percent, when did it go up to 50? When was it eventually turned off or on, and what was the result of the experiment? At the end of the day, at least in Growth, we treat this tracking issue as the end result of the experiment; the conclusion is defined there.
The other important thing to do throughout this process, as you're planning your experiment, when it launches, and when it concludes, is to ensure you add it to the Growth knowledge base, which is part of the Growth direction page in the handbook. We've recently updated that page so there are sections for planned and upcoming experiments, active experiments, and concluded experiments. Our goal is to provide a space for any team to post when they're planning to run a test, when they have an active test, and when they have a conclusion with the result.
Caroline: For the first item, if you just pop in the name of your experiment, you can see which events are coming through and how they're coming through in production.
The second item is a SQL snippet available in Sisense that will pull in both front-end and back-end events for your experiment. If you find yourself needing extra help, or you need a more robust analysis, please engage the product analysis team by opening an issue in our project; the template is linked there. One thing I do want to emphasize is that we ask you to open the issue at least one milestone before the experiment is going to launch, so that we can plan accordingly.
The last two items here, the experiment framework and the SOP we've recently developed on the Growth team, can give you an idea of how to design and measure your experiment, essentially the steps of when to engage which teams, and finally some general experimentation best practices.
Phil: On to experiment cleanup. Once our product managers have made a call on the outcome of an experiment, it's important that we clean up the code appropriately. Our experiments run behind experiment feature flags, and those flags, along with the code, are both technical debt, so we want to clean up as soon as we can.
Some of the considerations are whether this should run on SaaS, self-managed, or both, and whether it's EE- or CE-only. We usually need to make changes to tracking, add documentation and, of course, a changelog entry. By default, our growth experiments only run on SaaS, and we do this so that we can move fast. We don't want to be running experiments on self-managed, where it's harder to update and conclude them, so we run on SaaS by default; that's probably one of the first considerations.
If you're rolling the change in as a feature, note that some of the tracking calls are custom to experiments and need to be either refactored or removed. And if you're rolling it out as a feature on self-managed, consider converting the experiment feature flag to a development feature flag, if that's your standard practice. If an experiment has not been successful, the code and the experiment feature flag can be removed from the code base, and any learnings can be applied through follow-up issues.
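To illustrate what cleanup means in code (a hypothetical before-and-after, reusing the earlier sketch):

```ruby
# Before cleanup: the action is wrapped in an experiment.
def show
  experiment(:invite_members_cta, actor: current_user) do |e|
    e.use { render 'show' }
    e.try { render 'show_new_copy' }
  end
end

# After cleanup, assuming the candidate won: the wrapper, the control path,
# and the experiment feature flag are all removed.
def show
  render 'show_new_copy'
end
```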
Sam: Thanks. Moving on to tips for creating effective experiments. There isn't always data to support running a potential experiment, and you may not feel you have the strongest footing to judge whether that test is worth running.
One approach you can take, if it's a larger initiative that could take a lot of work to build, is a painted door test. In a painted door test, you design a piece of the UI to see whether users will actually interact with it, without building out the full experience. The idea is to run it for a limited amount of time with a limited number of users, just to understand enough
about whether users are interested in that particular thing. We've linked to an example painted door test the Growth team ran, which helped us understand whether non-owners and non-maintainers were interested in finding out who could invite users to a particular namespace.
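A painted door can be as small as rendering the entry point and tracking clicks. A hypothetical sketch (names invented; not the actual experiment linked above):

```ruby
def members
  experiment(:invite_link_painted_door, actor: current_user) do |e|
    e.use { render 'members' }                   # control: no new link
    e.try { render 'members_with_invite_link' }  # candidate: painted door
  end
end

def invite_interest
  # The painted door's click target: record the interest, then explain
  # that the capability isn't built yet.
  experiment(:invite_link_painted_door, actor: current_user).track(:clicked)
  render 'invite_coming_soon'
end
```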
In that book Hila mentioned, one of the examples from a team at Microsoft was that they would slow down a particular area of the product by a few milliseconds and measure what that actually did to adoption between screen one and screen two of a feature. That drop helped them understand what the perceived benefit would be if they improved infrastructure speed by that same amount. The other thing to really highlight here is to check your experiment tracking early and as often as possible.
The reason for this is twofold. One is for your team: you want to know you're getting the data you need to be able to understand the results of the experiment. Two, we always want to protect the user experience. If we start to see trends in the data showing the experiment is winning out, that's great.
We can start to look at significance and understand when it's validated and should be rolled out to 100 percent. Or, if we run into an experience that's quickly becoming invalidated, where it's not as good as the control, it's important that we understand that and are ready to return the control to 100 percent of our users. And then, lastly: if there's little difference between the two results in your test, an A/B test for example, it's important not to see that as useless for your team. You should use those results to try to understand what follow-up experiments, if any, you can run to try to validate particular areas of the experience.
I know we covered a lot through two slides here. This one is full of resources and documentation on where you can find how Growth works and how each specific team works; we have a lot of content, so feel free to check out our handbook pages. And then, lastly, you're always more than welcome to reach out to us.
You can always start by creating an issue and at-mentioning us in it. I know I can speak on behalf of the PMs: we would be more than happy to provide feedback or answer any questions for other PMs or anybody else who's interested, and I'm sure the engineers are as well. You can also find us on Slack in the growth channel.
Audience member: One of the things I noticed, and I think it was Emily who mentioned it: when you're designing an experiment, you don't want to make any other code changes to that area during the experiment. But since Growth launches experiments on top of code that other people are also building on, how do we safeguard that? Or is it sometimes okay to play in the same space?
Hila: Yeah, I think that's a great question. Emily, I want to hear your perspective, and Sam's too. My perspective is that, first of all, in areas we own or have more control over, we definitely try to avoid it. For example, two PMs shouldn't be launching two experiments in the exact same area, changing different things; that would mess up our results.
Second, we should ideally launch A/B tests. In that case, even when there's an external force outside Growth making changes, that change should impact version A and version B the same way. The only difference we make between version A and version B is the change we want to test, so we can still tell the effect of that change.
The last point, which I think is important, is to keep open communication with other teams. Like with your team: we appreciate it whenever you share updates; that's super helpful. Whenever we branch into another team's focus area, for example when we're working on Verify adoption or Secure adoption, we make sure we have close communication with that team and share our roadmap and plan, so that if there are any potential conflicts, we can identify them early on. But yeah, there's no perfect solution, so we try all of these ways to minimize the impact.
Phil: Because we're using feature flags (we have a custom experiment feature flag type), this is something our engineers are very familiar with. When they're looking in the code base, they can see that something is behind a feature flag; whether it's an experiment or not, they're used to that case and should know how to introduce new changes. It does happen, though.
We're certainly not doing anything to stop other teams from changing their code while we're running an experiment; it's very similar to a development feature flag your engineer would be looking at where they're implementing that change. And as Hila alluded to, we would hope that such a change is implemented on the control rather than on a variant we're testing. If you use git blame, you'll see which engineers worked on the experiment, and they'd be more than happy to answer any questions.
Caroline: One thing I want to piggyback on, as Hila mentioned, is open communication being really important. I think as we start to increase velocity, we're going to have to make sure we're not tripping over each other. There was also an item on a previous slide Sam was covering about why you should or shouldn't run an experiment...
Hila: Going once, going twice, going a third time... If not, thank you, Aaron, for coming. I will share the video afterwards. Again, if you have any questions we can help with, feel free to hit us up in the growth channel. Bye.