From YouTube: A/B testing Think Big - Initial meeting
Description
https://gitlab.com/groups/gitlab-org/-/epics/2966
We discuss our initial overall idea of the new A/B testing product category.
A
We briefly discussed what was already there: we met up with Unleash, and there was a call with the Unleash folks. A performance A/B test is going to be their top goal of '21 — we didn't really discuss that yet, but I take that as something I read somewhere — and Unleash plans to move into that direction as well. They have variants, which is kind of like a limited version of A/B testing, and we got into some details around how that works for them; it's currently restricted to fixed percentages in Unleash, and this feature is stable.
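For illustration, a rough sketch of the fixed-percentage variants being described. The YAML rendering here is hypothetical — Unleash configures variants through its API/UI — but the name/weight structure mirrors Unleash's variants feature:

```yaml
# Hypothetical YAML rendering of Unleash-style variants with fixed weights
feature: checkout-button
variants:
  - name: control        # baseline flow
    weight: 50           # fixed percentage of traffic
  - name: blue-button
    weight: 50
```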
A
So we are seeing Unleash as partners instead of competitors: we give them great exposure and contribute back to the open source project. So, are you still there with it, or [unclear]?
A
Yeah, so let's jump into the next point, which is where we want to take this meeting. Ultimately this was intended as a brainstorm meeting — you requested it in my PTO management issue — so what exactly are your goals? What is your expectation towards the deliverable coming out of this meeting?
B
So I actually set up a whole day for you — tomorrow, I think — to just deep dive into the research outcomes.
B
Open up some implementation issues, come up with a Think Big proposal, similar to what you did with flow A in the three-year vision. So I want to come up with a big picture first — what we want to achieve, a little bit about how it's going to look — before we drill down into the small nitty-gritty details.
B
So if we zoom out for a minute, let's just talk about the goal of A/B testing, which is, you know, to experiment with different variants in your code and to get feedback from these different variants, in order to make a conscious decision about which flow of the code you're actually going to keep over time, right? And when we're talking about A/B testing, we're talking about experiments as a title. We already did one brainstorming session with the developers, and some things came up from there, like: experiments are bound in time.
B
Okay, so these are the questions — the big questions that I think we need to answer before going into A/B testing — and then we have a whole new topic, which is new personas.
A
Existing [ones too]. So the idea for tomorrow is that we're gonna create like a happy path, right? We're gonna figure out, all right: how is this flow gonna work?
B
Yeah, so we have — I would say we have a definition phase.
B
A decision phase.
B
And we also discussed the ongoing one: we have collaboration — discussion with the team.
B
Okay, I'm back — sorry, my internet has been giving me problems all week. Okay, so where were we?
B
You keep breaking up in the middle of the sentence; it's hard for me to understand the question.
A
Should we — I think it's better if we disable the video; we don't need it. Okay.
B
So we want collaboration. This is where, in the middle of the experiment, the team has a place to discuss it. If you need to change something in your experiment, like adding a variant or changing a percentage, this is where you would do it.
B
So again, when you're monitoring your ongoing experiment, sometimes you need to change the percentages, or change your variants, or totally get rid of a specific variant. So this is the place — it's what I call the ongoing tracking phase.
B
Then we have getting metrics and analysis. Probably someone technical is going to change something in the code, but someone like a product manager is going to oversee the metrics and analysis and say: okay, no one's actually hitting this blue button, let's get rid of it. And then they'll probably open an issue for a developer.
A
Yeah, I think this is one of the difficult things, depending on how you want to approach it — how are we gonna support it? From the research, if I remember correctly, they want to have a single metric they can drill down into, to see: all right, what is the health of my experiment? And this is what we're gonna need to connect to whatever they are using to collect the data, right? Are we going to support one single tool, or...?
A
Yes, we are using this internally. When I think back on unit tests, which have a standardized XML format, I believe — if the system finds such an XML file exported as an artifact of the pipeline, it will see that and say: all right, this is a genuine test report, I can present this information regardless.
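For illustration, a minimal `.gitlab-ci.yml` sketch of the mechanism A describes: a job exports a JUnit-style XML report as a pipeline artifact, and GitLab detects it and presents the test results (the job name and pytest runner are placeholders):

```yaml
# Job exporting a JUnit-style XML report as a pipeline artifact;
# GitLab picks up the `junit` report and renders it in the UI.
test:
  stage: test
  script:
    - pytest --junitxml=report.xml   # any runner that can emit JUnit XML works
  artifacts:
    reports:
      junit: report.xml
```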
B
Yeah, so what I think we need to do is get help from the Monitor team, because they're agnostic to any metric system — they use Prometheus heavily, but they can connect to any of them. So I would assume it's very similar in that sense.
A
All right. Will you be included in such conversations? I mean, are they aware that we're probably going to need to make use...
B
...of that? This is just an assumption that I have. I haven't spoken to anyone from the Monitor team yet about it, but I plan to ask for their help.
B
You can probably also use the Growth team, because the Growth team is using Snowplow, and they are also using feature flags and doing some kind of experimentation. So we need to figure out how they're doing it and then just copy-paste, I guess. The Growth team is also a really good resource.
A
What are they trying — what are they tracking? I actually have a lot of information on that. How do the metrics come into play? I've also had a discussion with one of the engineers about how the Growth team is using, you know, Flipper — how that flow works from a developer point of view — which is, I think, pretty interesting. It might actually make a lot of sense for you to watch the recording of one of those.
A
Yes. In terms of the A/B testing research, there are still a couple of interviews to be done — this is one of them — at least to be done and to be tagged as well. In some of the earlier interviews my note-taking was less than optimal, so I have to redo the note-taking for the other ones. They're all done, and I think [they're useful] especially if you want to, you know, connect with the Growth team.
A
There is an audit happening; they're getting the metrics analysis to some extent. So this is interesting — digesting the data. How will it be presented? Do you have your reports?
B
I'm not sure it has to be part of the MVC, but if we're talking about a Think Big: some of the interviewees mentioned that they have to export reports that prove the results of an experiment and why they made a decision to leave something in. So I think it would be really convenient if we could create such reports — at least graphs to be exported into a spreadsheet or something like that — so that they can later be used directly from the tool.
B
The format doesn't really matter — it can be Excel, it can be PDF, it could be whatever — but it should contain the data of the analysis, the graph, you know, the name of the experiment, the duration of the experiment, the results, and the final decision of which variant was chosen at the end of the day.
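A hypothetical sketch of what such an exported report could contain, using exactly the fields B lists above (all names and numbers are illustrative, not an existing GitLab format):

```yaml
# Hypothetical exported experiment report (field names and numbers illustrative)
experiment: blue-button-checkout
duration: { start: 2021-01-04, end: 2021-01-18 }
results:
  control:     { clicks: 1204, conversions: 25 }
  blue-button: { clicks: 1831, conversions: 62 }
decision:
  chosen_variant: blue-button
  rationale: "higher conversion, not just more clicks"
```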
B
It's just something that came up from some of the interviews, and I thought it would be useful.
A
[To] explore — I think digesting the data is one of those important things. What do you think about... you've got a lot: you've got your main metrics, there could be side metrics, and you've got the metrics you're going to base your decision on, which are not directly, like...
A
So let me give you an example. Say there's an experiment going on — a website with two different buttons, like your theoretical example — and we're measuring clicks, but the click is not the thing that is gonna persuade us that one is better than the other, right? It's: did we get a greater conversion, because it was part of some kind of e-commerce workflow?
A
So in that sense you would want to say: all right, in flow A, indeed, the only difference was this button — but did they actually buy more products? Did they buy a larger amount of product? Did they spend more money? So then you'll have to derive that from the clicks, and, you know, it's not as easy as saying: oh, in this experiment 51 people clicked the button. Yeah, but did it actually help them, you know, buy the product, yes or no? So that...
B
...goes back to the first issue, which is defining what the goal of the experiment was, because I think you need to know what you're measuring in order to make a good decision. If you're counting clicks, that should be the decision point. But if it's not, then why even bring it forward, if it's not interesting?
A
Yeah — what I'm trying to say is that the main health metric of the experiment is not always exactly the same as the direct metric you are monitoring. Many clicks don't always mean the health of the project is going well. But yeah, it is part of the definition phase. I'm just wondering how much finesse or granularity we are going to offer our users in defining the main metric to be presented, to say: all right, it's good or not.
B
Yeah, I see what you're saying. I think for the Think Big we definitely need to take this into consideration, because, you know, the number of clicks will tell you how many people actually interact with your new thing, which is interesting — but yes, it doesn't necessarily convert to revenue. So I guess in the goal setting we would probably need to define the decision metric, which would be revenue, and then you would have a place to define supporting metrics, maybe even in a YAML file.
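Following B's suggestion of a YAML file, a hypothetical sketch of such a definition with a decision metric and supporting metrics (all keys are illustrative; no such GitLab file exists yet):

```yaml
# Hypothetical experiment definition file (all keys illustrative)
experiment: blue-button-checkout
goal: increase checkout conversion
decision_metric: revenue      # the hard metric a winner is chosen on
supporting_metrics:           # observed, but not decisive on their own
  - clicks
  - cart_size
variants:
  - { name: control, weight: 50 }
  - { name: blue-button, weight: 50 }
```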
A
Check. What would you say is more part of the definition phase, from your point of view?
B
So I think, you know, in order to make a decision — or even if we wanted to automate a decision — you need to have a hard metric that gives you a clear winner. So if we're measuring revenue, you want to see what converted into the highest revenue.
B
So you need to find a way to measure and compare the variants and choose the winner. And if you're talking not about a comparison measurement, but about seeing, I don't know, that the number of clicks went up by 10% — that's the end of the experiment, or something like that. So there are different ways to end an experiment: it can be that you reach your goal, it could be duration.
B
If you remember the interview we had with Booking.com, they mentioned that they run experiments for two weeks. They have a set date: every experiment runs for two weeks and then it's done; they collect the data after two weeks and decide based on that. So it can also be duration, regardless of, like, a season.
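Extending the hypothetical definition sketch above, the two ways of ending an experiment that B mentions — reaching a goal or a fixed duration — could be expressed as stop conditions (purely illustrative):

```yaml
# Hypothetical stop conditions, extending the definition sketch above
stop:
  any_of:
    - metric: clicks        # goal-based: end when clicks are up 10%
      increase: 10%
    - duration: 14d         # or time-boxed: a fixed two-week window
```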
A
Let me think. So, the definition phase — do you also consider... I always think, you know, part of the discussions happen there as to how we're gonna do this in the product, like what the experiment changes are gonna be. There's the development part of it as well, right?
A
For experiments it's: goal setting, so you know what you're measuring; and development of the experiment, which is setting up tracking and setting up the different variants.
A
And I would actually say, you know, part of this is also the design and discussion leading up to that.
B
So in my mind it was always a single feature flag, and I think we can do something really nice. I know you hate it when I go solutionizing, but I think we could do something really nice where a combination of experiments, which are flags, could be tied under one epic — like the mother of experiments.
A
So how would you see that working across A/B tests? Like, I'm thinking of a flow here. Say you have an e-commerce flow where you're buying a product, and you have to, you know, authenticate with your bank — which is a different product — and then come back to the original e-commerce website, which leads you further into buying the product and finishing up, making you aware that they sent you a confirmation email.
A
So in this situation the A/B test would go beyond just the initial project you're developing for, and perhaps you're also part of the bank's thing. In that case a single feature flag will not tie to a single experiment — an experiment can span beyond a single feature. Or do you think I'm misinterpreting this?
B
In that sense there isn't a difference between A/B tests and feature flags: every feature flag relates to a single feature, so I would say an A/B test relates to a single feature. Having said that, you can have multiple feature flags that are turned on and off with different strategies in different environments, and you need to manage them all at an instance level or at an environment level.
A
Somehow. So let me give a different example, just to test the waters here: say you have a microservice setup of your application.
B
Yes. So, regardless of the fact that it can span different projects — again, what's interesting is the environment level. If the projects are all deployed to the same environment, it's really important to see them all at once, but it's also important to measure each one of them individually. And I think, especially if you're talking about a microservice architecture, the ability to silo one of those experiments is really important, because you can also decide that one of them is finished and achieved its goal, but the others have not.
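A hypothetical sketch of how one experiment could group flags across microservices at the environment level while each stays individually siloed, as B describes (structure illustrative; not an existing GitLab feature):

```yaml
# Hypothetical: one experiment grouping flags across services in one environment
experiment: checkout-redesign
environment: production
flags:
  - { project: storefront,       flag: new-checkout-button, status: running }
  - { project: payments-service, flag: new-bank-redirect,   status: concluded }  # siloed: done
```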
A
Okay, that actually makes sense, especially if it spans... yeah, an A/B test can be so big. On the other hand, if we go group level immediately with A/B tests, it does mean that you always need to have a group in order to do an A/B test — which I would say makes sense. But, you know, what if you want to support a single project, with the A/B test within that single project?
A
You would need a group as well. But on the other hand, I think this is something that shouldn't be too much of a problem, because most of the companies we're targeting with this would be larger than a single project, right?
B
Yeah. So what's really important — and we didn't mention this in the user flow — is how to view all the experiments that are currently running in my projects.
A
For the definition phase, like, you know, the discussions around what the experiment is gonna contain — what are you gonna measure — and then the development of the experiment, of course.
B
[If we go] into details and not look only at the high level, then we also need to take into account here — I'm putting this under the questions section — what should we do during a [code] freeze or incidents?
A
Feature flags go beyond, like, deployment interactions, right? It's already...
B
For feature flags we actually opened an issue to disable them when there's an incident going on, because when you're handling an incident you don't know if something is going wrong because someone's playing with a flag or something. So we'd disable everything — we haven't [built that] yet; there's an issue for it.
A
Is that desirable as a thing? I would wonder — we have the feature flag types, right, the different types: some of them are, you know, features that should be around forever, some are access-limiting, some are, you know, short-lived — and this would just say: all right, there's something going on, disable all of them. That would mean that half the project will not be running as it was anymore.
B
So let's say you did an experiment and then you changed it along the way, and now, you know, the users are seeing something totally different. So I'm wondering — I don't know if this is a must.
A
Okay, cool. So we have the definition phase, we've got the ongoing tracking phase, and then we have the decision phase — and this is looping back into that discussion around the experiment brief, I would say. Yeah — do you want to get back into that brief, or that discussion place, where you're going to document what has been decided upon?
B
I think so. I think the experiment needs to end, too — okay, so you make a decision and [feed it] back, right?
A
You still there? Yeah, yeah — that seems good. How about this: we now have some initial flavor, and we make a Figma document tomorrow where we, you know, create initial steps, similar to the three-year vision. We're gonna detail this out a little bit further — how this looks, which subflows there are — and ideally set it up a little bit with a job to be done.
A
Sounds [good]. So let me set up that Think Big document for tomorrow, and I'll add it to the meeting [agenda]. Cool.
A
[On the] document, by the way — a small request. Tomorrow we also have the three-year vision review, and when I was looking back into the document of the three-year vision, there were these notes that we discussed briefly; they were mispositioned. So I was wondering: could you do a small review — give it like 15 or 20 minutes of your time — and write a little piece at each of the steps of flow A? See: all right — hey, this is what I'm thinking, these are my thoughts.