From YouTube: Implementing DataOps with GitLab
Description
Emilie Schario talks about implementing DataOps with GitLab and answers questions from the Customer Success organization at GitLab.
A: Hello everyone, and welcome to another exciting installment of the CS Skills Exchange. Today is going to be an abbreviated session, since we're going to be ending at the top of the hour for Michael McBride's AMA. But today we're going to be talking about how to implement DataOps using GitLab, and we have Emilie Schario here to give us an overview and answer any of your burning questions. So without further ado, Emilie.
B: Thanks, Chris. Hey folks, my name is Emilie Schario. I've been at GitLab for almost two years now. I was the first data analyst at GitLab, I moved into a data engineering role, and now I'm on the Chief of Staff team as an internal strategy consultant. Chris put together an awesome issue which links to the video of a talk I gave at GitLab Commit in San Francisco. I'm not going to give that talk here, but I want to share the general ideas behind it. So the first thing is: what is DataOps?
We all know what the DevOps figure eight is, right? I don't need to explain that piece to you, but DataOps builds on top of it. Let me switch marker colors. DataOps is about applying the principles that we've seen be effective in DevOps to data. The key difference is that data comes into this: you treat it, you develop on it, the way you would in DevOps, the way you would build a software application, and then what comes out is your analysis.
The bottom of the pyramid is reporting. These are what I call facts. Reporting is things like how many new users signed up on our website, or how many pipelines were run last month. They're very fact-based questions: you get the information. Then the next level is what I call insights. Insights are where you combine two pieces of information from different data sources, so that you actually provide business value.
As a really practical example around your work at GitLab, you might be interested to know which stages of the product people use that indicate they're going to keep using the product a year from now. (Sorry, quick aside.) So that's an insight: two pieces from two data sources combining to give you valuable information. And then the final piece, the top of the pyramid, is what I call predictions. We're not talking about fancy machine learning models or anything like that. It's just saying: okay, based on X and Y, where do we expect to be 12 months from now?

Most data teams are still stuck in the bottom square right here, reporting. That's because they spend a lot of their time building analyses to answer simple fact-based questions that break when something in their data model changes, so they spend all their time maintaining here and never get to move up the stack. The big problems data teams have are data integrity, data quality, and data reliability, and DataOps is the way to combat that. So that's the short version.
C: [question not captured in the recording]

B: Great question. There's one thing I'll start by answering, and that's by saying there are two general approaches to data movement. The first you may have heard of: this is ETL, right? Extract, transform, load. Depending on who you're talking to, this is going to be the norm they talk about. It's kind of an older way of working; the norm today is ELT: extract, load, and transform. And the reason for that is that your business logic will change.
Let's think of a practical example, where we're an e-commerce company and we have a website. We might say that a new user is someone who has never visited our website before, period. Over time, as you get more data on your users, you're going to be able to connect that: there are their cell-phone visits and their computer visits, and then also... I don't know if you can see, there's a laptop right there.
B
That's
my
personal
laptop,
that's
right
there,
that's
my
personal
laptop
and
so,
as
you
collecting
more
data,
you're
gonna
want
to
be
able
to
see
like
someone
who
connects
from
their
work
laptop
and
their
personal
laptop
and
their
iPad,
or
multiple
cell
phones
and
all
that
kind
of
stuff,
and
so
your
definition
of
new
user
will
evolve
as
you
get
additional
information.
So
the
appeal
of
the
e-elt
approach
is
that
when
you're
business
logic
changes,
you
don't
need
to
move
all
your
data
again
because
it's
already
stored
in
your
data
warehouse.
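To make that concrete, here's a minimal SQL sketch of how a "new user" definition might live in the transform layer under ELT. This is purely illustrative; the table and column names are hypothetical, not GitLab's actual schema:

```sql
-- Hypothetical transformation: resolve each visit to a canonical user
-- where an identity mapping exists, falling back to the anonymous id.
-- Because the raw visits already sit in the warehouse (ELT), tightening
-- this definition later just means re-running the query, not re-loading.
with resolved as (
    select
        coalesce(im.canonical_user_id, v.anonymous_id) as user_id,
        v.visited_at
    from raw.website.visits as v
    left join analytics.identity_map as im
        on v.anonymous_id = im.anonymous_id
)

-- A "new user" is anyone seeing the site for the first time.
select
    user_id,
    min(visited_at) as first_visit_at
from resolved
group by 1
```

Under ETL, the same change in logic would mean re-extracting and re-loading history before the new definition could be applied.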
There are three major data warehouses on the market: Snowflake, BigQuery, and Redshift, with BigQuery being Google's and Redshift being Amazon's. Those are definitely the three biggest players; I've seen them cover probably 85% of the conversations I've had, with an asterisk being that a big part of what's left is teams that are using Postgres for analytical purposes. That's okay to get started with. It won't scale as well as an analytical data warehouse, but it's a great place for teams to start.
Okay, so data comes from wherever, it gets loaded into your database, then you do things to it, and then it's ready for consumption, and your consumption is handled by a BI tool or something like a Jupyter notebook. At GitLab, data comes from a myriad of sources, like Salesforce and Zuora and product usage and things like that. It gets loaded into Snowflake, so everything in this section happens in Snowflake, and then our BI tool is Sisense.
This is a pretty standard structure for a lot of data teams, where machine learning and data science happen here, but this middle part is pretty similar. dbt, which is what this logo is, is kind of the premier open-source tool that does this. There are a couple of others: Dataform is a paid one, Matillion is another, and people have homegrown versions of this, but dbt is growing in popularity. They've been around for about four years.
D: [question not captured in the recording]

B: So, transform: this is GitLab's dbt project. By the way, this project is public, and on the sales calls that I've been on, people have found it really useful to point customers to it after calls, so take note of this link. This is our dbt project, which has multiple parts to it. Models are kind of the core part of what dbt does, and you'll see that all of these models... I'm thinking of a good example.
Let's use Salesforce, because it's a pretty canonical example. Salesforce opportunities: this is the model that underlies the Salesforce opportunity analysis. You know, we take things from the opportunity table, we do some cleaning and renaming, and we clean some stuff up; this is our transformation. So if we look back, this is the transformation that's being stored in SCM. Our dbt testing is also stored in SCM: if we go back to our dbt project and go to tests here, same thing, you can find the tests, and we'll see snapshots.
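For a rough sense of the shape of such a model (this is an illustrative sketch, not the actual model from the public project, and the column names are simplified assumptions):

```sql
-- Sketch of a cleaning-and-renaming dbt model over the raw Salesforce
-- opportunity table; names here are assumptions for illustration.
with source as (

    select * from {{ source('salesforce', 'opportunity') }}

),

renamed as (

    select
        id        as opportunity_id,
        accountid as account_id,
        stagename as stage_name,
        amount    as opportunity_amount,
        closedate as close_date,
        isdeleted as is_deleted
    from source

)

select *
from renamed
where not is_deleted  -- drop soft-deleted records as part of cleanup
```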
Documentation: dbt ships with really great documentation. We host ours on GitLab Pages, at dbt.gitlabdata.com. You can see everything that's going on there; it takes a second to load because it's JavaScript, but then we can go see that same Salesforce model I just showed you, that sfdc_opportunity right there, and its documentation: the columns, what they are, what's going on here, and you can see kind of the DAG.
So this is why dbt is really great in terms of where all of the rest of GitLab, outside of SCM, fits in. Plan: the GitLab data team uses the Plan features just like the engineering teams would. CI is a really great way for teams to get started. When we use merge requests, because we are making changes against those things that we just saw, we use a process leveraging one of the Snowflake features, called zero-copy clones, where we make a clone of the production environment.
That's a dev environment, and it's created whenever a merge request is created. I can link you to the GitLab CI configuration, but it allows the merge request to run in a clone of production, and that way we can see the data after the new changes and compare its state in the merge request to production, to see what the side effects are.
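The clone itself is cheap because Snowflake shares the underlying storage until data diverges. A sketch of what a CI job might run (database names are hypothetical):

```sql
-- Zero-copy clone of production for a merge request's pipeline to build into.
create or replace database analytics_mr_1234 clone analytics;

-- Torn down when the merge request closes.
drop database if exists analytics_mr_1234;
```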
So that's one way we use CI. The other thing is that some teams use CI pipelines for actually running their dbt deployments. That's hard when you have a large dbt deployment. When I started at GitLab, the data team was three people, and back then we were able to use GitLab CI to manage our deployments, and that was really great. As your deployment gets bigger, so does your dependency graph; it's called a DAG, a directed acyclic graph. When it looks like this, these are only the things related to Salesforce opportunities.
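The DAG comes from dbt parsing ref() calls: any model that selects from another model declares a dependency edge. A hypothetical downstream model, for example:

```sql
-- dbt sees the ref() below and will only build this model after
-- sfdc_opportunity is built; that's how the dependency graph grows.
select
    owner_id,
    count(*) as open_opportunity_count
from {{ ref('sfdc_opportunity') }}
where stage_name not in ('Closed Won', 'Closed Lost')  -- assumed stage values
group by 1
```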
If we were to look at everything in this graph (we're going to update it... you can see how big... and there you go), it's big and complicated and there are a lot of different slices to it, and so we use Airflow, which is the canonical tool here; we host Airflow ourselves, and I'm happy to link you to the docs on that. Does that answer your question, Kurt?
D: Yeah, it does. I mean, the question I get, or at least recently got, is, "How does GitLab do data science?", and that's such a broad topic, right? So for me, my thing is how to filter that down, whether that's with that triangle graph you showed: are you doing reporting? Are you doing analysis? That's the story I'm trying to tell. So I think that gives a high level. You know, data science is new to me and probably new to a lot of other people, so there are still some concepts I'm trying to grok, and some of the tools and the logos I'm trying to understand, like what they do and why one is better than the other. But yeah, it does, I think it's...
B: Two things jump out to me from that question. One: see if you can push to understand what they mean when they say data science, because it's a really ambiguous term. Some people use it to mean data analytics. For example, Lyft and Stitch Fix, two very well-known companies in the data space, say data science and include data analytics in it, just simple, straightforward reporting. That's really different from how we use the term, how I even use the term. I don't think data science is an umbrella term;
I think data is the umbrella term, and data science is a subset of that. But, you know, everyone has their own language. So, one: push to see if you can get a better answer around what they're doing. Do they mean analysis? Do they mean data engineering, which is the movement of data? Do they mean machine learning? Do they mean advanced statistical methods, which is what I would actually mean by data science?
Understanding what they mean when they say data science is going to be the better way to figure out how to push the conversation. Because if the answer is advanced statistical methods, what they mean is Jupyter notebooks. I'm not a fan of Jupyter notebooks, so unless that's what you're looking for, you probably shouldn't recommend that; but GitLab does a lot to make that part easy. So that's number one. And then, number two, I'll work with Chris to get another session on the calendar in the future. Yeah.
A: Absolutely. And please (here's a shameless plug) fill in those issues with comments and questions, as the more you all give us, the more detailed we can be and the more we can do deep dives where they're most beneficial. Joe, I see you typing today, but we're happy to dive deep into any of these different topics we've touched on today. Joe?
B: Oh, actually, while I'm doing that, can you share my screen? Here is our data structure page; I'll drop it into the doc next, but you can see here what the diagram for our data structure looks like. There are a number of third-party data sources, right: Zuora, Zendesk, Salesforce, NetSuite, we've all heard of these. We use off-the-shelf ETL tools such as Stitch and Fivetran to move the data into our Snowflake data warehouse. That's what this big box is, and then we do those transformations.
We talked about them; see those little dbt logos there. That triggers kind of what's ready for analysis, and then people write queries in Sisense that run on Snowflake. So this is the really short version. All of the orchestration that happens here happens with Airflow, so I will drop this link in the chat, but that's a good place to start.
I just muted myself accidentally, sorry. We use a hosted Kubernetes cluster for that. And one of the cool things about the data team project, like I said, is that it's all public: gitlab.com/gitlab-data.
E: An awkward question, which might be naive, but I was thinking: there's this whole ELT which we're running in GitLab, and that determines the target schema into which you process the raw data. Every time you change it, would you rerun those transformations for all the data, including the historical data? How long would that take? What are the performance implications of this?
B: So the answer is: it depends on the volume of data. For some volumes of data, where it's small (data sources other than GitLab.com, for example, like smaller data sources), then yes, we rerun everything every time, because they're small enough that it's not a performance hit, and it's actually better to just rerun the whole thing. Where we're talking about large quantities of data, we only transform the new slice.
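In dbt terms, "only transform the new slice" is an incremental model. A minimal sketch, with assumed source and column names:

```sql
-- Incremental materialization: a full refresh rebuilds the table, but a
-- normal run restricts work to rows newer than what's already built.
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    user_id,
    event_type,
    occurred_at
from {{ source('product', 'usage_events') }}

{% if is_incremental() %}
  -- {{ this }} refers to the already-built table in the warehouse
  where occurred_at > (select max(occurred_at) from {{ this }})
{% endif %}
```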
E: [reply not captured in the recording]

B: For the record, there's also another whiteboard you can't see; I just have it positioned for calls. But cool. So we've got a minute left. Oh, it looks like there's more demand, so we'll definitely work on getting something scheduled for the future. Does anyone have a final question they want to get out?
B: Yeah, so there are two parts to answering this question; it's actually a super complicated question. Because if they're saying, "Hey, we're getting started with data science," my first thing is: do you already have a data analytics organization, or are you actually getting started with a data analytics organization? Because if that's the case, then my answer to them is going to be that they need to start with source-controlling their transformations, just getting started with Git.
For a long time, data scientists and machine learning engineers have been coming from academia, where best practices like version control are still a foreign concept. So we need to spend time educating on that, and we need to really help them think about how to grow and scale and create reproducible analyses for their org. In the same way you think of selling to developers, where you don't have to convince them of version control, here you need to teach what version control is and convince them that they need version control.