From YouTube: Data Team Group Conversation Preview: 2020-08-13
Description
Preview session for the upcoming Data Team Group Conversation livestream scheduled for 2020-08-13. Hear about the Data Team's responsibilities, objectives, and direction from Rob Parker.
https://docs.google.com/presentation/d/1N9Jq4DcLWhkJKLw9tHFr7UIwbZUUR1q_euS9Ez65moQ/edit?usp=sharing
Hi everyone, my name is Rob Parker. I run the data team; I'm the Senior Director of Data and Analytics. Thanks for joining today's group conversation covering the data team. We all have very important questions we want to answer: questions focused on sales and sales performance, on customers and customer performance, or questions that span our business functions and divisions.
Many of these questions are very important to helping GitLab grow, but they're very difficult, if not impossible, for us to answer today. If we think about the nature of these questions, what they're really driving toward is two major business workflows that occur at GitLab. The first is lead to cash and operations: questions around sales performance, or Marketing's performance in driving new qualified leads through the marketing funnels. The second is around our product: how is our product performing?
What does our adoption of feature X or Y or Z look like? Are our customers consuming our features per their purchased quotas and subscription amounts? All of these questions are critical to moving GitLab forward, but we also need to be able to trust the data we're receiving. We don't want to make a thousand-dollar, ten-thousand-dollar, or million-dollar decision based on bad data, so the data needs to be accurate and reliable. We of course want to drive toward a single source of truth, so that we don't get different answers from different sources.
The data needs to be organized and labeled, and we're all looking for that self-service capability, data democratization, where everybody in the company has access to the data when they need it. All of these are problems the data team is working on. We're part of the Finance organization, and we build out and work on solutions like key performance indicators.
As an example of where we are today, we're at level one, and some of the things you might expect to be easy for us are actually not easy for us. Recently, in Q2, we had the opportunity to work on a major cross-functional initiative to generate email lists for handover to Marketing, for communicating updates to our customers.
These are fairly large lists; we're talking about millions of records, so these aren't lists that are easy to just generate in a spreadsheet. Plus, the lists contain sensitive data, so we have to handle them with encryption and security in mind. For us, generating these three initial lists took over 15 person-days; with a well-designed dimensional model in place, which is the direction we're heading, requests like this become far simpler.
Both of these solutions are going to be based on a set of new data models we're building that are very much subject-focused. It's very difficult to take a large database that contains over 600 tables and 98 billion rows of data and just say, "Here, take this away, have fun with it." So the approach we're taking is to focus on specific subject areas, such as product, geolocation, or customer segments, provide very robust training materials for that content, and then roll those out separately.
So let's dive in here for a second: how do we actually turn 98 billion rows of data into a self-service offering? What we've chosen to do is create a reference solution; there's a link here, in our handbook, that you can take a look at. We'll establish the reference solution as the standard, and over the course of Q3 we're going to build out two brand-new self-service, subject-area-focused solutions to this standard, deliver them to over 25 self-service team members, and take a look at the results.
All of our data flows through similar paths: we land data into our environment, and we organize everything toward the ultimate destination of the dimensional model. This is how we achieve a single source of truth. If all data is organized the same way in this dimensional model, it supports self-service through the dashboard development capability in Sisense, because it all looks the same: you expect customers to be in the customer table, and you expect product lists to be in the product table. That's the way we organize it.
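The "everything lives in its expected table" idea described above is a star schema: facts joined to conformed dimensions. As a minimal sketch (the table and column names here are illustrative placeholders, not the team's actual models), a Sisense dashboard tile and an analyst writing raw SQL would both answer a question like "charges by customer segment and product" through the same model:

```sql
-- Hypothetical star-schema query: one fact table joined to two dimensions.
-- Whether issued by a BI tool or typed by hand, the same dimensional model
-- yields the same answer -- the single-source-of-truth property.
SELECT
    dim_customer.customer_segment,
    dim_product.product_name,
    SUM(fct_charges.charge_amount) AS total_charges
FROM fct_charges
JOIN dim_customer ON fct_charges.customer_key = dim_customer.customer_key
JOIN dim_product  ON fct_charges.product_key  = dim_product.product_key
GROUP BY 1, 2
ORDER BY total_charges DESC;
```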
On top of that, self-service through SQL analysis also drives through the same dimensional model. So whether you're querying the data through Sisense or through SQL, against this dimensional model you're going to get the same results. On top of this entire business flow, we're releasing a framework we call the Trusted Data Framework, which builds business-friendly data validations in all along these business processes. If we expect a very important customer to exist in the final dimensional model, with a certain set of criteria, we can build that in as a test.
A
If
we
expect
a
certain
number
of
rows
to
be
integrated
from
zora
during
our
normal
zorro
refresh.
That
also
is
implemented
as
part
of
our
trusted
data
framework.
Over
time
we
build
out
these
full
suite
of
trusted
data
tests
and
we
build
out
the
notion
of
having
improved
capabilities
and
test
assertions
built
across
our
entire
stack.
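Both expectations above, a must-exist customer and a row-count threshold after a Zuora refresh, fit a common data-testing pattern (the one tools like dbt use): each test is a query that returns rows only when the expectation fails. A sketch under assumed, hypothetical table and column names:

```sql
-- Trusted-data-style checks: each query returns rows ONLY on failure.
-- All table/column names below are illustrative placeholders.

-- 1. A very important customer must exist in the final dimensional model
--    with the expected attributes.
SELECT 'missing_vip_customer' AS failed_check
WHERE NOT EXISTS (
    SELECT 1
    FROM dim_customer
    WHERE customer_name    = 'Example VIP, Inc.'
      AND customer_segment = 'Enterprise'
);

-- 2. A normal Zuora refresh should land at least a plausible number of rows.
SELECT 'zuora_row_count_too_low' AS failed_check
FROM zuora_invoice_items
HAVING COUNT(*) < 1000000;
```

A scheduler can then run the whole suite after each refresh and alert on any query that returns a row.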
The way we actually deliver this is through what we call fusion teams. Fusion teams are a new concept for the data team that we've just rolled out in early Q3, but they're really the vehicle for us to create these very business-friendly solutions. If you looked at the data team in the past, we were organized in silos: we had a finance team member and an engineering team member. But the way the business works, if you remember, is across those two major business flows: lead to cash and product release.
Of course, like all GitLab teams, we are driven by our OKRs. We have a variety of OKRs we're focusing on in Q3. The call-out here around our self-service solutions is OKR 2: certifying 25 team members in self-service by the end of the quarter. The data team is really excited to work on this, and we believe we're going to have a very robust self-service solution. We invite any GitLab team member who's interested in this to let us know; you can add a comment to the doc.
Thanks for listening. You can contribute in a variety of ways: you can provide feedback directly in our Google Doc about this content, or you can jump into any of our Slack channels and ask questions relevant to this. We really invite your feedback and comments. Data is not a one-team solution; it's an everyone solution. Everybody has a stake in making data a reality, making self-service a reality, and helping to scale GitLab's data acumen. Thanks for your time, and see you at the livestream.