GitLab Group Conversation, 14 Jan 2021

From YouTube: Meltano Group Conversation (Public Livestream)

Description

Meltano: https://meltano.com/

Slides: https://docs.google.com/presentation/d/1A4PnW8WCl16mjBtz7gvXfC1tKvt9Syzk3wsehtGplDQ/edit

A

All right, we're live on YouTube. This is the public livestream for the Meltano group conversation. My name is Douwe Maan. I am the general manager for the Meltano project at GitLab and also the project lead for the open source project. Today I'm going to talk with some of my GitLab colleagues, who I hope will have some questions about what's been going on at Meltano over the last two months or so, since the last group conversation.

A

In the document, and also in the notes for this YouTube livestream and video, there will be a link to a slideshow, and the floor is open for questions from the GitLab team.

A

Francis, you mentioned you had some questions already. I don't see them in the doc yet. Oh, there they are. Okay, question one: do you want to verbalize it?

B

Sure. I'm wondering what the end game is: how does Meltano fit into GitLab's broader strategy? Is it ever going to become revenue generating?

A

Yeah, good question. For context, Meltano currently is a separate business unit within GitLab that is being run as an internal startup, and we have the same aspirations for it that we did for GitLab in the early days, when Sid and others realized that this open source project called GitLab was pretty promising and that there might be the potential to build a business around it that could start making money on an enterprise edition, support contracts, a hosted version, et cetera. With Meltano we're hoping for the same.

A

We're trying to build a really compelling open source solution for data engineering and ELT. Then, as the community grows, the critical mass builds, and the momentum keeps going up and to the right, we hope that about a year from now there will be a point at which enough of our users start clamoring for some kind of paid solution around Meltano.

A

That could be an enterprise edition with additional functionality, a hosted version, or some way of providing professional services, which of course is relevant in the ELT space because there are so many connectors for data sources. Or it could be some other way of making revenue that does not line up with something we've done with GitLab already.

A

We are seeing that we're building a platform: data consultancies, individual data teams at startups, and teams building data tools are starting to use Meltano for the ELT bits of their platforms. Since we're building a platform where, ultimately, the revenue being made around Meltano might be bigger than the revenue being made by the Meltano team, this platform aspect might also open up new opportunities for monetizing things.

A

But GitLab is not just building Meltano out of the goodness of its heart. As much as we think that the world deserves a tool like this, we also hope that this has the potential of becoming a GitLab 2.0, where over the coming years it will grow into its own business unit, hopefully with millions of dollars of revenue if we are looking two, three, four years into the future from today.

B

Awesome. I have the next question too, which is: I noticed a big burst in Slack activity in January. Is that just because the holidays were ending, or was there some publicity push or some other trigger?

A

I think you're referring to the graph that I have in the slides, which shows that in January the activity went up. But to be clear, the activity went up after having gone down, and we still haven't gotten back to the same level we were at before we went down for the holidays.

A

What I'm seeing is that most people use Meltano as part of their jobs, because they're data engineers either at a consultancy or in a data team, which means that for most of them there just wasn't much reason during the holidays to use the tool or to check in with questions on Slack. So we haven't done any kind of marketing push or push for new users.

A

This was just people getting back to work and starting to play around with Meltano again. I do hope that over the course of the month I'll be able to spend a little more time on our website, documentation, and other things that will hopefully increase the number of new users finding out about us and joining Slack. But yeah, the activity you saw is just people coming back from the holidays.

B

Got it. My last question is kind of technical, but I'm curious, because you switched to Poetry, how that compares to pip or other dependency management frameworks in Python.

A

Yeah, so I'm not the expert, and this is actually a really great contribution by Tobias Macey, the host of the Data Engineering Podcast as well as the Python "init something" podcast, so he's kind of an expert as far as I'm concerned. He proposed bringing in Poetry for better dependency and build management. Pip by itself is a package manager that you can use to install a package from the PyPI index, but it does not actually help with reproducible environments.

A

It doesn't have a lock file, for example. Everyone at GitLab who is used to the Ruby ecosystem knows that there's going to be a Gemfile.lock, which has fixed, pinned versions for every dependency. This doesn't exist with pip. There are a couple of tools that you can use to build something like a lock file with requirements.txt, but that is not automatically used in all environments.

A

Poetry adds more mature build and dependency management tooling around pip, which I think is still ultimately responsible for installing the individual packages. Poetry also manages the virtual environments automatically for the project that you're installing, in our case Meltano and all of its dependencies.

A

That's about all I know. In general, when I got into the Python ecosystem a year and a half ago or so because of Meltano, I felt that the state of the dependency management and packaging tooling was just not quite what I had gotten used to from Ruby. But I'm happy to see that things have been improving over the last year and a half.

A

Bringing Poetry into Meltano has definitely made it easier for developers, as well as users and our CI environment, to have a reproducible, consistent set of dependencies and versions.
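
The lock file idea described in this answer can be illustrated with a small sketch. This is a toy model only, not how Poetry or pip work internally: the package names (`pkg_a`, `pkg_b`), the version lists, and the helpers `resolve` and `install_from_lock` are all hypothetical and exist just for illustration. It shows why resolving loose requirements at install time can give different results on different days, while installing from pinned versions does not.

```python
from typing import Dict, List, Tuple

Version = Tuple[int, int, int]


def parse(version: str) -> Version:
    """Turn '2.5.1' into (2, 5, 1) so versions compare numerically."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def resolve(minimums: Dict[str, str], index: Dict[str, List[str]]) -> Dict[str, str]:
    """Pick the newest available version >= the declared minimum for each package."""
    pinned = {}
    for name, minimum in minimums.items():
        candidates = [v for v in index[name] if parse(v) >= parse(minimum)]
        pinned[name] = max(candidates, key=parse)
    return pinned


def install_from_lock(lock: Dict[str, str], index: Dict[str, List[str]]) -> Dict[str, str]:
    """Install exactly the pinned versions, ignoring anything newer in the index."""
    for name, version in lock.items():
        if version not in index[name]:
            raise LookupError(f"{name} {version} is no longer available")
    return dict(lock)


# Hypothetical project requirements: "at least this version".
declared = {"pkg_a": "2.0.0", "pkg_b": "7.0.0"}

# Hypothetical package index on two different days.
index_monday = {"pkg_a": ["2.0.0", "2.5.1"], "pkg_b": ["7.0.0", "7.1.2"]}
index_friday = {"pkg_a": ["2.0.0", "2.5.1", "2.6.0"], "pkg_b": ["7.0.0", "7.1.2", "8.0.0"]}

lock = resolve(declared, index_monday)               # written once, committed to the repo
unpinned_friday = resolve(declared, index_friday)    # what a fresh, unlocked install picks up
locked_friday = install_from_lock(lock, index_friday)

print("lock file:      ", lock)
print("unlocked Friday:", unpinned_friday)
print("locked Friday:  ", locked_friday)
```

In this model the developer, the user, and the CI environment all get the same versions as long as they install from the committed lock, which is the reproducibility property the answer attributes to a lock file.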

A

Cool. Taylor, you've got the next question.

C

Sure. We're starting to explore the area of ML and AI in our new Learn stage, which I can put some links into here. Have we seen Meltano being used as part of AI and ML workflows? ELT is such a key part of getting your data together in a reasonable form and shape to then input it into data science models. I suspect there's a go-to-market here. Have you seen that?

A

That's a great question. Like you say, ELT is a crucial part of the ML/AI pipeline, because you've got to get your data out of source systems into some kind of consistent format for more processing to happen on it. But ELT is just as much part of a data pipeline meant for analytics or for other kinds of data science practices at the end.

A

Since we are mostly attracting the data engineers who manage the ELT stack for their teams, I don't always have insight into what gets done with that data down the line once the data engineering ELT step is done, but I know that there are at least a few users who are using it with ML/AI workflows. One example is Tobias Macey, who I mentioned earlier, who contributed the Poetry integration. He works at MIT, and he shared with me that they are using it to do, I think, some processing of data about courses, pulling the data out of some course platform.

A

There are also other users who have shared that they're using it for ML/AI. But again, the data engineers responsible for setting up ELT are usually not the same people as the data scientists or AI experts actually using the data down the line, so they don't really have a reason to share it with me, and usually what happens with the data down the line doesn't have a big impact on what the ELT pipeline looks like.

A

So it's kind of nice, in a way, that we are at this early step in the pipeline, where Meltano has a role both in the booming ML/AI field and in more traditional analytics and business intelligence use cases, because in all of these cases the first step of any data pipeline is to get the data out of source systems, and that's what we're trying to help out with.

A

So at GitLab we are, like you said, setting up more ML/AI. I don't remember what team you referred to, but either way, our data team at GitLab is already using Meltano today, presumably mostly for our own analytics and BI users.

A

If you're going to set up an ML/AI workflow, then definitely talk to them, because as far as the data engineering and ELT side of things is concerned, they will probably be able to help out, and it would be a shame if you duplicated the work. They have experience with Meltano already, and they have contracted with a consultancy to help build the custom connectors for the data sources we have. I'm sure that they would also be happy to support whatever ML/AI needs we have that Meltano can help out with.

A

Does that answer your question, Taylor?

C

Yep, perfect.

A

Any more questions? I'm seeing a lot of people here with their microphones muted and their cameras disabled, but one of you must be curious enough to wonder about something else to do with Meltano.

D

Hey, this is Chester. I noticed that the key personas of folks that work with Meltano are data folks who probably work with the plumbing and the data quality, and with figuring out how to extract data and load it into a different source. So at any point, do they care about visualizing that data? I recall that at one point Meltano did have a front-end data visualization component, but that wasn't explored, or at least it didn't mature fully. So currently, how does Meltano work with a front-end client, or with folks that want to visualize the data?

A

Yeah, I think that's a great question, and there's the historical perspective there. As you mentioned, Meltano previously had an Analyze UI for basic analysis: point-and-click query generation and graph visualization.

A

That functionality still exists in the Meltano UI, but it's explicitly not our priority right now. We had this vision of building an end-to-end data platform, where a single open source tool could be used by an entire data team to do everything from getting the data out of the source systems, to transforming it and modeling it so that you can interact with the data from the UI, to building those reports and dashboards.

A

We realized that this entire story very much depends on that first step being viable: that people could actually expect these open source ELT components to be good enough to get the data through those stages. We were seeing that the entire platform, especially what we offered in the beginning, just wasn't there yet for people to even make it to that final stage successfully.

A

So in order to get the EL, the extract and load bits of Meltano, to the place where a data team would actually want to use it in production, and use it to replace whatever proprietary solution they might be using today, we've got to attract the data engineers who will actually help us make that a reality, the ones who will actually contribute to these extractors and loaders.

A

These are going to be more technical data engineers who have experience setting up data pipelines and pulling data out of systems at scale, and they're going to have to help build that ecosystem of high quality connectors. Only when that is done will we be able to start looking at the next step in the pipeline, because of course in any pipeline every next step depends on the previous one.

A

In the long term, we still hope that Meltano will offer the full suite of tools that a data team needs to do anything from pulling the data out to visualizing it, going from data to dashboard, as we call it. But for the time being the focus is very explicitly on the earlier stages, and we are intentionally de-emphasizing the fact that Meltano can do anything for the later stages.

A

The people we need to attract at this point, and the people we're really trying to build a tool for, are the more technical data engineers, rather than the data analysts who expect the engineering problem to have been solved already. So we're working with those data engineers to make sure that's the case before getting into the later stages.

A

Right now, people would typically use Meltano for ELT, so the data ends up in a data warehouse: Snowflake, or something simpler like Postgres, or BigQuery. You have so many options at this point. Then they would use other tooling down the line to actually connect with that data warehouse and do the visualization, and Meltano is totally happy to not be involved there yet.

A

Over time we will hopefully be able to integrate better with open source visualization tools. Since Meltano is the project layer, where you just have a repository that defines all of the plugins you're using in your data project, Meltano is able to manage the configuration of the different components, so it can also auto-configure certain components down the line based on choices made earlier.

A

We're already using this to automatically configure dbt, which is a transformation tool, to connect with the same database that the loader actually loaded the data into. You can imagine that we might also, at some point, add Metabase or Redash or Superset as a visualization plugin, which we can then automatically configure to look at the right database, the one that Meltano knows the loader and transformer loaded and transformed data into.

A

Then the value that Meltano brings to a team, even if it's not with our own visualization functionality, will still be that they have a consistent way of interacting with these different components, and that the components are tied and glued together in a more streamlined way than if you had to manually set up every tool and try to integrate them.
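
The auto-configuration idea in this answer, where later pipeline steps reuse the connection details chosen for the loader, can be sketched roughly as follows. This is only an illustration of the principle and does not reflect Meltano's actual configuration format or API: the `project` dictionary, the plugin name, the setting keys, and the helper `downstream_connection` are all assumptions made up for this example.

```python
from typing import Dict

# Hypothetical set of settings that describe "where the data lives".
CONNECTION_KEYS = ("host", "port", "user", "dbname", "schema")


def downstream_connection(loader_config: Dict[str, object]) -> Dict[str, object]:
    """Derive connection settings for later steps from the loader's settings.

    Loader-only options (for example a batch size) are deliberately left out.
    """
    return {key: loader_config[key] for key in CONNECTION_KEYS if key in loader_config}


# A hypothetical project definition: the user configures the loader once.
project = {
    "loader": {
        "name": "hypothetical-postgres-loader",
        "config": {
            "host": "localhost",
            "port": 5432,
            "user": "warehouse",
            "dbname": "analytics",
            "schema": "raw",
            "batch_size": 1000,
        },
    }
}

# The transformer and a visualization tool are pointed at the same database
# without the user repeating the connection details.
transformer_settings = downstream_connection(project["loader"]["config"])
dashboard_settings = downstream_connection(project["loader"]["config"])

print(transformer_settings)
```

The design point is that the project definition is the single source of truth, so a change to the loader's target database propagates to every component that reads from it.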

D

Okay, yeah. So the focus is really on, I guess, a metric that you could say you're using to assess how well that data engineering pipeline is evolving: that you're reducing the amount of time it takes to extract data from one source and load it to another, and to configure that pipeline in a much quicker fashion.

A

Well, what we're finding today is that what we're optimizing for is not so much the speed of getting that pipeline up or the speed of actually running it. If you have an ELT pipeline, the first step is getting the data out of source systems. If it's just a database, that's easy: you can write a connector for a couple of databases.

A

But there is, of course, an insane number of SaaS APIs out there, SaaS tools that users would want to pull data out of, and the proprietary data integration and ELT platforms have their own libraries of connectors. Because these libraries are so big, and because data engineers and data teams often don't want to take on the burden of maintaining their own connectors, the first thing we need to do to make open source ELT viable is to work with the community to expand our library.

A

That way more and more users will be able to look at Meltano and its database of supported connectors, which in our case are all of the Singer taps. Singer is the specification and framework for open source connectors.

A

We want to get to a place where more and more data teams will look at our library of connectors, look at the proprietary libraries of connectors, and come to the conclusion that Meltano is actually a viable option for replacing those proprietary options. Until we get there, anything else we offer down the pipeline is not really going to be compelling, if they cannot even use us for the first step.

A

So the focus right now is on building tooling that allows data teams to work with the existing connectors in this ecosystem of Singer taps, and also to build new high quality taps for new data sources and new APIs, using a framework that we're building, the Singer SDK, which will then be community maintained. We are seeing a lot of interest, especially from data consultancies, who currently have a habit of just telling every one of their clients, "Okay, you're going to have to shell out for this big proprietary data integration platform."

A

But they, of course, are interested in banding together and collaborating on this open source suite of connectors. We are building the tooling to allow these data teams to collaborate on those, and we're building the tooling in Meltano to run and deploy them at production quality and production scale. So the whole vision of open source end-to-end data tooling really depends on there being a viable story for the first step in that pipeline.

A

Our focus is to get that done, and we're doing that by trying to attract data consultancies and data engineers at individual startups who are willing to try out the existing open source connectors, fix them if they run into bugs, and build new ones for new data sources.

A

We are especially seeing interest from consultancies and companies alike in Africa, Southeast Asia, and Latin America, because these are the areas in which the prices that these US-centric vendors charge are often prohibitively expensive. For them it's easier to ask one of their team members to spend a few hours building a tap themselves than to just pay for it, while of course for a lot of US companies it would be the opposite, because the prices are less prohibitive for them.

A

So we're building this really great community of people who have more time to spend than money, basically, and who really believe in building this open source, free, community-maintained library of connectors, so that down the line no one will have to pay for this anymore and it will truly become a commodity. That's our goal.
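
For readers unfamiliar with the Singer connectors mentioned in this answer, the following is a minimal sketch of what a tap does. Under the Singer specification, a tap writes JSON messages, one per line, to standard output: SCHEMA messages describe a stream, RECORD messages carry rows, and STATE messages carry bookmarks for incremental runs. The stream name, the sample rows, the `last_id` bookmark, and the function `tap_messages` are assumptions for illustration; this sketch uses only the standard library and is not the Singer SDK discussed above.

```python
import json
from typing import Dict, Iterable, Iterator


def tap_messages(stream: str, rows: Iterable[Dict[str, object]]) -> Iterator[str]:
    """Yield Singer-style messages (one JSON document per line) for one stream."""
    schema = {
        "type": "object",
        "properties": {"id": {"type": "integer"}, "name": {"type": "string"}},
    }
    # SCHEMA comes first so the target knows the shape of the records.
    yield json.dumps(
        {"type": "SCHEMA", "stream": stream, "schema": schema, "key_properties": ["id"]}
    )

    last_id = None
    for row in rows:
        yield json.dumps({"type": "RECORD", "stream": stream, "record": row})
        last_id = row["id"]

    # STATE lets a later run resume from where this one stopped.
    yield json.dumps({"type": "STATE", "value": {stream: {"last_id": last_id}}})


# Hypothetical source data standing in for an API response.
rows = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
lines = list(tap_messages("users", rows))

for line in lines:
    print(line)  # a real tap writes these lines to stdout for a target to consume
```

Because every tap speaks the same line-based format, any loader that understands it can be paired with any tap, which is what makes a shared, community-maintained library of connectors practical.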

D

Awesome, yeah. Thanks for explaining that. It makes sense.

A

Cool. If any more questions come up, I'm going to give you a few seconds to interrupt me. And if not, well, I'm seeing people are already dropping off. So thank you so much, everyone, for your questions. I'm looking forward to giving another update a few months from now. Have a great day, everyone.