From YouTube: Scaling Pipelines with Tekton - Andrea Frittoli, IBM
Description
Scaling Pipelines with Tekton - Andrea Frittoli, IBM
Speakers: Andrea Frittoli
There are many dimensions to scalability in the context of continuous delivery (CD). Authoring, maintaining and executing tasks, pipelines and complex CD workflows all have their flavour of scalability challenges. In this talk, we discuss how Tekton tackles each of those areas. We present the scalability challenges we faced when using Tekton to continuously deliver Tekton itself, and the solutions we built and continue to develop to make tasks and pipelines practically reusable, maintainable and scalable at runtime.
For more Continuous Delivery Foundation content, check out our blog: https://cd.foundation/blog/
We have tools like the CLI, called tkn, a nice dashboard, and an operator to manage the installation and upgrades of Tekton. For discovery of tasks we have the catalog and the hub, which we'll discuss in more detail later, and we also have several add-ons, like the Results project, Chains, and multiple experiments that we are running.
So, let's dive into authoring with Tekton. Let's say that you want to automate a certain task: in Tekton you will use some steps. Steps are basically off-the-shelf containers, container images that you can enhance with small scripts in any programming language that you might find useful. A task is then a sequence of such steps that solves a specific problem, like running unit tests against a repository, building a container image, or deploying an artifact somewhere.
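A minimal sketch of what such a task might look like; the name, image, and script here are illustrative, not from the talk:

```yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: unit-tests            # hypothetical name
spec:
  params:
    - name: package           # input parameter with a default
      type: string
      default: "./..."
  steps:
    - name: go-test           # an off-the-shelf image enhanced with a small script
      image: golang:1.17
      script: |
        #!/usr/bin/env sh
        go test $(params.package)
```

Each step becomes a container in the task's pod; the `script` field is how small snippets in any language get attached to an off-the-shelf image.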
So if you work as part of a team, in an open source project or in a large organization, then it makes sense for these tasks to be shared across the organization, which allows you to distribute the maintenance work on them. It allows you to scale the work of writing and maintaining tasks, so that each team does not have to rewrite their own and reinvent the wheel continuously.
These tasks are not only stored in git in the catalog, but they are also searchable, both via a web UI called Tekton Hub, as well as via an API, which then allows integration with the Tekton CLI, for instance, for searching the hub for tasks or installing them directly into your cluster from the catalog.
One aspect to keep in mind as well is portability. This is an area that we are still exploring: we do have support for multiple architectures in Tekton, for running the Tekton controller on top of different architectures, but it's important as well to have tasks that use container images, and that include scripts, which can run on multiple architectures. And you may remember, when we introduced Tekton, that we talked about tasks, but also about pipelines.
Pipelines are a combination of tasks that can be run sequentially or in parallel. But can a pipeline be reused or shared? Well, if you think of a short pipeline like the one highlighted in green, you might have a very common sequence where you clone a repository, you build a container image out of it, maybe you sign it, and then you push it to a container registry. This kind of pipeline can be used as a reusable block, and you could use things like conditionals, which are supported by Tekton, or optional inputs to make the pipeline even more portable. If you look at this picture, for instance, the clone-build-push pipeline is great, but in this case you might only want to have build and push, because you want to clone the repo once and then use it to run multiple things, like build-and-push and test. So having a conditional there will help for large pipelines.
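A sketch of such a reusable block with an optional input and a conditional, using the `when` expressions Tekton supports; the pipeline name and parameter are illustrative (`git-clone` and `kaniko` are catalog-style task names):

```yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: clone-build-push          # hypothetical name
spec:
  params:
    - name: skip-clone            # optional input: "true" when the repo is already available
      type: string
      default: "false"
  tasks:
    - name: clone
      when:                       # conditional: skip the clone task entirely
        - input: $(params.skip-clone)
          operator: notin
          values: ["true"]
      taskRef:
        name: git-clone
    - name: build-and-push
      runAfter: ["clone"]
      taskRef:
        name: kaniko
```

The optional parameter plus the `when` expression is what makes the same block usable both standalone and inside a larger workflow.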
So, as I said, you might have teams that are responsible for environments which are touched by different parts of your pipeline. They might need to run in different namespaces, or even in different clusters, or on different platforms. Maybe some part of your pipeline is in Tekton, but other parts of the pipeline may already run on other platforms.
So it's a good idea to break down the pipeline into multiple pieces. You can have small pipelines, like the one that I mentioned, clone-build-push, and then every team can own the pipeline they're responsible for. But then you still want to have one workflow, so how can you achieve that? We have one experimental feature which allows you to run pipelines in pipelines, so you can have a pipeline where you trigger other pipelines, and that is a way to compose pipelines together.
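The experimental feature treats a whole sub-pipeline like a task via the custom-task mechanism; the exact API has evolved, so treat this as a sketch with hypothetical names:

```yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: main-workflow              # hypothetical composed workflow
spec:
  tasks:
    - name: clone-build-push       # one team's whole pipeline, referenced as a custom task
      taskRef:
        apiVersion: tekton.dev/v1alpha1
        kind: Pipeline
        name: clone-build-push
    - name: deploy                 # another team's pipeline, composed after the first
      runAfter: ["clone-build-push"]
      taskRef:
        apiVersion: tekton.dev/v1alpha1
        kind: Pipeline
        name: deploy-pipeline
```

Each team keeps ownership of its own Pipeline resource, while the top-level pipeline expresses the single overall workflow.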
Another option is cloud events, which basically allow you to trigger pipelines asynchronously. And this takes us to the next part of the presentation, where we talk about running pipelines with Tekton. You may start your pipeline by hand, creating a TaskRun or a PipelineRun, but usually you will want to run your pipeline as the result of an event, and this is where the Triggers project comes into play.
An example of an event could be a pull request, or an image published to a container registry, or something stored in object storage. Triggers has several components: the EventListener, which is basically the sink that receives the events; filters, or interceptors, that allow you to define rules on whether or not to accept events, and maybe alter the body or add something to it; and then bindings and templates, which allow you to extract information from the event via the bindings and then inject it into resources through the templates.
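Put together, a trigger might look roughly like this; the resource names and the pull-request event shape are hypothetical:

```yaml
apiVersion: triggers.tekton.dev/v1alpha1
kind: EventListener
metadata:
  name: ci-listener                 # the sink that receives events
spec:
  triggers:
    - name: on-pull-request
      interceptors:                 # rule: only accept opened pull requests
        - cel:
            filter: "body.action == 'opened'"
      bindings:
        - ref: pr-binding           # extracts values from the event body
      template:
        ref: pr-template            # instantiates a PipelineRun with those values
---
apiVersion: triggers.tekton.dev/v1alpha1
kind: TriggerBinding
metadata:
  name: pr-binding
spec:
  params:
    - name: revision                # pulled out of the event payload
      value: $(body.pull_request.head.sha)
```

The referenced `pr-template` would be a TriggerTemplate whose resourcetemplates stamp out the PipelineRun using `$(tt.params.revision)`.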
So you could have a team that handles the production environment, that takes care of the pipeline on the right-hand side, which is using Keptn as a platform. And then you could have a QA team, which is responsible for tests running on the staging environment, and they might want to add more tests dynamically. They can do that now: if they listen to the cloud events generated by the initial pipeline, they can add more tests without having to change the overall pipeline.
They just need more triggers, so to say, that trigger the different tests. So breaking down the pipeline might help with reusability of pipelines, and it also allows heterogeneous parts to work together.
Still, there are some issues in this kind of setup. One is the overall visibility on the flow of the workflow: at the moment it's broken down, and the different parts are integrated through events.
You will need some tooling to be able to see what's going on from the beginning to the end of your overall workflow, which might span different platforms. The other issue is that, because you might have different platforms, they may understand different kinds of events. So you may have Tekton generating cloud events with a certain payload; then you need to make sure that Keptn or Jenkins understands that payload, and understands what these events are about, to solve this kind of problem.
So is this a scale issue, because of the Kubernetes nature of Tekton? Because Tekton runs on top of Kubernetes, resources are reconciled, which means that every time a reconcile happens, the DAG needs to be recalculated, and if you have hundreds of nodes and connections that could be computationally heavy. But in fact it is not, and it does not pose a scale issue. We had some issues in the beginning, but we optimized the code for building the DAG, and now we are happily running very large pipelines with no significant overhead.
The other aspect of scale at runtime is running many pipelines in parallel, and there are a few things that Tekton does here to make sure we don't run into scaling issues.
Secondly, we allow for scaling up all of our controllers. We use a mechanism called leader election, so even for cases like the Tekton controller, where there are resources that need to be reconciled, we can scale up and have multiple instances of the controller. They will not step on each other, and they are able to share the load in terms of utilization of cluster resources.
When we have multiple pipelines running in parallel, we rely on the Kubernetes scheduler. Things that could be done, and that we have done, include throttling the number of pipelines that you execute: this is not supported natively by Tekton, but it can be implemented.
Also, it is possible to optimize the execution runtime, that is, the way that tasks are mapped into pods. If we make that optimal, we can reduce the amount of resources used on the cluster and then scale better when we have very large pipelines, or many pipelines running in parallel. An option that we also considered is writing a custom scheduler for Tekton, one that would be optimized for the way Tekton utilizes Kubernetes resources.
In practice, we use Tekton for Tekton's own CI/CD, so we run thousands of tasks a month in this kind of environment. We also run a service at IBM on top of Tekton, a Tekton as a service, which runs millions of containers per month with no issue at all. In that kind of environment we use throttling, as we do upstream, and for security too: we don't want to allow people to create pipelines without any limitation. Another issue that we see when running multiple pipelines is cluster pollution, where many resources are created.
The execution history stays on the cluster, and that creates problems over time. It's for these reasons that we created the Tekton Results project, which allows storing the execution history off cluster, and eventually it will allow cleaning up resources as well, once they are stored in the off-cluster database.
Tekton is cloud native and it's scalable by nature. Things that can help solve some specific problems that we've identified are execution throttling, which can be used to limit the number of pipelines running in parallel, as well as the runtime optimizations we are working on to resolve the issue of pod creation overhead.
So, in terms of roadmap from a scaling point of view: we continue to dogfood Tekton, that is, to run Tekton at scale for the CI/CD of Tekton itself, and we have a few features that we are working on. Besides pipelines in pipelines, which I mentioned, there is TEP-0060, which is about remote resource resolution: the idea is to extend the concept of OCI bundles and allow tasks to be stored and shared through arbitrary mechanisms.
We are working on optimizing the execution of combinations of tasks by decoupling task composition and scheduling, and we also want to make bundles more powerful by allowing you to define which bundles need to be used at runtime, as opposed to at author time, as it is today.
Hello, everyone, hopefully you can hear me. If you have any more questions, please ask them in the Q&A, and I will start going through some of the questions that were asked during the presentation. The first question was: must steps run in sequence? Yes, that's the model that Tekton has within a task: your steps are implemented as different containers in a pod and they are executed in sequence, so they cannot run in parallel. But we do support having a sidecar.
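A sketch of a task with a sidecar, for example a database that a test step talks to; the names and images are illustrative:

```yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: integration-tests          # hypothetical name
spec:
  sidecars:
    - name: db                     # runs alongside the steps for the task's lifetime
      image: postgres:13
      env:
        - name: POSTGRES_PASSWORD
          value: test
  steps:
    - name: test                   # steps still run in sequence; the sidecar runs in parallel
      image: golang:1.17
      script: |
        go test ./...
```

The sidecar is started before the steps and torn down when the task completes, which is how you get parallelism inside an otherwise sequential task.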
So you could have a container in the pod that runs alongside your task. And there was a question about where to learn more about Tekton. We do have some Katacoda scenarios that you can run from the tekton.dev website, as well as the documentation for all the projects, hosted on the website too. And if you dig into the different repositories, like tektoncd/pipeline, we have examples that we use for testing, but you can also use them as a reference when writing your pipelines.
I'm just putting it in the chat: we also dogfood Tekton, so we use Tekton to build and test and release Tekton. We have a repository, tektoncd/plumbing, where you can see the Tekton resources that are used for building and releasing Tekton.
Okay, so Tekton and OpenShift: OpenShift Pipelines is a distribution of Tekton that you can find on top of OpenShift. And Tekton versus the Jenkins Kubernetes plugin: the Kubernetes plugin in Jenkins, as far as I remember, allows you to run Kubernetes pods as nodes for your Jenkins execution. So if you combine it with other plugins, like pipelines or some kind of configuration as code, you could do similar things; and, of course, Tekton was born in the Kubernetes ecosystem.
How big has it been tested? In upstream we run on the order of thousands of tasks per month.
I know at IBM we have a hosted Tekton as a service and we run on the order of millions, and it works pretty well. There are other public offerings, like relay.sh from Puppet, that I know about; I don't know the scale numbers for those, but if you have interesting scale stories about Tekton, I would be really happy to hear about them.
And there was a question about DAGs: could you elaborate a bit more on the DAGs you mentioned, how do you calculate them, and what do you actually use the DAGs for? Okay, that's a great question, thanks a million. So within a pipeline we have tasks, and tasks by default would all run in parallel at once, but you may use inputs and outputs, which we call parameters for inputs and results for outputs, to create dependencies between tasks.
So you could say, for instance, that a certain task emits an output, and that output is then used as an input to another task; or you can define explicit dependencies between tasks as well, so you can say this task should only run after this other task has been executed. What the Tekton controller does is take all of these relationships, either implicit, between inputs and outputs, or explicit, and put them together, and it builds a graph, a directed acyclic graph, a DAG, and it uses that basically to schedule the tasks.
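Both kinds of edges can be sketched in a pipeline spec; the task names and the result name here are illustrative:

```yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: dag-example                  # hypothetical name
spec:
  tasks:
    - name: build
      taskRef:
        name: build-image
    - name: scan
      params:
        - name: image                # implicit edge: consumes build's result
          value: $(tasks.build.results.image-digest)
      taskRef:
        name: scan-image
    - name: notify
      runAfter: ["scan"]             # explicit edge: ordering only, no data flow
      taskRef:
        name: send-notification
```

Here `scan` depends on `build` implicitly through the result reference, while `notify` depends on `scan` explicitly through `runAfter`; together they form the DAG the controller schedules from.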
Every time a reconcile cycle is executed, it looks at the current status of the running tasks and whether there are any new tasks that could be started from the DAG. I hope that's the answer. There was another question: would you also see pipelines of pipelines as DAGs? Yeah. The way we implemented pipelines in pipelines, experimental for now, is to consider a pipeline as a specific type of task, using the concept of custom tasks. In that sense the pipeline also exposes results, and you could have the same kind of relationships between the pipeline task and other tasks. So the pipeline in a pipeline will become part of the DAG.
There is a question from Krishna: does Argo CD compete with Tekton? I wouldn't say so. Actually, they are often combined: you could use, for instance, Tekton to run a build or a CI job and then use it to trigger a deployment via Argo CD, or you could use Argo CD to synchronize Tekton resources on your cluster. So there are different ways you can combine them together.
One of the things that is characteristic of Tekton is that it's very non-opinionated, so it's pretty vanilla: it allows you to create your tasks and your pipelines in any way you want, and build your workflows in any way you want. Other solutions have a more opinionated approach that may or may not fit what you need, and because of that it's often the case that organizations will combine multiple solutions: they'll use Argo for some part of their workflow, maybe Tekton somewhere else, maybe Jenkins or Flux or Keptn or, yeah, you name it.
In fact, it's a question that we are discussing a lot in the community. There is a TEP, we call it a Tekton Enhancement Proposal, about being able to rerun a single task within a pipeline when it failed, and, yeah, we also discussed the whole concept of restarting a pipeline from a failure.
It may require some snapshotting functionality, or it might not, depending on how deep we go with the implementation, but it's definitely something that we are interested in, and you or your team would be very welcome, of course, if you want to join the community and share your use cases. You can do that by joining the working groups, or creating a GitHub discussion, or commenting on existing TEPs in the Tekton community.
So there was another question, by King: how granular is the scaling? If you have multiple pipelines for different teams, is it possible to cap the scaling for each team, for instance? Right. So right now, what Tekton does out of the box is give you the possibility to start pipeline runs right away, or to start them in pending mode. If you start them in pending mode, you can have an external component that decides when they can actually start running, and in that way you can control the number of pipelines that run in a specific namespace, for instance, or in a specific cluster, depending on your needs.
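The pending mechanism is just a field on the run; an external throttler clears it when capacity allows. The names here are illustrative:

```yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: team-a-build-run             # hypothetical run name
spec:
  status: PipelineRunPending         # held; an external component removes this to let it run
  pipelineRef:
    name: clone-build-push
```

Because the gate is per-PipelineRun, that external component can apply whatever per-team or per-namespace quota policy it likes before releasing runs.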
Okay, I hope I covered all of the questions that were asked so far, and if you have any more questions, feel free to reach out.
On Slack, I'm often on the Tekton Slack, and the slides are also available; in the slides I included a lot of links on how to connect with Tekton and with myself. There is one more question: running this kind of operation on Kubernetes could be challenging; many test tools, like Mocha and Jest for JS, parallelize test runs based on the processor core count read from within the pod, and that file contains the node's values, which will cause over-parallelism. Did you encounter this issue?
That's a very good question. I don't have an answer for that; I have not encountered this issue myself, but I would invite you to open a GitHub discussion on the Tekton GitHub, and we can continue the conversation from there. And, yeah, thanks again for joining the session, and have a great conference, everyone. Bye.