From YouTube: Introduction to Tetragon
A: All right, I think it's a good time to start this webinar, this introduction to Cilium Tetragon. Welcome to all of you who have joined. To once more go over housekeeping and logistics: all of you who have joined have been automatically muted to make the experience as interruption-free as possible. If you have questions, feel free to ask them in the Zoom chat as a message to everybody, and we will either answer them on air as we have time or answer them in the chat directly.

So let's jump in. This series will be introducing Tetragon: eBPF-based security observability and runtime enforcement.
So let's jump right in and get a first overview of Tetragon. What is Tetragon? Tetragon is essentially an agent that can run on any machine, any Linux machine. This could be a Kubernetes worker node; it could also be a non-Kubernetes node, essentially any machine, and it will use eBPF to extract security-relevant observability and also provide runtime enforcement.

It covers the virtual file system in terms of file access; TCP layers, for example to introspect TCP sequence numbers and identify sequence-number attacks; as well as the system and process execution layer. But it does not only cover the system level. It also covers applications, so we can also, for example, extract function calls or function traces, look at executed code, and so on. Very important: Tetragon is transparent, which means no code changes are required.
A
All
of
the
observability.
All
of
the
enforcement
capabilities
are
provided
completely
transparently,
the
all
the
observability
data,
all
the
policies
that
come
in
they
are
integrated
with
other
systems,
and
you
can
see
many
of
them
listed
above
metrics,
for
example,
prometheus
grafana
for
a
lot
of
the
security
relevant
events
they
will
typically
go
into
an
siem
can
be
streamed
by
a
fluency
to
auto
systems
as
well
as,
for
example,
grafana
elasticsearch,
as
well
as
open
telemetry
or
the
raw
json
output.
Tetragon is part of the Cilium project family and with that is automatically part of the Cloud Native Computing Foundation. So it is essentially an independent project from a technical perspective, an independent project under the Cilium umbrella, but it benefits from and is governed by the Cilium open source governance model.
Let's jump into why Tetragon. In terms of runtime security and security observability, what is needed, and why we created Tetragon, is that security has to be done in real time. So when we protect workloads that are running, we need to be able to detect malicious activity in real time. We need to report when malicious events occur and then, even better, prevent them before they perform any damage, and we'll look at a variety of examples of how that can be achieved.
This can be done, or has been done, in a variety of different ways in the past, so next we'll cover why we have created Tetragon by looking at existing solutions and existing approaches and then comparing them to Tetragon. This includes LD_PRELOAD, ptrace, seccomp, LSM and LSM BPF, as well as other eBPF approaches to perform this type of security.
This is probably the oldest, or one of the oldest, approaches: LD_PRELOAD, the ability to load a library into an application without the awareness of, or without changing, that application. With LD_PRELOAD we can essentially load a library that will inject itself into the application and have all the system calls that the application performs be handled by that library instead of by the kernel itself. This is called a syscall proxy or LD_PRELOAD proxy. This is great, but it can be bypassed.
Obviously, if the binary of the application is statically linked, LD_PRELOAD will have no effect: we lose all visibility, and any enforcement done there is ineffective. So this approach was essentially abandoned quickly. From that perspective, we can instead do system call checking when system calls enter the kernel, at the syscall entry. Examples of this are ptrace, seccomp, as well as eBPF kprobes or syscall-entry-based eBPF checks.
This is already massively better than LD_PRELOAD, because the application cannot easily bypass the injection, but it is vulnerable to the so-called TOCTTOU attack: time of check versus time of use. This means that the hook point, where the eBPF program or the solution sees the system call, is before the last moment at which the application can change the system call arguments, and you can see this in the picture here.
A
Essentially,
the
hook
is
at
the
entry,
but
the
system
called
handling
copying
the
essentially
the
memory
that
contains
the
system
called
arguments
is
after
this
entry
point,
which
means
the
application
could
actually
create
a
system
called
present
arguments
in
the
system
call
such
as
I
want
to
open
this
file
and
then
the
hook
point
runs
it
validates
and
afterwards
the
application
could
still
change
what
file
it
wants
to
open.
Some of you may have heard of LSM, or Linux Security Modules. This is a relatively old API itself. It allows doing Linux security checks or additional security enforcement at the right level, and it is a stable interface and a very safe place to make checks. But it is very static, and it essentially requires additional kernel modules to be loaded as additional LSM probes. Better known, or better suited, is actually eBPF LSM, or BPF LSM, which allows using eBPF to make LSM dynamic.
This is already a major step forward and actually pretty close to what we want. The problem with this is that it needs kernel version 5.7, and it is limited to the hook points that LSM itself provides. So if any additional hook points are needed, we again need to change kernel code, and the kernel requirement goes up even further.
This is essentially why we have created Cilium Tetragon. We want the same properties in terms of safety, security, and hook points as eBPF LSM, but we want to avoid the recent-kernel requirement, and we want to add additional hook points that are not found in LSM, as well as have the flexibility to have multiple eBPF programs share state with each other using maps. This is the silo or database item that you can see on the right here.
So let's jump right into observability: what is the type of observability that Tetragon can provide? And I see the first question that came in as well: what will be one of the main differences between using Tetragon and the Datadog agent running runtime security features? We can actually address that right away as we go through the observability page here. The basis of Tetragon is this agent, which uses in-kernel, eBPF-based collectors to collect a variety of different observability data types:
Process execution, system call activity, file access, TCP metadata, namespacing information, capability changes, privilege changes, data access on the storage and file access side, and a lot of different network activity visibility functions, including raw layer 3 and layer 4 as well as different protocols, thanks to eBPF's smart collector capability. eBPF has specialized map types and functions such as stack traces, ring buffers, metrics, and hash maps, so this can be done very, very effectively, and we can combine this deep visibility so you can see across the stack.
A
We
can
extract
visibility
from
lower
levels,
network
storage,
all
the
way
up
into
the
application.
We
can
combine
this
deep
visibility
with
the
transparency,
so
it's
our
app
agnostic
and
no
changes
to
the
applications
needed
so
far.
This
is
in
line
with
other
collectors
as
well.
Many
of
them
also
have
pretty
deep
visibility
where
it
where
it
becomes
unique
and
different
is
the
low
overhead.
A
You
can
see
this
smart
collector
item
in
the
kernel
portion
of
the
tetragon
piece
there
on
the
left.
All
of
the
filtering.
The
aggregation
is
done
in
kernel,
which
means
we
can
massively
reduce
the
amount
of
data,
the
amount
of
observability
data
that
is
sent
from
kernel
so
from
the
kernel
runtime
to
the
tetragon
agent.
This
is
the
arrow
in
between
kernel
and
the
bigger
being
on
top,
and
that
is
typically
the
biggest
overhand.
A
So
if
we
send
a
lot
of
observability
data
from
the
kernel
into
the
agent
in
user
space
that
will
impose
a
lot
of
foreign,
so
the
more
filtering
the
more
aggregation
we
can
do
in
kernel.
The
lord
overhead
make
a
concrete
example.
It
is
massively
more
efficient
to,
for
example,
collect
metrics
such
as
a
rate
or
a
histogram
in
kernel
compared
to
sending
individual
events
to
the
user
space
and
accounting
for
the
metric
in
user
space.
A
So
that's
the
main
difference
to
existing
or
other
collectors.
Ebpf
gives
this
like
foundation
this
framework
to
to
provide
massively
low
overhead
observability
very
similar
to
how
perth
some
of
you
might
heard
have
heard
of
the
perth
performance,
troubleshooting
and
perf
trouble
or
tracing
you
tracing
utility
that
also
uses
the
same
mechanisms
to
provide
high
performance
visibility
more
into
the
the
function,
call
and
memory
and
cpu
consumption
or
memory
and
cpu
usage
aspects.
A
Lastly,
integrations
all
of
this
visibility
is
useless
if
we
cannot
integrate
this
into
existing
systems.
What
we
currently
support
is
prometheus
profana,
a
variety
of
sims
or
sims
flu
and
d,
open
telemetry
as
well
as
elasticsearch,
but
with
the
json
export
and
in
particular,
with
the
prometheus
capabilities.
This
can,
for
example,
also
go
into
a
datadock
dashboarding
or
into
a
variety
of
monitoring
platforms
that
cloud
providers
offer.
A
First
of
all,
context
is
everything
in
terms
of
security
right,
so
we
need
to
understand
as
much
content
as
possible
and
we'll
see
that
as
we
go
into
the
examples
next,
because
the
better
the
context,
the
easier
it
will
be
to
understand
for
security
teams
in
log
files
and
the
more
accurate
the
alerts
will
be.
This
means
that,
based
on
logs
and
alerts,
we
can
quicker
and
easier
identify.
What
is
the
cause
and
what
is
affected?
So,
let's
look
at
a
couple
of
examples
and
we'll
start
very,
very
basic
and
then
go
go
further.
A
I'm
starting
with
very
basic
network
interface
metrics
like
how
much
traffic
on
what
network
interface
right,
boring.
But,
yes,
natural
con
can
do
this
as
well.
Let's
go
further
and
let's
look
at,
for
example,
tcp
latency.
This
is
already
a
lot
more
interesting,
transparently,
measuring
the
round
trip
time
for
tcp
connections
combined
with
dns
visibility.
So
we
can
see
the
round
trip
time
over
time
to
a
variety
of
external
dns,
endpoints
or
external
endpoints,
and
essentially
labeled
by
the
dns
name.
That
was
used.
A
So
we
can
see
the
latency
to
stats.profile.org
api.twitter.com,
a
variety
of
aws
endpoints,
and
so
on
already
pretty
interesting
and
all
of
things.
This
is
done
completely
transparently,
so
you
can
identify
what
connections,
what
endpoints
are
subject
to,
for
example,
higher
round-trip
latency,
but
then
also
traffic
accounting.
In this example, a dashboard shows which Kubernetes pod is egressing or transmitting how much traffic. This is in this case on a pod level, so just at the pod-name level, but this could also be annotated with the label that represents the namespace, the region, or the availability zone, so you can easily measure cross-region or cross-AZ traffic with this as well.
A
In
another
prometheus
metric
example,
we
can
look
at
tls
and
ssl
two
examples
here,
for
example,
matching
or
extracting
the
sni
name.
So
what
are
the
the
different
sni
domain
names
or
host
names?
That
connections
use,
so
we
can
easily
see
what
our
apps
or
what
host
names
or
our
our
apps
reaching
out
to
as
well
as
tls,
handshake,
so
understanding
what
connection
or
which
type
of
endpoint
network
endpoint
is
receiving
tls
handshakes.
We
could
just
annotate
this
further
with,
for
example,
tls
version
or
cipher,
we'll
see
examples
of
that.
Next, as I mentioned, all of this observability can go into a SIEM such as Elasticsearch or Splunk or something else, and then you can query it. This is an example query to detect weak or vulnerable TLS versions. We can see that we are querying all events where we have TLS information implying TLS version 1.0 or 1.1, and then also showing things like the process name, the namespace, the pod name, the SNI, the port, the IPs, the start time, the PID, and so on.
So we can get rich context while we detect weak or vulnerable use of TLS. Diving deeper into the networking side, this is an example of the networking-related events when a connection happens, so we can observe everything from DNS to HTTP and TCP. If you go from the top to the bottom, you see at the very beginning that a process is started: curl, essentially invoked with the argument cilium.io. We can then see the DNS resolution.
In this case this is a Kubernetes pod, so it will attempt to resolve a variety of different Kubernetes service names, essentially expanding the name into what could be a Kubernetes service name. These all fail, so the name does not resolve until we actually go and resolve cilium.io and see the IP returned. Then we see the connect system call, we see that this is a TCP connection, and we see HTTP here.
We can see that cilium.io actually returns an HTTP 301 to essentially redirect us to the HTTPS version, and then we see that a socket gets opened. We see the amount of traffic that was caused on that socket, both on receive and transmit. So you see a variety of different observability data here, from process execution to the DNS layer to the connect system call itself, all the way into HTTP traffic parsing. But then we can go further into the security side, for example auditing.
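To make the connect-level tracing concrete, a policy along these lines hooks the kernel's tcp_connect function. This is a minimal sketch, with the policy name chosen here for illustration and modeled on the kprobe-based tracing policies discussed in this webinar:

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: observe-tcp-connect    # illustrative name
spec:
  kprobes:
  - call: "tcp_connect"        # in-kernel TCP connect path, not a syscall
    syscall: false
    args:
    - index: 0
      type: "sock"             # decoded into source/destination address and port
```

Once loaded (for example with kubectl apply on Kubernetes), every outgoing TCP connection surfaces as an event carrying the same process and pod context shown above.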
What are all the ports that applications are listening on? We can query our entire database: tell me all the pods that are listening on particular ports. You can see the result at the bottom: we see the pods, with all their labels, that are listening on, for example, port 9080 or port 5333.
A
We
see
the
actual
binary,
so
we
see
that
in
one
case
this
is
netcat
essentially
listening
on
port
five,
three
three,
two
three.
In
other
case,
this
is
a
python
application.
We
can
also
see
who
has
been
invoking
this,
so
we
see
that
in
one
case
this
was
directly
spawned
from
a
shell.
In
other
case,
it
was
container
d
shim
we
can
detect
dns
bypass
attempt,
so
let's
say
a
pod.
Instead
of
talking
to
cube
dns
or
kubernetes
dns
attempts
to
directly
talk
to
an
external
or
outside
dns
server,
we
can.
We can easily identify such network flows and query them. In this example, we see that there was a workload with a set of labels, running in the tenant-jobs namespace, that attempted to directly talk to an egress DNS server, bypassing or attempting to bypass kube-dns, the Kubernetes DNS server. Going further, we can detect, for example, nmap (network mapper) scans, in this case filtering for a specific value in the user-agent field of an HTTP scanner. We can see not only when that scan occurred.
We see what the user agent was, but we also see what the process name was, what the HTTP parameters were, what the time was, and so on. So we have full context into when a particular HTTP nmap scan happened or occurred. Of course, moving away a little bit from the networking side, Tetragon can also do raw system call and process execution visibility.
This case is showing the raw JSON output. On the right, you can see a tracing policy, and this tracing policy essentially indicates that I want to observe all mount system calls; it also shows what type of arguments we are interested in. On the left, you can see a small subset of the full context that we can provide: obviously the process itself with the binary, the current working directory, the UID, the PID, the start time, and the pod label information.
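A tracing policy of the shape described here could look roughly like the following sketch; the name and the exact argument indices and types are assumptions on my part, following the mount(2) signature:

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: observe-mount          # illustrative name
spec:
  kprobes:
  - call: "sys_mount"          # hook the mount(2) system call
    syscall: true
    args:
    - index: 0                 # source device being mounted
      type: "string"
    - index: 1                 # target mount point
      type: "string"
    - index: 2                 # filesystem type, e.g. ext4
      type: "string"
```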
So which pod name and which namespace, the pod labels, but then also all the way into the container image: the container ID, the image, the Docker ID, as well as the entire process ancestry. This is just a very small subset of the full context that we can provide; every event contains a massive amount of context that goes along with it.
This is an example of the UI version of this. It shows, in this case, a cluster (a minikube cluster), a namespace (tenant-jobs), and a pod (crawler), and you can see the entire process ancestry tree, not only of the container itself but also of the Kubernetes control plane, including the kubelet. You can see the process executions and which process makes, attempts, or has established what network connections.
A
Those
are
the
network
connections,
so
we
can
see
that
there
is
a
variety
of,
in
this
case
a
node
app
invoking
server.js
is
reaching
out
to
an
external
ip
to
elasticsearch
and
to
api
or
twitter,
and
we
also
see
that
there
is
a
reverse
shell
that
has
been
invoked
by
a
net
cap.
This
is
the
the
line
at
the
bottom,
which
is
reaching
out
to
another
domain
like
this
blubberish,
not
a
reverse
shell,
and
we
can
see
which
individual
process
made.
This
request
from
a
networking
perspective
would
be
very
hard
to
spot.
This is showing a Kubernetes-specific example, but this functionality is actually not Kubernetes-specific in any way; this works for any process running on a Linux machine. Detecting late process execution: it is actually very common that you will have containers or workloads that run a single binary, and you want to identify which containers or workloads have had a process or a binary executed some time after the container was started. This can often reveal a compromised pod or container, because this is not what the application typically does.
A
Let's
say
you
have
like
a
single
statically,
combined,
binary
running
ass
application.
You
can
easily
rule
out
that
this
container
will
never
start
a
process
or
a
binary
like
10
seconds
or
one
minute
after
the
container
has
been
started,
so
you
can
easily
identify
hey,
let
me
know
which
containers
have
had
processes
or
binaries
started,
one
minute
or
30
seconds
after
the
container
itself
was
started.
A
This
is
very
likely
actually
reveals
a
compromised
container
or
pod,
or
some
other
malicious
intent,
monitoring
file
access
so
going
down
or
go
moving
over
to
the
storage
site.
This
is
showing
a
splunk
integration
here
that
shows
which
part
which
container,
which
workload
is
accessing
certain
files.
In
this
case,
we
are
we're
monitoring
in
couple
files,
so
such
as
etsy
password
ash
history,
shadow
file
and
we
can
see
which
part
but
also
which
process
is
accessing
what
file
and
what
is
the
file
operation?
What
is
the
operation?
They
are
performing.
A
That's
just
the
monitoring
side
of
things.
Then
we
can
go
further
and
look
at,
for
example,
network
policy,
compliance
and
look
for
what
are
what
connections
have
been
subject
to
what
policies.
So
we
can
look
at
all
the
the
allowed
connections
and
identify
what
was
the
policy
that
was
used
to
allow
this
traffic
and,
even
more
importantly,
we
can
identify
what
was
allowed
without
any
policy
at
all,
for
example.
So
we
can
clearly
we
can
clearly
validate
and
audit
whether
we
are
achieving
from
a
policy
perspective
what
we
intended
you
can
observe
http
and
grpc.
A
This
is
showing
example,
where
we
show
and
detect
cross-scripting
attempts
in
the
uri,
essentially
querying
in
this
case
splunk
with
particular
search
query
that
will
request
or
will
show
http
flows
with
just
with
the
name
script
in
the
uri.
In
this
case,
it
just
surfaced
a
simple
cross,
scripting
attempt
here
now
switching
gears
a
bit
and
go
into
the
enforcement
side.
So
we've
seen
the
full
width
of
observability
that
we
can
provide,
like
from
network
to
file
to
system
call.
A
We
can
do
enforcement
on
a
vast
majority
of
this
observability,
but
before
we
go
into
complete
examples,
a
couple
of
high-level
points
on
how
this
enforcement
works.
First
of
all,
it
is
preventive
security
to
them
to
the
that's
like
the
cornerstone,
the
the
most
important
aspect
of
tetragon,
so
essentially
preventing
malicious
in
malicious
actions
or
malicious
attempts
before
they
can
do
damage
to
the
system
or
to
application.
This
includes
the
system,
but
also
the
network,
the
file
system,
as
well
as
application
behavior.
A
It
is
synchronous
and
we'll
get
to
that.
So
it
is
essentially
doing
this
in
kernel.
In
terms
of
policy,
we
have
a
couple
of
integrations.
You
can
define
policies
with
kubernetes
crds.
There
is
a
json
api
as
well,
or
a
json
configuration
method
as
well
as
open
policy
agent
that
can
be
used
and
we're
looking
to
convert
or
looking
to
support
converting
from
existing
rule
sets.
Such
as
falco
rule
sets
or
pod
security
policies
as
well.
A
So
if
there
are
other
forms
of
intent
where
you
essentially
already
define
what
your
application
should
be
able
to
do
or
not,
we
will
look
at
supporting
them
in
terms
of
preventive
actions
from
user
space.
This
is
what
we
are
trying
to
avoid,
or
this
is
what
what
tachogon
is
not
vulnerable
to,
which
means
that
typically
systems
that
rely
on
a
observability
with
a
user
space
rule
engine
are
essentially
vulnerable
to
the
following.
A
The
part
or
application,
or
the
process
is
compromised
or
has
malicious
intent
and
performs
either
an
exploit
or
a
malicious
attempt
in
the
kernel
and
changes
behavior
or
attempts
something
maliciously
the
observability
piece
in
the
kernel.
Let's
say
it's
k-pro
based
or
it's
second
based
will
export
this
visibility
with
a
asynchronous
notification
to
the
user
space
agent
running
running
there,
and
you
have
a
rule
engine
there.
This
rule
engine
will
consume
this
observability
and
will
detect
that.
A
Oh,
this
observability
indicates
that
something
bad
is
going
on
and
will
then
kill
the
container
or
kill
the
process.
This
is
asynchronously,
so
it
happens
essentially
after
the
malicious
attempt
has
already
been
performed.
So
while
it
is
strictly
better
than
not
doing
anything,
it
can
often
already
be
too
late
in
terms
of
preventive
action.
What
tetragon
does
instead
is
doing
this
filtering
and
this
rule
engine
in
the
kernel.
So
I
think
one
of
the
question
was:
how
does
this
compare
to
falco?
A
This
is
one
of
the
big
differences
that,
instead
of
using
evpf,
primarily
from
a
visibility,
extraction,
perspective
tetragon,
does
the
filtering
and
the
rule
engine
part
in
kernel,
which
means
that,
as
it
processes
the
observability
data
in
the
kernel,
it
can
immediately
kill
the
process
it
can
be,
and
you
can
even
prevent
the
activity
itself.
So
let's
say
we
have
a
system
call
that
should
not
be
allowed.
We
will
not
allow
that
system
call
to
be
executed
at
all.
We
will
not
report
that
the
system
call
happened
and
then
kill
the
process
in
hindsight.
A
Looking
at
a
couple
of
examples
here,
this
is
an
example
how
we
can
prevent
access
to
a
sensitive
file,
for
example,
in
this
case
edc
shadow.
So
we
have
a
policy
that
essentially
the
policy
is
not
matching.
The
examples,
no
worries,
so
the
example
is
showing
how
this
this
is
done
for
to
protect
authorized
keys
file
for
ssh.
The
example
is
showing
this
for
egc
shadow
with
very
similar
use
case.
A
We
want
to
prevent
write
access
to
a
particular
file
and
tetragon
will
immediately
kill
the
process
that
attempts
to
write
to
that
file,
but
obviously
all
to
be
doing
so
just
to
open
the
file
or
read
from
the
file
and
so
on.
This
is
an
example
where
we
want
to
allow
reading
from
a
file,
but
immediately
prevent
any
process
or
a
particular
process,
or
a
particular
parts,
to
write
to
a
particular
set
of
files.
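To give a feel for the shape of such an enforcement policy, here is a minimal sketch that kills any process opening /etc/shadow. It hooks the fd_install kernel function and does not distinguish reads from writes (the demo policy shown in the webinar is more selective than this); the policy name and hook choice are my assumptions:

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: protect-etc-shadow     # illustrative name
spec:
  kprobes:
  - call: "fd_install"         # runs whenever a file descriptor is installed
    syscall: false
    args:
    - index: 0
      type: "int"              # the new file descriptor number
    - index: 1
      type: "file"             # the file object, matched by path below
    selectors:
    - matchArgs:
      - index: 1
        operator: "Equal"
        values:
        - "/etc/shadow"
      matchActions:
      - action: Sigkill        # synchronously kill the offending process
```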
We can also do things like detecting remounting of the root file system; this is an example of how that can be done using the pivot_root system call. Let me check the question in the chat: what are the available actions other than SIGKILL, if any? Obviously, there is an action that provides visibility itself, there is an action to SIGKILL, and for some of the hook points you can essentially prevent the operation or change the return code.
A
So,
for
example,
for
a
system
call
you
can,
you
can
have
the
action
say,
don't
execute
the
system,
call
and
return
with
an
error
instead.
So
essentially,
when
you
are,
or
when
you're
operating
at
a
hook
point
where
you
can
change
the
verdict,
then
obviously
you
want
to
prevent
that
photo
processing
and
just
return.
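Expressed as policy, that verdict change is an action on a matching hook. A minimal sketch, taking the mount(2) syscall as an illustrative target:

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: deny-mount             # illustrative name
spec:
  kprobes:
  - call: "sys_mount"          # a hook point where the verdict can be changed
    syscall: true
    selectors:
    - matchActions:
      - action: Override       # do not execute the syscall at all
        argError: -1           # return -EPERM to the caller instead
```

The key difference from the SIGKILL action is that the caller simply sees a failed system call; nothing is executed first and killed afterwards.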
A
Monitoring
and
preventing
capabilities
abuse,
so
this
example
is
showing
when
the
monitoring
of
capabilities
is
enabled
and
we're
seeing
here,
process
execution
that
shows
a
pod
test.
Pod
actually
using
ns,
enter
to
essentially
change
the
the
mount,
the
pid,
the
network,
the
uts
and
the
ips
name
space,
and
it
can
do
so
because
it
has
capsis
admin,
privileges
or
capabilities.
So
it
is
succeeding.
So
we
normally
see
that
the
the
ns
enter
process
or
command
is
executed.
We
can
also
see
that
it
performs
set
namespace
functionality
to
change
or
adjust
namespacing
context.
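Capability context can also be used as a match condition in policies. As a rough sketch (the matchCapabilities selector shape here is an assumption on my part, based on the capability filtering described in this webinar), a rule could be restricted to processes whose effective capability set includes CAP_SYS_ADMIN:

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: watch-setns            # illustrative name
spec:
  kprobes:
  - call: "sys_setns"          # the syscall nsenter uses to switch namespaces
    syscall: true
    selectors:
    - matchCapabilities:
      - type: Effective        # check the effective capability set
        operator: In
        values:
        - "CAP_SYS_ADMIN"
```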
This is subject to the point that anybody with CAP_SYS_ADMIN can automatically access any file, for example; Tetragon is independent from that perspective. This shows both the prevention of file access again, and also the ability to monitor capability changes and the capability context of any system call and any runtime behavior observed.
A
I
think
john
already
answered
this
question
so
we'll
move
on
and
actually
start
summarizing
a
little
bit.
So
we've
seen
a
variety
of
things.
At
this
point,
we've
seen
tetragon
be
able
to
provide
both
observability
across
the
stack.
So
we've
seen
file
access,
we've
seen
data
access,
we've
seen
a
variety
of
network
behavior,
both
from
a
connectivity
perspective,
protocol,
parsing,
http,
dns,
tls,
you've
seen
capabilities
tracing.
So
what
are
the
capabilities?
Is
capsis
admin?
Is
it
capnet
admin?
Is
it
ppf
so
seeing?
A
What
are
the
capabilities
of
a
particular
system
call
or
process
execution
or
some
other
criminal
activity,
as
well
as
privilege,
escalation,
so
being
able
to
understand
the
privilege
that
a
particular
system
call
is
subject
to
or
is
is
equipped
with
the
file
access?
You
know:
we've
seen
the
tcp
visibility
with
the
round
trip,
time,
visibility
and
one
of
the
initial
slides,
as
well
as
the
raw
system,
call
visibility.
What
are
the
system
calls
being
made
as
well
as
the
process
execution,
including
the
process
ancestor
tree,
so
understanding?
A
Not
only
what
is
my
process
but
who
has
spawned
me
and
who
has
spawned
the
process
that
spawned
that
process
and
so
on?
We've
seen
examples
of
prometheus
metrics
we've
seen
the
example
of
grafana
dashboards,
we've
seen
in
particular
the
splunk
integration
we've
seen
the
json
output
that
can
be
fed
with
fluency
into
any
system.
You
want,
for
example,
for
example,
into
an
into
an
elasticsearch
cluster.
The
tracing
in
the
metrics
can
also
be
exported
using
open
telemetry.
If
there
is
desire
all
of
cilium.
A
Before
we
do
that,
let's
jump
back
and
answer
this
question:
are
there
any
plans
in
maintaining
rule
sets,
or
is
this
already
part
of
tetragon?
So
yes,
actually,
let's
go
through
this
slide
because
it
mentions
this,
so
there
is
essentially
tetragon
that
is
available
in
the
in
the
and
the
cilium
tetragon
repository.
A
What
is
part
of
the
open
source
repository
is
the
following
from
a
visibility
perspective.
The
process
and
system
called
visibility
that
we've
seen
all
the
layers
through
there
for
network
visibility
and
file
access
monitoring,
as
well
as
basic
capabilities
and
name
spacing
visibility
and
on
the
enforcement
we
can
do.
The
system
call
based
enforcement
based
on
k,
probes
and
trace
points
in
addition
to
that,
isovenant
offers
a
tetragon
enterprise
distribution.
A
First
of
all,
it's
it's
a
hardened
enterprise
distribution
of
tetragon,
so
it
has,
for
example,
extended
end-of-life
support.
We
of
course
offer
enterprise
support
for
tetragon
as
well,
but
then,
in
addition
to
that,
it
has
advanced
capabilities,
including
extended
network
visibility.
This,
for
example,
includes
the
round
trip
time
or
the
latency
measurement
on
the
tcp
side,
as
well
as
the
dns
visibility,
the
hdp
and
https
visibility
with
ktls,
as
well
as
all
of
the
tls
visibility
that
we've
seen.
A
It
features
the
siem,
the
siem
integration
directly
with
splunk
splunk
caps,
the
process
and
street
three
information,
so
understanding
the
full
context
of
who
has
spawned
whom,
as
well
as
high
performance
protocol
parsers
and
extended
aggregation
and
filtering
logic
on
the
file
access
side.
While
the
open
source
version
features
file,
access,
monitoring,
the
enterprise
version
can
also
do
file
integrity,
monitoring
with
documents,
shaw,
256
as
well,
on
the
runtime
or
on
the
enforcement
side.
A
The
enterprise
version
features
extended
runtime
enforcement
capabilities
that
are
more
automated,
so
the
system
call
based
enforcement
in
the
open
source
version,
cid
based
or
json-based
can
enforce
rules
as
written.
The
advanced
enterprise
edition
has
additional
automation
around
kubernetes
and
it
has
a
baseline
policy
set
which
can
do
threat
detection
for
known
threats,
as
well
as
simplify
the
installation
of
enforcement
rules
as
well.
A
We
have
already
a
couple
of
covered
a
couple
of
questions,
but
I
see
more
questions
coming
in.
Can
you
run
tetragon
on
a
cluster
mesh
deployment
and
would
applying
tetragon
policy
follow
the
same
principle
of
cmp,
where
you
need
to
apply
to
each
cluster
manually
versus
a
single
touch
point?
A
Yes,
you
can
apply
tetragona,
you
can
run
tetragon
in
a
cluster
mesh
deployment.
Tetragon
can
be
deployed,
it
can
be
deployed
independently
of
psyllium.
It
does
not
require
psyllium
to
run
if
selim
is
running.
Tetragon
will
extract
additional
visibility
from
cilium
itself,
so
it
will
benefit
from
a
silver
installation,
but
it
is
not
required
to
be
there
in
terms
of
policy.
The
policies
work
exactly
the
same
as
the
cilium
network
policy
in
a
cluster
mesh
context,
so
you
will
have
to
install
them
or
load
them
into
individual
clusters.
B: There was also a question about whether we use tracepoints versus kprobes, because our examples are kprobes. The Tetragon base also knows how to do tracepoints, so if you want, you can do tracepoints, but the problem with tracepoints is that they need to be at the syscall level or in specific spots already in the kernel. So we do use them in some places, and Tetragon will try to smartly use them in the right places and use kprobes where it can.

Our policies might say kprobes, but under the covers the Tetragon agent is trying to find the best mechanism that your kernel can support to do the filters, so on newer kernels you'll even get some of the fancier mechanisms.
A: Awesome, great. Before we go to the Q&A: if you want to learn more, Tetragon is covered in the Security Observability with eBPF booklet, or report, that we have done with O'Reilly. There's actually a bigger book coming on eBPF, but for now you can freely download this Security Observability with eBPF report, which gives an introduction to Tetragon and some of the background on why we created Tetragon.
Tetragon is also featured in the enterprise hands-on labs, which give you a way to try out Tetragon and, with Instruqt, actually get your hands dirty with Tetragon without having to install it yourself. It essentially sets up a sandbox environment for you, so you can try out Tetragon and play around with it. You can also attend the virtual summer school that starts July 19; it's an entire day focused on Tetragon, service mesh, and a variety of other topics.
The link is in the slides, and we'll make it available to all attendees afterwards as well; if you are interested, you can sign up. I see a couple of questions are already coming in. If anybody has questions, feel free to ask them in the chat and we'll be happy to answer. I see Cornelia is also posting all the links, that's great. A comment from Matthias: a pack of easily installable security rules is missing, like in Falco. Yes, in fact, we don't necessarily want to recreate everything.
A
So,
as
mentioned,
we
are
currently
in
implementing
a
file
called
rules.
Translators
you
can
actually
bring
your
falco
rule
sets
whether
this
is
the
existing
falco
rule
set.
That
is
in
the
repository.
You
may
have
your
own
and
enforce
them
with
tetragon
the
benefit
there
is
that
you
will
essentially
benefit
from
the
real-time
enforcement
behavior
of
touchagon,
so
you
can
enforce
income
with
in-kernel
capabilities
instead
of
going
to
user
space.
A
A
Also,
john,
if
you
want
to
add
anything
to
to
any
of
the
points,
feel
free
to
do
so
as
well.
B: ...you know, further reducing the overhead of some of these hooks. But most of the Tetragon hooks are out of the hot path, which is sort of the advantage of doing these networking hooks versus inline methods, like if you were to think of sFlow or NetFlow, where you grab every packet and then try to analyze the data. Tetragon works at the socket layer and in the kernel, so a lot of these things are not hot-path items.
A: Yep. I also see another question that came in that's not strictly Tetragon-related, but we can of course answer that as well: a reaction to the blog post from Buoyant, or Linkerd, on sidecar proxies in service mesh. For those without the context: we released a service mesh as part of Cilium at a beta level last December, and we are marking the Cilium service mesh GA with 1.12, coming out in about two to three weeks. The big difference of the Cilium service mesh is that, in addition to the existing Istio integration, it offers a sidecar-free version of a service mesh. It allows running a per-node proxy, allows some of the service mesh functionality to be done entirely in eBPF without a sidecar, or allows running the proxy at a different granularity, for example per namespace or per service account. And there is debate going on about whether this is the right model, or what the right model is.
What is the better model? This blog post pointed out several questions, or several aspects. I think there is a lot of good content in that blog post, some of which I don't necessarily agree with. From a multi-tenancy perspective, Cilium has been running in a per-node proxy configuration for years, very successfully, in very large deployments. I think the claim that this is a lot of hard work and impossible is a bit weird, because we have been running in that configuration for years successfully. And I think there is another angle which is very, very interesting.
A
The
the
claim
or
the
the
aspect
that
a
per
node
proxy
is
dangerous
from
a
perspective
of
having
a
single
proxy
share,
multiple
secrets,
which
is
actually
something
that
we
agree
with,
but
we've
found
what
we
believe
is
a
better
solution
which
is
to
actually
extract
the
mtls
portion
outside
of
the
datapath
proxy
entirely
and
make
it
separate.
There
is
a
blog
post
on
this
that
has
been
released
and
we
can
link
to
it,
which
essentially
shows
a
model
where
the
mutual
authentication
is
done
with
a
separate
user
space
agent.
A
So
we
see
that
as
a
very
ideal
solution
in
terms
of
security
from
a
mutual
authentication
perspective,
so
I
think
that's
also
not
necessarily
a
valid
argument
against
the
sidecar
free
model.
That
said,
we
are
not
in
a
position
where
we're
saying
nobody
should
be
running
cycle
free
proxies
at
all.
A
In
fact,
we've
done
the
istio
integration
first
and
have
been
running
that
for
years
with
users,
so
that's
still
kind
of
the
first
implementation
we
have
done
and
then
based
on
a
lot
of
user
feedback
who
has
or
have
requested,
can
you
find
a
way
to
run
or
provide
service
mesh
functionality
without
a
site
called
implement
this
additional
way
of
running
service
mesh?
I
hope
this
was
a
kind
of
sufficient
answer
to
this.
I
don't
think
there
is
necessarily
a
right
or
wrong.
A
We
are
trying
to
operate
as
much
as
possible
on
user
feedback
and
implement
and
provide
what
our
users
are
asking
us
for
we're
not
preventing
or
trying
to
prevent
anybody
from
running
a
sidecar.
If
that's
the
model,
they
would
like
to
run.
A
Another
question
is
prometheus
exporting
supported
in
the
open
source
flavor.
Yes,
it
is
so
the
metrics
and
premium
export
is
supported.
The
enterprise
version
does
have
additional
visibility
as
as
laid
out
here
in
terms
of
dns
http
https
tls.
The
process
ancestor
tree
as
well
as
some
of
the
high
performance
protocol
parsers
and
some
of
the
network
visibility
is
extended,
but
the
metrics
themselves.
The
metric
export,
is
all
in
open
source.
A
Don't
know
if
this
is
the
best
place
to
ask,
but
I
would
like
to
know
where
does
tetragon
write
its
events
to
I'm
using
export
finally
modify
and
then
reading
it
with
tail?
Follow
this
works,
but
I'm
sure
there
is
a
better
way.
I'm
running
tetragon
natively
on
kubernetes.
Maybe
sean.
Can
you
answer
that
question
briefly?
Yeah.
B: Yeah, so: Fluentd, and then export this into their SIEM, whatever that happens to be. You can also use Fluentd just to aggregate the logs and dump them somewhere else, which is what I do a lot of times in development: if you just have a lot of nodes, you want to see all the logs aggregated. So those are sort of the common, I'd say, use cases that are in production. There's also a gRPC endpoint you can attach to.
A: Another question: is there any sort of admin UI, maybe integrated with Hubble? There is an integration with Hubble UI, which we have not released yet; we will release it soon. It is an integration where the visibility from a runtime perspective will essentially be visualized in Hubble UI, as part of the existing Hubble UI.

All of these events can also be fed into Timescape, which is part of our enterprise offering. It's a time-series database where you can essentially collect all of this observability, store it persistently, and then query it, and again run Hubble UI on top of that. The time-series database actually offers the plain Hubble API, so you can run the Hubble observe CLI, the API, the Hubble UI, and all the Hubble tooling.
A
On
top
of
the
timescape
time
series
database
again,
we
will
be
looking
into
policy
into
runtime
policy
management
as
well
from
a
centralized
perspective.
Right
now
we
have
automation
with
a
variety
of
automation,
tools
like
cf
engine,
puppet
ansible
and
so
on,
but
we
will
be
looking
at
providing
something
similar,
as
we
have
done
with
the
network
policy
editor
for
the
runtime
site
as
well.
Now I see two questions in the Q&A section. What's the expected resource usage of Tetragon per node; it runs as a DaemonSet, right? Yes, it runs as a DaemonSet, so there's an agent running on each node. The overhead will very much depend on the tracing policies that you load and the aggregation that you configure.

If we go back here: because of the flexibility of eBPF, we can do a lot of aggregation in the kernel, so depending on whether you want to see every single system call that is being made, or whether you want to see, for example, only namespace changes or only access to certain files, the overhead will differ. It can be anywhere from one percent to 25 percent,
I would say. It really depends on how much you want to see, at what granularity you want to aggregate, and whether you want to see only certain sensitive events or have a full system call log. What's the pricing and licensing model of Tetragon Enterprise? Tetragon Enterprise is part of Cilium Enterprise from Isovalent, and we will embed it into the price; it's very similar to Cilium Enterprise. It's a per-node subscription at the base, with a scale discount as your infrastructure grows.
A
Of
course,
as
I
mentioned,
you
can
run
tetragon
completely
independently
of
sodium,
so
you
can,
of
course,
also
purchase
tetragon
enterprise
separately.
If
you
run
both,
we
will
of
course
give
you
a
discount.
But I think we have covered all the questions that were posted, so let me maybe repeat the follow-ups here again. The eBPF report, the booklet: a great way to get involved and read more about Tetragon. The hands-on lab with Instruqt: a great way to, within minutes, have essentially a sandbox Kubernetes environment with Tetragon installed, where you can try out Tetragon but also other aspects of Cilium Enterprise. And then the virtual summer school day on July 19, where we will host Tetragon as well as service mesh.
A
On
top
of
that,
there
is
a
tetragon
slack
channel
on
the
psyllium
slack.
So
if
you
go
to
psyllium
dot
io,
you
will
find
a
button
to
dislike.
You
can
join
the
slack
or
slack
server.
There
is
a
tetragon
channel
with
all
the
tetragon
developers
on
and,
most
importantly,
if
you
want
to
get
involved
outside
or
in
addition
to
using
tetragon
feel
free
to
contribute,
tetragon
open
source
repository,
github,
slash
students,
tetragon,
we
very
much
encourage
contributions
in
all
forms
doesn't
have
to
be
code
contributions,
but
also,
let
us
know
what
features.
A
Would
you
like
to
see?
We
already
got
some
feedback
today.
Rule
sets
we
would
love
to
have
a
discussion.
What
type
of
rule
sets?
What
integrations
do
you
want
us
to
implement,
for
example,
part
security
policies
automatically
that
are
getting
deprecated?
Do
you
want
us
to
support
something
other
than
the
falco
rule
set,
and
so
on
with
that,
I
would
like
to
thank
everybody
for
attending
this
webinar.