From YouTube: Maturing with Open Source: Cultivating an API Ecosystem for Enterprise Success – UnitedHealth Group
Description
Learn more about Kong: https://bit.ly/2I2DypS
Building a thriving API ecosystem is more than offering a stable and performant gateway solution; it’s about the people, processes and practices which contribute to long-term success. Since Kong Summit 2019, our experiment in open source has grown to support over 90% of UnitedHealth Group’s RESTful traffic. The path to success within this Fortune 10 healthcare giant has been paved with intriguing challenges, unexpected hazards and valuable lessons. Watch this Kong Summit 2020 session to learn our roadmap for how to build a thriving API ecosystem, engineered for long term success, with Kong at its core.
Hi, I'm Ross Sabrisha. I was hired by Optum about four years ago, and I've been working in the gateway space ever since. I'm a Kong Champion, and I'm based out of Denver.
Let's kick this thing off by talking a little bit about Optum. Optum is a healthcare technology company. It's part of UnitedHealth Group, and one of its important missions is to provide the tech infrastructure for this Fortune 7 healthcare giant. As you can see on this screen, there are over 300,000 employees in this organization, thousands of APIs, and countless integrations, and since our presentation last year, Kong and Optum have seen some remarkable growth in adoption within the space. So let's start there, around this time.
Now, surely growth at this scale must involve an equally precipitous increase in the work required to add capacity for it all, right? I mean, this very scientific graph shows the breakdown of feedback and advice I received when the platform was really starting to ramp up: concerns about capacity were by far and away the most common topic. So for those folks, I'd like to share the complete record of everything we did to add this new capacity. Buckle up.
One of the first asks came from a large enterprise effort to add a new centralized log sink. The gateway seemed like a good fit as a central point to capture security-related metadata for API transactions, rather than asking each individual API provider, across their variety of application stacks, to implement the log sink themselves.
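The talk doesn't show exactly how that was wired up, but this is conceptually what Kong's bundled http-log plugin gives you, so here is a minimal sketch against the Admin API. The Admin and sink URLs are placeholders, and your centralized sink will obviously differ.

```python
import requests

# Hypothetical addresses; substitute your own Admin API and log sink URLs.
KONG_ADMIN = "http://localhost:8001"
LOG_SINK = "https://logsink.example.internal/api-transactions"

# Enable Kong's bundled http-log plugin globally, so metadata for every
# proxied request (service, route, consumer, latencies, status) is shipped
# to the central sink instead of each API team re-implementing logging.
resp = requests.post(
    f"{KONG_ADMIN}/plugins",
    json={
        "name": "http-log",
        "config": {
            "http_endpoint": LOG_SINK,
            "timeout": 10000,     # ms to wait for the sink
            "keepalive": 60000,   # ms to keep the sink connection alive
        },
    },
)
resp.raise_for_status()
print("http-log enabled, plugin id:", resp.json()["id"])
```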
This task required a fresh rework of our ingress to enable the mutual TLS handshake to happen on the gateway, as well as security payload scanning. Much like Kong, we wanted an open source solution to help with threat protection as well. We were able to leverage the ModSecurity v3 open source web application firewall, along with the OWASP Core Rule Set, to manage a set of well-defined attack vector checks directly from our Kong nodes.
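The WAF itself is compiled into the ingress layer rather than configured through Kong's Admin API, so rather than guess at that wiring, here is the kind of smoke test we find useful once the rule set is in place. The proxy URL is hypothetical, and the exact blocking status depends on your ModSecurity configuration.

```python
import requests

# Hypothetical proxy URL for an API fronted by Kong with ModSecurity + CRS.
PROXY = "https://gateway.example.internal/demo-api/items"

# Send a classic SQL-injection probe; with the Core Rule Set active at the
# ingress we expect the WAF to reject it before it ever reaches the upstream.
resp = requests.get(PROXY, params={"q": "1' OR '1'='1"}, timeout=10)
print("status:", resp.status_code)
assert resp.status_code == 403, "expected the WAF to block this request"
```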
This was a great experience because it opened our team up to more of the application security side of engineering, and there were some really great community members on both the rule set and WAF teams. The new ingress approach for mutual TLS also gave us direct visibility into the threats and the blocks we faced on API transactions. Now, lastly, a long-time ask from customers was how to support multiple auth patterns on a given proxy service.
Now, for a long time, Kong had offered an anonymous user pattern that enabled Kong to run multiple programmatic authentication patterns against a given proxy, but one consequence there was needing a Kong consumer resource in the context with the required ACL group set. This caused problems for a use case many customers wanted, where they desired the proxy to support both programmatic and user-based authentication. Because our user-based authentication came from third-party identity providers, it meant that there was no Kong consumer resource to be had in these types of interactions. But then came a really neat PR from Oppo of the Kong engineering team, which enabled programmatically setting the authenticated groups on the context of the transaction, bypassing the need for a Kong consumer resource at all. This elegant solution enabled great flexibility in the authentication plugins and patterns that Kong could support per proxy.
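The talk doesn't spell out the configuration, but the anonymous-consumer pattern it references looks roughly like the sketch below. The service, consumer, and group names are made up, and the ACL field is named "whitelist" rather than "allow" on older Kong versions.

```python
import requests

ADMIN = "http://localhost:8001"   # hypothetical Admin API address
SERVICE = "claims-api"            # hypothetical service name

def enable(plugin, config):
    # Attach a plugin to the service through Kong's Admin API.
    r = requests.post(f"{ADMIN}/services/{SERVICE}/plugins",
                      json={"name": plugin, "config": config})
    r.raise_for_status()
    return r.json()

# The "anonymous" consumer lets a request that fails one auth plugin fall
# through to the next plugin instead of being rejected outright.
anon = requests.post(f"{ADMIN}/consumers", json={"username": "anonymous"}).json()

# Programmatic callers authenticate with an API key...
enable("key-auth", {"anonymous": anon["id"]})
# ...while user traffic presents a JWT minted by a third-party IdP.
enable("jwt", {"anonymous": anon["id"]})

# The ACL plugin then decides who gets through. With the newer
# authenticated-groups support, group membership set by an auth plugin is
# honored even when no Kong consumer exists for the caller.
enable("acl", {"allow": ["claims-readers"]})
```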
Well, we're here to say that now, after a period of three years and 22-plus Kong upgrades across multiple environments, the process is really just not that scary. Almost all of the upgrades went smoothly, all except one, which we'll touch on shortly and which exemplifies that, even when things go a bit sideways, recovery with open source technology is not a process to be afraid of. Breaking down the Kong upgrade process, it's engineered to be simple and impact-free, first starting with the running version of Kong taking traffic.
Once satisfied with the behavior seen on the newer version of your Kong node, you can execute a kong migrations finish to wrap up a few final database adjustments needed for full support of the newer version of Kong. It is at this stage that you want to send traffic only to your newer versions of Kong and ensure that the older-version Kong nodes are taken out of rotation.
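Condensed into commands, the flow described here is roughly the following sketch. It assumes the post-1.0 migrations framework; the config path and the load-balancer cutover are placeholders.

```python
import subprocess

def kong(*args):
    # Run a Kong CLI command on this node and fail loudly on error.
    subprocess.run(["kong", *args], check=True)

# 1. From a node running the NEW Kong version, apply the pending,
#    non-breaking schema migrations; old-version nodes keep serving traffic.
kong("migrations", "up")

# 2. Start the new-version node and shift traffic onto it (the load-balancer
#    change itself isn't shown). Watch behavior and logs before moving on.
kong("start", "-c", "/etc/kong/kong.conf")   # config path is illustrative

# 3. Once satisfied, take the OLD nodes out of rotation, then finalize the
#    schema so the newer version is fully supported.
kong("migrations", "finish")
```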
We have three Kong environments: a dev sandbox environment, a stage environment which is customer facing, and then our production environment. The upgrade went off without a problem in dev, so we had confidence that stage would face no issues. Then, once we got around to upgrading stage and the migrations up had seemingly completed, we ran a migrations finish. That's when we got the error that no Lua Kong programmer likes to see: stack traces of the dreaded "attempt to index a nil value."
This was a harbinger of the error that started impacting around 10 percent of our stage traffic, and it would continue to do so almost until the next full day. Let's break down this real-world scenario, to gain more confidence in recovering from a situation like this in the open source space, and the mistakes we made, so you don't have to make the same ones. Well, step one for us was to capture and ship logs and screenshots of some of the errors we were seeing directly to the community in the Kong repo.
Now, this is where our biggest mistakes were made. In situations like this, we rely on our database backup and restore process to bail us out. This process was very underutilized and, as it turns out, fairly immature: the restoration process was incapable of restoring data to a cluster that had already undergone a schema change since the backup was taken, which is exactly what happens during a Kong migration.
A
While
we
initially
did
fix
this,
the
time
that
we
found
ourselves
manually,
editing,
database
records
and
key
space
schemas
to
complete
the
car
migrations
manually,
trust
me:
this
is
not
a
situation.
You
want
to
find
yourself
in
begging
a
kong
engineer
for
some
guidance
via
zoom,
while
leadership
is
heavily
engaged.
Let's
just
say
that
the
database,
backup
and
restoration
should
always
be
the
go-to
first
option
for
fast
remediation
of
a
problematic
upgrade.
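The talk doesn't prescribe a backup mechanism, but for a Cassandra-backed cluster like the one described, a pre-upgrade snapshot can be as simple as the sketch below. The keyspace name and tag are illustrative, and restoring also means restoring the matching pre-migration schema.

```python
import datetime
import subprocess

KEYSPACE = "kong"   # illustrative; match whatever your kong.conf points at
TAG = "pre-upgrade-" + datetime.date.today().isoformat()

# Take a Cassandra snapshot of the Kong keyspace on this node BEFORE running
# `kong migrations up`, so a failed upgrade can be rolled back to a backup
# that matches the pre-migration schema instead of hand-editing records.
subprocess.run(["nodetool", "snapshot", "-t", TAG, KEYSPACE], check=True)
print(f"snapshot {TAG} taken for keyspace {KEYSPACE}")
```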
Even after using the Kong application for over three years, we still get critical enhancements that meet our needs and use cases. One such example could be seen this year in Thibault's contribution that enables dynamic upstream keepalive pools. This is what helped us route to certain types of secure APIs that shared the same initial IP address and port for their ingress, with all of our growth and scale over the last year.
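The contribution itself lives inside Kong, but the knobs it exposes are plain configuration. Here is a hedged sketch of setting them through environment-variable overrides; the values are placeholders rather than tuned recommendations, and the property names assume a Kong 2.x series release.

```python
import os
import subprocess

# Kong reads any KONG_* environment variable as an override of kong.conf.
# These upstream keepalive settings arrived with the dynamic keepalive-pool
# work, which keys pooled connections by upstream address and TLS attributes,
# so secure upstreams sharing an ingress IP and port no longer collide.
os.environ.update({
    "KONG_UPSTREAM_KEEPALIVE_POOL_SIZE": "512",
    "KONG_UPSTREAM_KEEPALIVE_MAX_REQUESTS": "1000",
    "KONG_UPSTREAM_KEEPALIVE_IDLE_TIMEOUT": "60",
})
subprocess.run(["kong", "restart"], check=True)
```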
Another great example of how staying on the latest version has helped us overcome problems was a code contribution by a community member named Harish. Throughout 2019, on the Kong 1.x series, we were hit by an odd bug that constantly caused Cassandra to return a null pointer exception when Kong resources were being rebuilt frequently due to changes in configuration, and neither we nor the Kong team had the answer to what was causing the problem for close to eight months. Then, seemingly out of nowhere, the fix came in from the community.
So, any of you Kong Cassandra users out there who need to query enough resources to cause Cassandra paging: if you're on the elderly versions of the Kong 1.x series, or potentially on the ancient versions of the Kong 0.x series, then you're going to want this fix to be able to perform proper Cassandra paging while running your Kong instance.
Now, the other added benefit of staying up to date on the latest versions is that your Kong upgrades stay smaller in change scope. The bigger the jumps you must take to get to the latest version, the more potential problems you're going to run into as you jump from migration to migration to migration. It's much better to take things in small scope.
I'd also say that support for legacy versions of the open source copies of Kong is going to be minimal, given what it takes to run and maintain all the features Kong releases. If you go to people and say, "Hey, I'm running version 0.14.x and I'm facing these issues," they're going to want you to get back to the latest version, because it's very likely those issues have already been fixed in the latest releases.
Okay, another great way to contribute to customer confidence, promote adoption, and create advocates for your platform is to empower customers through effective operational support. Let's justify that statement for a quick moment and examine why we need operational support. I mean, why do we really need it? We have a GitOps-based self-service model. We have detailed event logging and metrics. We have mature documentation which covers everything from those topics to the security patterns we support and troubleshooting tips. So why do we need ops support?
These types of discussions. Secondly, troubleshooting, because no matter how many FAQs you write, tips you provide, or how complete your logging is, sometimes the customer is just going to need help to figure out what's wrong. And lastly, and least frequently needed, non-standard proxy management. These are effectively manual work orders to provision rare changes we don't support through self-service. Now, we've kind of touched on the why; let's dive into the how by looking at the support flows we manage to deliver support for all of these needs. We'll start over here with non-standard proxy management.
B
We
have
a
fairly
dedicated
intake
request
for
these
in
the
form
of
a
specific
git
issue
with
a
template.
Clients
will
submit
this
issue
to
make
their
modification
request.
An
engineer
from
our
team
will
be
assigned,
and
that
engineer
is
responsible
for
fulfilling
the
request
and
closing
the
issue,
pretty
straightforward,
skipping
troubleshooting.
For
a
moment,
we
have
three
support
flows
for
our
integration,
consultation
intake.
We
support
these
consultations
through
email,
our
weekly
gateway
office
hours
call
and
because
it's
sometimes
just
unavoidable
dedicated
meetings.
Now, finally, we come to troubleshooting. This is by far and away the most common reason for engaging support from the gateway team. We handle these requests differently based on priority. In the low priority group we have non-production issues and production build-out issues. We will typically respond to requests for troubleshooting assistance through email, office hours, and most commonly through our internal chat app, Flowdock; if you're not familiar with Flowdock, it's very similar to Slack. And then in the high priority category we have live production issues.
B
These
are
treated
very
seriously
and
triaged
on
a
moderated
war
room
and
we
have
24
7
on-call
paging
support
for
this
purpose.
So
this
is
the
full
picture
of
our
operational
requirements
and
the
support
flows.
We
have
for
them.
Let's
quickly
review,
how
we
manage
these
support
flows
because,
as
you
can
see,
this
has
the
potential
to
be
a
little
bit
disorganized.
B
We'll
start
with
the
24
7
paging
groups.
Most
of
you
are
going
to
be
familiar
with
a
system
like
this.
We
nominate
two
engineers
from
our
teams
in
weekly
rotations.
One
of
them
is
the
primary
on-call.
One
of
them
is
the
backup
on-call
seems
pretty
straightforward
so
far,
but
remember
that
primary
on-call
they'll
have
some
additional
responsibilities
during
their
week,
which
are
critical
to
the
success
of
the
system.
Emails are similarly unremarkable, but it's worth noting that our primary on-call is responsible for our inbox during their on-call week. Now, our office hours is a weekly meeting open for 90 minutes where all questions, comments, and feedback are welcome. It's a great way to cut down on those dedicated meetings we want to avoid and to take the pulse of our community. The primary on-call is also responsible for running the office hours call on their week.
B
This
is
our
most
active
support
channel
by
far,
and
it
really
represents
some
of
the
best
and
worst
that
our
model
has
to
offer
on
one
hand,
customers
can
get
quick
feedback
on
their
questions.
The
community
has
the
opportunity
to
participate
and
it
enables
a
dialogue
without
being
as
interruptive
as
a
meeting
would
be.
On
the
other
hand,
short
inquiries
don't
always
stay
short.
Topics
can
often
stray
from
gateway
integrations
to
more
general
engineering,
support
topics
like
certificate
management
and
the
line
between
valuable
service
and
cumbersome
time.
B
Sync
is
not
always
clear
either
way,
it's
the
primary
on-call's
responsibility
to
sort
that
out
during
their
week,
as
is
it,
is
their
responsibility
to
handle
those
manual
work
order,
get
issues
I
talked
about
now,
I'm
at
risk
of
losing
credibility
with
some
of
you.
If
I
fail
to
mention
direct
one-to-one
pings,
I
could
make
a
full
documentary
about
how
to
deal
with
these
things,
but
for
the
sake
of
time
I
will
just
say
that
we
try
to
send
all
direct
one-to-one
pings
to
our
official
support
flows
and
leave
it
at
that.
B
So,
let's
see
how
our
system
has
coped
with
the
scale.
You
recall
this
past
year
saw
a
10x
or
1
000
increase
in
the
proxies
on
our
platform.
Here
you
can
see
the
size
of
the
team
rep
responsible
for
providing
operational
support.
In
addition
to
gateway
engineering
responsibilities,
we've
gone
from
a
total
of
seven
to
ten
or
an
increase
of
about
40
percent,
so
for
an
increase
in
1
000
in
traffic,
a
scale
of
40
percent
in
operations
seems
fairly
reasonable.
The
system
is
very
much
a
work
in
process
process.
Remember this graph from the start, showing our increase in traffic volume over the past year. This is expressed in transactions per month, but it's a little bit more meaningful if we use a different metric. Expressed as a proportion of the company's RESTful traffic, we can see that Kong now handles over 90 percent of all REST API traffic in the company. It's largely the only component common to it all, and this inherently gives us the reach to affect almost the whole API space. With the support of leadership, it gives us more.
B
It
gives
us
the
reach
to
govern
the
api
space
and
to
understand
that
value.
Let
me
talk
you
through
a
quick
example.
Let
me
introduce
you
to
company
x.
Company
x
is
a
large
organization
with
multiple
api
development
teams.
The
apis
produced
by
these
teams
share
no
consistent
design
frameworks.
Have
no
common
quality
standards,
often
employ
unorthodox
or
insufficient
security
models
and
have
no
common
means
of
discovery
or
documentation
for
the
unfortunate
clients
of
company
x.
So let's talk about how to do it right. Good governance starts with good guidelines, so we'll begin by discussing how to think about our guidelines, our rules, because, folks, the quality of your rules is what's going to make or break your attempted API governance. We want to offer governance on both the technical product and the design process. Our rules should be kept current and useful, and a great way to do that is to document them visibly.
Next, we have a rule called the "provide your specification using OpenAPI v3" rule. This just says that the design work the API providers have to do must be done with an OpenAPI v3 spec. Seems straightforward enough. Let's move to something a little bit more RESTful and talk about some of our resource structure rules. We have rules about keeping our URLs verb-free, so using /messages instead of something like /retrieve-messages.
Nobody wants to call a URL with 45 sub-resources in the chain. Sticking with the RESTful theme, let's examine a couple of rules around the use of HTTP methods and status codes. We require that our developers use HTTP methods correctly. This just means GETs for reads, POSTs for creates, PUT and PATCH for updates, DELETEs for deletes. You get the idea.
B
We
require
that
our
developers
use
standard
http
codes,
no
defining
your
own
custom
status
codes
for
the
application.
We
also
say
that,
in
addition
to
using
standard
http
codes,
we
must
use
the
most
specific
status
code.
Not
every
success
is
a
200..
Sometimes
we
have
201.
Sometimes
we
have
204
204s,
you
get
the
idea,
and
finally,
we
shouldn't
mix
our
success
and
error
components.
These
need
to
be
separate
structures.
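As a toy illustration of those two rules (most-specific status codes, and separate success and error structures), here is a hypothetical Flask resource; it is not one of our actual APIs, just a sketch of the shape we ask providers to follow.

```python
from flask import Flask, jsonify

app = Flask(__name__)
MESSAGES = {"123": {"id": "123", "body": "hello"}}   # toy in-memory store

@app.post("/messages")
def create_message():
    # A successful create is a 201, not a generic 200.
    msg = {"id": "124", "body": "created"}
    return jsonify(msg), 201

@app.delete("/messages/<msg_id>")
def delete_message(msg_id):
    MESSAGES.pop(msg_id, None)
    # A successful delete with nothing to return is a 204.
    return "", 204

@app.get("/messages/<msg_id>")
def get_message(msg_id):
    if msg_id not in MESSAGES:
        # Errors use their own structure, kept separate from success payloads.
        return jsonify({"error": {"code": "NOT_FOUND",
                                  "message": f"message {msg_id} does not exist"}}), 404
    return jsonify(MESSAGES[msg_id]), 200
```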
We have rules that govern not just the design of our APIs at design time, but also the performance of our APIs at runtime. We say that all APIs must have a 95th percentile latency less than some number of milliseconds; unfortunately, that number is secret information. We actually enforce this with the gateway timeout at the Kong layer. And finally, we have limits on the payload sizes which all of our APIs will accept; 50 megabytes is somewhat arbitrary.
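Both of those runtime rules map onto standard Kong features, so a rough sketch of enforcing them per service through the Admin API might look like this. The addresses, service name, and timeout values are placeholders, not our real internal numbers.

```python
import requests

ADMIN = "http://localhost:8001"      # hypothetical Admin API address
SERVICE = "claims-api"               # hypothetical service name

# Enforce the latency rule: if an upstream doesn't answer within the budget,
# Kong times the request out at the gateway.
requests.patch(f"{ADMIN}/services/{SERVICE}",
               json={"connect_timeout": 5000,
                     "write_timeout": 5000,
                     "read_timeout": 5000}).raise_for_status()

# Enforce the payload rule with the bundled request-size-limiting plugin,
# capping request bodies at 50 MB for this service.
requests.post(f"{ADMIN}/services/{SERVICE}/plugins",
              json={"name": "request-size-limiting",
                    "config": {"allowed_payload_size": 50}}).raise_for_status()
```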
In the interest of time, I'm going to skip over the nomenclature and taxonomy examples; these basically have to do with how to construct the API path on the APIs we expose. I'm going to skip right into how to enforce governance effectively. We need to make this process easy to follow and mandatory for it to have effect, for it to give us the ecosystem benefits we want.
B
We
started
out
with
a
manual
governance
process.
It
wasn't
much,
but
it
was
a
starting
point
and
it
was
a
way
to
introduce
the
process
at
optum.
It
was
driven
through
an
archaic
system.
We
had
manual
reviews
for
api.
Taxonomy
developers
would
be
required
to
schedule
separate
security
scans,
which
sometimes
would
take
weeks
to
run.
Needless
to
say,
this
led
to
inconsistent
enforcement.
Long
delivery
raise
lots
of
frustration,
lots
of
exceptions,
talk
about
exceptions
more
in
a
minute,
but
nowadays
we
have
an
automated
governance
process.
B
This
is
a
get
ops
based
spec
driven
model
where
the
open
api
spec
is
effectively
the
key
to
the
key
to
start
governance
on
a
particular
api.
This
is
directly
linked
with
our
provisioning
for
the
purposes
of
enforcement.
If
your
api
has
not
gone
through
governance,
we
say
if
your
api
has
not
been
certified,
you
will
not
be
allowed
to
create
a
production
proxy
on
kong
and
optum.
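We can't show the internal pipeline here, but the shape of a spec-driven check is easy to sketch: read the OpenAPI v3 document and fail certification if rules are violated. This sketch assumes PyYAML and uses a deliberately tiny, illustrative rule set and status-code allow-list, not our real ones.

```python
import re
import sys
import yaml   # pip install pyyaml

# A toy spec-driven governance check: flag verbs in URLs and unexpected
# response codes declared in an OpenAPI v3 document.
VERB_PATTERN = re.compile(r"/(get|retrieve|create|update|delete|fetch)", re.I)
ALLOWED_CODES = {"default", "200", "201", "202", "204", "301", "304", "400",
                 "401", "403", "404", "409", "415", "422", "429",
                 "500", "502", "503", "504"}   # illustrative allow-list

def lint(spec_path):
    spec = yaml.safe_load(open(spec_path))
    problems = []
    for path, item in spec.get("paths", {}).items():
        if VERB_PATTERN.search(path):
            problems.append(f"{path}: URLs should be verb-free")
        for method, op in item.items():
            if not isinstance(op, dict):
                continue   # skip path-level keys like "parameters"
            for code in op.get("responses", {}):
                if str(code) not in ALLOWED_CODES:
                    problems.append(f"{method.upper()} {path}: unexpected status code {code}")
    return problems

if __name__ == "__main__":
    issues = lint(sys.argv[1])
    print("\n".join(issues) or "certified: no rule violations found")
    sys.exit(1 if issues else 0)
```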
B
This
automated
governance
process
also
links
directly
with
our
documentation
and
discovery
hubs.
This
makes
it
easy
for
apis
to
see
greater
levels
of
reuse
than
they
ever
would
have,
and
discovery
has
never
been
less
of
a
problem.
Apis
can
be
certified
start
to
finish
in
just
minutes
with
this
system.
In order for exceptions to be valuable and to work long term, we need to have accountability baked into this process. All of our exceptions are associated with a specific person by name, and we have a specific remediation plan and date: how are we going to fix the problem, and when are we going to fix it by? And to give that little extra bit of organizational enforcement, we also loop in the VP of the API development team requesting the exception for their acknowledgement and approval.
Finally, we do have exceptions to the exceptions. This is effectively just security: we will not compromise here for any reason. All APIs exposed through the company will have world-class, industry-level security, and there are really no bones to be made about that. Any other deadline can take a back seat in exchange for this benefit. Alrighty, so we're about done here. Let's wrap up with a couple of highlights. First, growing a platform in a large enterprise requires flexibility, so be flexible and deliver useful features that will delight your customers and create advocates for you,
even if you hadn't planned to implement those features beforehand. We should be fearlessly pursuing updates and upgrades to keep the platform stable and competitive and to increase user confidence; this is doubly important for open source based applications. And we need to empower our customers with effective operational support. If we do all of this, our platform will have undeniable value.