Description
For more great content, visit https://solocon.io
SoloCon 2022:
[Lightning Talk] CARFAX: Gloo Edge at Scale
Speakers:
Mark Portofe
Director, Platform Engineering, CARFAX
Sebastian Chacko
Solutions Architect, CARFAX
Abstract:
In this lightning talk, join CARFAX Director of Platform Engineering Mark Portofe and Solutions Architect Sebastian Chacko to learn more about using Gloo Edge at scale.
Track:
Edge and API Gateway
Samantha Kim: Hi everyone, and welcome to today's lightning talks. I'm Samantha Kim, part of the marketing team here at Solo.io, and I'm excited to welcome our lightning talk speakers. Please help me extend a warm welcome to Mark and Sebastian from CARFAX, who will be sharing how they are using Gloo Edge at scale. Over to you, Mark and Sebastian.
Mark Portofe: Hey everyone, my name is Mark, and we'll be talking about Gloo Edge at scale. Just some quick introductions: I'm Mark Portofe, Director of Platform Engineering at CARFAX, and with me today is Sebastian Chacko, the Solutions Architect of our platform engineering department at CARFAX.
A little bit about the agenda. I'll tell you about CARFAX and the carfax.com domain, which is our main consumer-facing domain, plus a bit about how we moved up to AWS and our usage of Gloo. Sebastian will cover our high-level architecture and how we went about choosing an ingress controller, and then we'll talk about our lessons learned and what's next. So first, a little bit about CARFAX. Some of you may have heard of us.
B
You
might
see
the
car
fox
on
tv
and
heard
our
slogan
show
me
that
the
carfax
we
have
a
mission
of
providing
millions
of
millions
of
people
with
information
on
how
to
shop,
buy,
sell
maintenance,
their
car
with
a
lot
more
confidence.
To
do
that,
we
actually
have
a
ton
of
data
that
stands
behind
our
products
and
services
over
a
hundred
thousand
hundred
thirty
thousand
data
sources,
as
well
as
over
120
billion
records
regarding
vehicles
and
automobiles,
our
customers.
They
can
be
general
consumers,
dealers,
automotive
manufacturers,
banks,
insurance
companies.
So we have a wide array of customer types. Today we'll be focusing mostly on carfax.com, which targets the general consumer. We operate in various locations as well, in the US, Canada, and Europe; carfax.com predominantly serves the US market, with a little bit of traffic coming in from Canada as well.
A little bit about carfax.com: you can see our header and footer there, and some of the products and services we support, such as used car listings, the Vehicle History Report, and information on how to maintain or service your car. One thing that's unique about the consumer space on our customer-facing side is that, with some of the television ads we run, traffic can get a little spiky depending on when that TV ad runs.
Depending on the viewership, it might be during a big NFL playoff game, it might be during the Olympics, and so on, and we can see noticeable spikes in our traffic coming in, so we have to make sure we can scale to those needs. Right now we're ranked 353rd in the US by traffic volume per SEMrush, and that equates to a little over 2 billion requests coming into Gloo Edge on our architecture.
As you can tell, being a consumer-facing app, we really have to focus on speed, reliability, availability, and scalability, all the standard terms you might associate with a high-traffic website.
A little bit about how we got up to AWS and started leveraging Gloo for our needs. We started back in 2017.
The first application we migrated up to AWS was actually our CARFAX blog. Originally that was an Elastic Beanstalk application, and we migrated it to Kubernetes recently. We started back then in 2017 and progressed by migrating other applications up to carfax.com as well, such as our used car listings, our home page, and our car research pages. One thing to note: at that point in time we were in a hybrid state. We had on-prem data centers, and we were migrating up to AWS.
So we had traffic split between AWS and our on-prem data centers. In September of 2019 we began rolling out microservices leveraging Gloo Edge and started to convert over to it; we were on Traefik 1.0 at that point in time. As of September 2020 we had 100% of our traffic routing through AWS, and ultimately, in June of 2021, we had fully migrated over to Gloo Edge from Traefik. So we completed that migration last year.
Sebastian Chacko: Thank you, Mark. So once again, I'm Sebastian, Solutions Architect for the platform engineering department here at CARFAX. I want to apologize in advance for my voice; I'm fighting off a little bit of a cold, and you might hear me coughing, so sorry about that.
On this slide I'm going to talk a little bit about the architecture we have for carfax.com. Our Kubernetes environment, which encompasses Gloo Edge, is spread across two AWS regions, us-east-1 and us-west-2, and we have one EKS cluster in each of those regions. When a user wants to get to carfax.com, the first entry point they land on is CloudFront, which is AWS's CDN product. Apps that have caching configured at that layer get an immediate response, and for apps configured with S3 as an origin, CloudFront loads the content from S3 and serves it back to the user.

The piece where Gloo Edge comes in is dynamic content. CloudFront makes that call to one of our clusters, in us-east or Oregon depending on where you are, and that lands the customer on our external-facing load balancer, behind which Gloo Edge is running in our EKS environment, and that takes care of all the routing from there. Next slide, please, Mark.
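For illustration only (not CARFAX's actual manifests), the Gloo Edge gateway proxy is typically exposed to an external load balancer through a Kubernetes Service of type LoadBalancer; on EKS, an annotation can request an AWS NLB. The names, ports, and annotation below are assumptions for this sketch.

```yaml
# Hypothetical sketch: exposing the Gloo Edge gateway proxy (Envoy) through an
# AWS load balancer on EKS. Names and annotations are illustrative.
apiVersion: v1
kind: Service
metadata:
  name: gateway-proxy
  namespace: gloo-system
  annotations:
    # Ask the AWS cloud provider for a Network Load Balancer (assumption; the
    # talk does not specify which load balancer type CARFAX uses).
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  selector:
    gloo: gateway-proxy        # default label on the Gloo Edge proxy pods
  ports:
    - name: https
      port: 443
      targetPort: 8443         # Gloo Edge's default HTTPS listener port
```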
I know Mark talked a little bit about our journey, how we landed on AWS and how we migrated over the past two to three years. The first piece we had to solve was which compute environment or provider we were going to use. We evaluated the options available at the time (Beanstalk, EC2, ECS, and so on) and ended up landing on Kubernetes because we felt that gave us the best value.
Once we were done with that, the next and most important decision was choosing an ingress controller, because that is ultimately what gets the customer to where they want to go. That was our biggest decision at the time. We evaluated the choices that were available and landed on Traefik, which is another ingress controller, like Gloo Edge. It provided simple path-based routing at the time, fit our initial use case, and we were good with that. But as our Kubernetes footprint grew, we quickly outgrew the capabilities that Traefik had at the time.
We had apps coming on board that were SPAs, SSR apps, APIs, and so on, so we needed a controller that could provide granular path-based routing as well as redirection and, most importantly, API gateway functionality like auth, rate limiting, firewalling, and the ability to hit non-Kubernetes targets, and that could handle the load we were throwing at it. We had been using AWS API Gateway (we still are), but we wanted something that integrates a little bit better with our environment than that.
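To make the routing model concrete, here is a minimal sketch of the kind of granular, path-based routing and redirection a Gloo Edge VirtualService can express. The domain, paths, and upstream names are hypothetical, not CARFAX's actual routes.

```yaml
# Hypothetical sketch of granular, path-based routing and a redirect with a
# Gloo Edge VirtualService. Domains, paths, and upstream names are made up.
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: consumer-site
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - "www.example.com"
    routes:
      # Send an API prefix to a Kubernetes-backed upstream.
      - matchers:
          - prefix: /api/listings
        routeAction:
          single:
            upstream:
              name: listings-svc
              namespace: gloo-system
      # Redirect a legacy path to its new location.
      - matchers:
          - prefix: /old-reports
        redirectAction:
          pathRedirect: /vehicle-history
          httpsRedirect: true
```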
That's where we landed on Gloo Edge, which was the most feature-rich, technologically advanced option we evaluated at the time. Fast-forwarding to today, we're serving two billion plus requests per month through Gloo Edge, like Mark mentioned before. Mark, next slide, please.
On this slide we're going to talk a little bit about the lessons learned. The journey has taken a while, so we definitely had some positives and some lessons learned along the way. First, the positives: the Solo support is super awesome. They provide us a dedicated Slack channel and workspace where we can not only ask for help with debugging issues but also ask implementation-level questions, like when we're adding a new feature we can go ahead and ask them.
How do we add this? What is the Helm value for this? And they're super helpful with that, which is awesome. Then there's the traffic scale and chaos testing. I know we've mentioned the traffic scale a couple of times before: this domain handles a lot of traffic, and Gloo was able to handle everything we threw at it, with slightly better performance than we had with our Traefik 1.0 implementation, our previous ingress controller. And then chaos testing.
That's something we've tried to incorporate into our deployment workflow for big infrastructure releases; it's manual at this time. Basically, we go into our cluster, pick out a component, and try to fail different portions of it. With Gloo Edge, we went in and failed different components of it, and it was able to handle almost every scenario we threw at it with no issues. Moving over to the lessons learned: discovery. Gloo Edge discovery is an awesome tool to get you going.
It automatically discovers all the Kubernetes resources within your cluster and creates Upstreams for them. But what we found was that it's not super production-ready. All of our other resources are created using Argo CD, and when we had these automatically discovered Upstreams alongside other resources created through Argo, there were conflicts, and there was also a resource-usage bug in discovery. So Solo recommended that we not use it in production.
So we switched over to creating the Gloo Edge Upstreams ourselves, again using Argo CD and YAML, and that's worked well for us.
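As a sketch of what that looks like, a statically defined Upstream committed to Git and applied by Argo CD can stand in for a discovered one. The service and namespace names below are hypothetical.

```yaml
# Hypothetical static Upstream, committed to Git and synced by Argo CD instead
# of being auto-created by Gloo Edge discovery. Names are illustrative.
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: listings-svc
  namespace: gloo-system
spec:
  kube:
    serviceName: listings        # the Kubernetes Service to route to
    serviceNamespace: listings
    servicePort: 8080
```

With Upstreams managed this way, discovery can be scaled back or disabled; the exact Helm or Settings values for doing so should be checked against the installed Gloo Edge version.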
The next lesson was Deployments versus DaemonSets. Gloo can be deployed in a couple of different configurations: as a Kubernetes Deployment or as a DaemonSet.
C
We
were
running
it
as
a
payment
set,
particularly
for
so
that
we
could
run
it
on
specific
instances
and
did
not
have
that
bogged
down
by
other
resources
running
audit
so
that
the
the
the
ingress
gateways
or
the
ingress
pieces
isolated
from
the
rest
of
the
cluster
and
and
won't
be
a
cause
of
failure.
So
we
we
went
down
the
same
route
with
blue
edge
and
reviewed
it
solo
and
they
they
recommended.
We
stick
with
daemon
sets
with
our
current
architecture.
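For reference, the Gloo Edge Helm chart lets you choose how the proxy is deployed. A values sketch along these lines would pin the proxies to dedicated ingress nodes as a DaemonSet rather than a Deployment; the key names and node label here are assumptions to verify against the chart version in use.

```yaml
# Hypothetical Helm values sketch: run the gateway proxy as a DaemonSet pinned
# to dedicated ingress nodes. Key names are assumptions; verify against the
# Gloo Edge chart docs for the version in use.
gatewayProxies:
  gatewayProxy:
    kind:
      daemonSet:
        hostPort: true               # bind on each ingress node (assumption)
    podTemplate:
      nodeSelector:
        node-role/ingress: "true"    # illustrative label for dedicated nodes
```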
C
Next
live
view,
smart
and
so
what's
next
so
glue
ads
like
I
talked
before,
we
went
to
where
we
started
using
blue
edge
for
all
of
these
api
gateway
type
features
which
we
are
starting
to
explore
right
now,
like
one
of
the
things
that
our
teams
are
really
excited
about,
is
the
blue
edge
and
the
lambda
integration,
so
that
teams
don't
are
not
isolated
into
only
running
kubernetes
services
behind
lued.
C
So
we
can
just
use
the
same
api
gateway
to
also
front
lambda
applications
and
team
teams
have
that
flexibility
so
which
is
awesome
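A rough sketch of what that integration looks like, with made-up function, secret, and region values: an AWS-type Upstream lists Lambda functions that routes can then target.

```yaml
# Hypothetical sketch of the Gloo Edge AWS Lambda integration. Function, secret,
# and region values are placeholders, not CARFAX's configuration.
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: lambda-us-east-1
  namespace: gloo-system
spec:
  aws:
    region: us-east-1
    secretRef:                       # Kubernetes secret holding AWS credentials
      name: aws-creds
      namespace: gloo-system
    lambdaFunctions:
      - logicalName: report-preview
        lambdaFunctionName: report-preview
        qualifier: "$LATEST"
```

A VirtualService route can then point at this upstream with an AWS destinationSpec that names the logical function, so the same gateway that fronts Kubernetes services also fronts Lambda.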
Another thing we're really excited about Gloo Edge having is the OIDC integration, where we can add authentication to any page we want without really having to make any app changes. You just configure OIDC with an SSO provider, and voila, your page is authenticated.
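That capability comes from Gloo Edge Enterprise's external auth. A minimal sketch, with a hypothetical issuer, client, and app URL rather than CARFAX's real settings, looks roughly like this AuthConfig:

```yaml
# Hypothetical sketch of OIDC authentication with Gloo Edge Enterprise ext-auth.
# Issuer, client, and URLs are placeholders, not CARFAX's real configuration.
apiVersion: enterprise.gloo.solo.io/v1
kind: AuthConfig
metadata:
  name: sso
  namespace: gloo-system
spec:
  configs:
    - oauth2:
        oidcAuthorizationCode:
          appUrl: https://www.example.com
          callbackPath: /callback
          clientId: my-client-id
          clientSecretRef:           # Kubernetes secret with the client secret
            name: oidc-client-secret
            namespace: gloo-system
          issuerUrl: https://sso.example.com/
          scopes:
            - email
```

A VirtualHost or individual route then opts in by referencing this AuthConfig in its ext-auth options, which is why no application changes are needed.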
So that's awesome. The last thing we're working on, or hoping to work on, this year is Gloo Mesh. One of the things that's been missing from our environment is a proper service mesh. Towards the end of last year we evaluated a few different options; we looked at Linkerd, AWS App Mesh, and Istio, and we felt Istio was our best choice, but we definitely felt that Istio is a big beast to tackle by itself.
So when Gloo Mesh came out, that was really exciting for us, because it abstracts away a lot of the management piece for Istio, so we're really excited to try and get that out this year.