Kong Kong Summit 2019 Sessions, 20 Nov 2019

From YouTube: Operation and Custom Plugins for High Availability

Description

When a cluster crashes and is not able to function normally, how do you fail over requests from the failed cluster to another healthy Kong cluster to reduce downtime as much as possible and ensure an optimal user experience? In this Kong Summit 2019 session, Yahoo! Japan Site Reliability Engineer Jun Cui explains how organizations can deploy huge Kong clusters and operate them for high availability, using Yahoo! Japan as a case example.

A

Hello, everyone. I'm Jun from Yahoo Japan Corporation. It's my honor to be here and give a presentation to all of you. The topic of my presentation is operation and custom plugins for high availability at Yahoo Japan.

A

First, please let me introduce myself. I'm a site reliability engineer at Yahoo Japan, responsible for the development and maintenance of a multi-tenant API gateway and for supporting API gateway users company-wide. Currently, our team is working on internal and external API gateway platforms using Kong. Thank you.

A

My hobby is playing games, such as PlayStation 4 games and Nintendo Switch games. Recently, I'm really into Monster Hunter World: Iceborne. The screenshot shows me delivering the final blow to a monster.

A

So here is today's agenda. First, I will explain the usage status at Yahoo Japan and talk about how we use Kong, followed by how we use Ansible to automatically deploy Kong servers, and then share some custom plugins that we developed for company-wide users.

A

Finally, I will talk about our issues and the future work.

A

So let's look at the Kong usage status at Yahoo Japan. Yahoo Japan is one of the biggest portal sites in Japan. We offer over 100 web-based services, including search engine, auction, news, weather, sports, email, shopping, and so on, and this number is growing.

A

In order to handle those services, more than 150,000 bare-metal servers are running 24 hours a day. Note that they are all bare metal, not virtual. These 150,000-plus servers process more than 17 4 billion page views per month (2018). As you can see, Yahoo Japan is a very large site, so we provide an API gateway platform with Kong Enterprise Edition. Here is the timeline of the API gateway platform.

A

On the first day of October last year, we released the internal and external API gateways in the production environment. We also updated Kong two times, in March and May this year. Currently, we have finished updating the gateway to 0.36-1, last week.

A

Here are the configurations that have been created: more than 220 workspaces, 519 services, and 690 routes had been created in one cluster as of last month. The requests per second is more than 7,000 on average, and it can exceed 12,000 on one cluster. OK, I think you have caught the scale of Kong at Yahoo Japan from what I said earlier.

A

Next, I'd like to talk about how we use Kong and what role Kong plays at Yahoo Japan. We offer two Kong clusters as an API gateway platform for back-end developers. The GSLB, which is a global server load balancer, is network equipment that distributes network traffic to a group of data centers in different locations.

A

In this case, back-end developers can choose Kong clusters depending on their availability requirements. For example, some services are provided in the East data center only, while some other services are provided in both the East and West data centers.

A

We also provide two Kong clusters for front-end developers. The front-end developers offer Yahoo Japan services to external clients, which are the end users, and Kong proxies external requests to internal front-end servers.

A

So this is the overview of the architecture at Yahoo Japan. Our team provides four Kong clusters to company-wide developers, which contain 240 Kong nodes in total. Access from external clients is received by the GSLB first, and the GSLB distributes requests to the Kong for front end, which is located in the East and West data centers.

A

At this time, the GSLB selects the Kong cluster whose data center is closer to the external client, and the request is received by the selected Kong cluster. A request could also be received by a Kong cluster directly. Actually, it depends on the configuration by developers.

A

I also want to show how we fail over requests from a failed server to another healthy server with the active-standby plugin, and I will talk about active standby in detail in the custom plugins part. The Kong cluster at the East data center proxies requests to a front-end server in general, and this front-end server is also at the East data center. Kong checks the upstream's health on a regular basis. If one of the front-end servers is not able to function normally, the active-standby plugin can fail over the requests to the other healthy front-end server, which is at the West data center.

A

The Kong for back end can also do the same thing. So in this case we can continue offering our services even if one of the front-end or back-end servers has crashed. Of course, proxying returns to the original front-end or back-end server after the failed server has recovered.
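
As a rough illustration of checking upstream health "on a regular basis", the Python sketch below marks a server unhealthy after a run of failed probes and healthy again after a run of successful ones. The class name and both thresholds are assumptions made for the example; the talk does not state how Yahoo Japan configures its health checks.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks one upstream server's health from periodic probe results."""
    unhealthy_after: int = 3   # consecutive failures before marking unhealthy (assumed)
    healthy_after: int = 2     # consecutive successes before marking healthy again (assumed)
    healthy: bool = True
    _failures: int = 0
    _successes: int = 0

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result and return the current health state."""
        if probe_ok:
            self._successes += 1
            self._failures = 0
            if not self.healthy and self._successes >= self.healthy_after:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self.healthy and self._failures >= self.unhealthy_after:
                self.healthy = False
        return self.healthy


# One tracker per upstream server; probe results would come from a periodic check.
east = HealthTracker()
states = [east.record(ok) for ok in [True, False, False, False, True, True]]
print(states)
```

Requiring several consecutive results before flipping the state keeps a single slow response from triggering a failover, at the cost of reacting a little later to a real outage.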

A

So in conclusion, we have 240 nodes in total for the Kong clusters. Here comes the question: how do you organize and deploy so many nodes automatically?

A

We use Ansible to solve this problem, so in this section I will talk about deploying Kong with Ansible. I'm not sure Ansible is the best way to deploy Kong, so if you have better solutions, I'd like to discuss them with you after the presentation.

A

First, let me briefly introduce the complicated Yahoo Japan internal network. If we want to log in to a production server, we first need to access a springboard server using a one-time password, and then access the production server from the springboard. So when we deploy Kong, we need to access every Kong node from a local PC, install the necessary packages, and do the environment settings. As you remember, we have 240 nodes, so we use Ansible to implement automatic deployment.

A

Ansible is an IT automation tool. It can configure systems and deploy software using Ansible playbooks. A playbook describes a set of steps in a general IT process, such as package installation and environment settings. We put the playbook on Git, but the thing is, we can't use Git in production environments directly. So we have to package the source code into RPM packages and publish them to Artifactory using Screwdriver. Then we can install the RPM package via yum.

A

When the playbook is updated, the new playbook is published to Artifactory by Screwdriver. We SSH to the Ansible deploy server and install the latest playbook, then execute the Ansible playbook, and it helps us deploy Kong to multiple production servers without any further authentication.

A

This is an example of setting the group variables configuration of Ansible. The original Kong config file is updated with what we configure here, such as setting the Kong log level to error, and the custom plugins listed here are installed by Ansible. What's more, we can also arrange SSL certificates and other required packages, so all of them can be deployed with Ansible.
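
The slide itself is not reproduced in this transcript, so the following Python sketch only illustrates the idea of overriding entries in the original Kong config file from a set of variables. `log_level` and `plugins` are real `kong.conf` properties, but the variable values, the plugin names, and the `apply_overrides` helper are assumptions made for the example, not the team's actual playbook.

```python
def apply_overrides(conf_text: str, overrides: dict[str, str]) -> str:
    """Return conf_text with each active `key = value` line replaced by its override.

    Keys that are missing (or only present as comments) are appended at the end.
    """
    remaining = dict(overrides)
    out = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        key = stripped.split("=", 1)[0].strip() if "=" in stripped else ""
        if not stripped.startswith("#") and key in remaining:
            out.append(f"{key} = {remaining.pop(key)}")
        else:
            out.append(line)
    out.extend(f"{key} = {value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


# Hypothetical group variables; the custom plugin names are placeholders.
group_vars = {
    "log_level": "error",
    "plugins": "bundled,active-standby,sorry-page",
}

original = "#log_level = notice\nlog_level = notice\nproxy_listen = 0.0.0.0:8000\n"
print(apply_overrides(original, group_vars))
```

In a real playbook this job would more likely be done by a template or a line-editing task; the function above just makes the "original config updated by what we configured here" step concrete.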

A

So now I would like to share three custom plugins.

A

The first one is traffic abuse prevention. This plugin sends the required information from the consumer to the traffic abuse prevention server and checks whether the request meets the requirements. For example, if a developer decided that each user could access only 10 times in one minute, then from the eleventh access onward, requests would be denied by this plugin, so it can protect against DoS attacks.

A

Perhaps you think this function is the same as rate limiting. Yes, they are similar, but the rate-limiting plugin does not satisfy our requirements.

A

We need to check by Yahoo ID, User-Agent header, access URI, or other attributes, whereas the rate-limiting plugin can check only by IP, header, and so on.
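
To make the "10 times in one minute" example concrete, here is a minimal Python sketch of a fixed-window counter whose key is built from whichever request attributes a developer chooses. It counts in local memory, whereas the plugin described in the talk sends the information to a separate traffic abuse prevention server; the class name, field names, and request shape are all assumptions.

```python
from collections import defaultdict


class AttributeLimiter:
    """Fixed-window counter keyed on arbitrary request attributes."""

    def __init__(self, key_fields, limit=10, window_seconds=60):
        self.key_fields = tuple(key_fields)
        self.limit = limit
        self.window_seconds = window_seconds
        # Counts per (window, attribute values); old windows are never purged here.
        self._counts = defaultdict(int)

    def allow(self, request: dict, now: float) -> bool:
        """Return True if the request is within the limit for its key."""
        window = int(now // self.window_seconds)
        key = (window,) + tuple(request.get(f) for f in self.key_fields)
        self._counts[key] += 1
        return self._counts[key] <= self.limit


# Limit per (Yahoo ID, URI) pair instead of per IP address.
limiter = AttributeLimiter(key_fields=["yahoo_id", "uri"])
request = {"yahoo_id": "user-1", "user_agent": "curl", "uri": "/v1/items"}
results = [limiter.allow(request, now=5.0 + i) for i in range(11)]
print(results.count(True), results[-1])
```

Eleven calls inside the same minute yield ten allowed requests and one denial, matching the example in the talk. Changing `key_fields` to `["user_agent"]` would limit per User-Agent instead.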

A

The second custom plugin is sorry page.

A

This figure shows Kong's typical behavior in the normal case. If all the proxy destination nodes become unhealthy, Kong blocks traffic to the upstream and returns a fixed response immediately without proxying the request. The sorry page plugin can respond to end users with customized content and a customized status code instead of the fixed response, and developers can customize that for each endpoint.
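
The per-endpoint customization described here can be sketched in Python as a lookup from endpoint prefix to a custom response, with a fixed response as the fallback. The paths, bodies, status codes, and names below are invented for illustration and are not taken from the plugin.

```python
from typing import NamedTuple


class Response(NamedTuple):
    status: int
    body: str


# Stand-in for the gateway's fixed response when no upstream node is healthy.
DEFAULT_UNAVAILABLE = Response(503, "Service unavailable")

# Hypothetical per-endpoint settings a developer might register.
SORRY_PAGES = {
    "/shopping": Response(503, "<h1>Shopping is under maintenance</h1>"),
    "/news": Response(200, "<h1>News will be back shortly</h1>"),
}


def respond_when_unhealthy(path: str) -> Response:
    """Pick the sorry page whose endpoint prefix matches the path, longest first."""
    for prefix in sorted(SORRY_PAGES, key=len, reverse=True):
        if path == prefix or path.startswith(prefix + "/"):
            return SORRY_PAGES[prefix]
    return DEFAULT_UNAVAILABLE


print(respond_when_unhealthy("/news/today"))
print(respond_when_unhealthy("/weather"))
```

Matching on whole path segments keeps `/newsletter` from accidentally picking up the `/news` page.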

A

The third plugin is active standby. Kong proxies requests to the API nodes of the first cluster in the normal case. If all the proxy destination API nodes become unhealthy, the active-standby plugin can switch all requests to the API nodes of the second cluster. Of course, after one of the API nodes of the first cluster has recovered, proxying returns to the first cluster. We also thought about whether this function could be implemented with canary release when we developed this plugin.
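
The switching rule just described (use the first cluster while any of its nodes is healthy, otherwise use the second, and return as soon as one node recovers) fits in a few lines of Python. The function and node names are hypothetical, and health states are passed in as plain booleans rather than coming from real health checks.

```python
def choose_cluster(active_nodes: dict[str, bool], standby_nodes: dict[str, bool]):
    """Return (cluster_name, healthy_node_list) for the next request.

    The active cluster is used while at least one of its nodes is healthy;
    otherwise traffic goes to the standby cluster.
    """
    healthy_active = [n for n, ok in active_nodes.items() if ok]
    if healthy_active:
        return "active", healthy_active
    healthy_standby = [n for n, ok in standby_nodes.items() if ok]
    return "standby", healthy_standby


east = {"east-api-1": True, "east-api-2": True}
west = {"west-api-1": True}

normal = choose_cluster(east, west)          # all first-cluster nodes healthy
east.update({"east-api-1": False, "east-api-2": False})
failed_over = choose_cluster(east, west)     # every first-cluster node is down
east["east-api-2"] = True
recovered = choose_cluster(east, west)       # one node is back, so traffic returns
print(normal, failed_over, recovered)
```

Because the decision is recomputed for every request from current health, no separate "fail back" step is needed in this sketch.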

A

Here is a comparison of the configuration required by canary and by active standby. As you can see, canary has many more parameters that need to be filled in, so users spend more time reading and understanding documents to implement active-standby functionality with it. The active-standby plugin has only one parameter, so it is easier to understand and simpler to use than canary. But this plugin is for active standby only, so its functionality is not as rich as canary's.

A

Finally, I will talk about our future work. As future work on the API gateway platform, there are four major tasks to overcome our current issues. The first issue is that developers need to see the metrics of their own services. We are going to provide a Grafana service to developers using Kong Vitals.

A

One more thing: actually, two months ago someone posted about Yahoo Japan services on Twitter. The front-end server had responded with an internal error message directly to the end user, which was a Kong error. So he or she tweeted, "Hey Yahoo, what is Kong error? Who is gorilla?" So the second issue is that displaying a Kong error to end users is not expected, especially for front-end developers, and we will develop a custom plugin for front-end developers to show a customized page to end users instead of the Kong error.

A

The last issue is deployment. It is still a big workload for our team, because we still need to SSH to the Ansible deploy server and install the latest playbook, and we can only deploy 15 nodes at most at one time, so we need to do the same operation more than 16 times to deploy all nodes.
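
The arithmetic behind this workload: with 240 nodes and at most 15 nodes per run, a full rollout needs at least 240 / 15 = 16 runs. A short Python sketch of that batching follows; the host names are hypothetical, and only the two figures come from the talk.

```python
import math

TOTAL_NODES = 240   # total Kong nodes, from the talk
MAX_PER_RUN = 15    # most nodes deployable in one run, from the talk


def batches(nodes, size):
    """Split the node list into consecutive groups of at most `size`."""
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


nodes = [f"kong-{i:03d}" for i in range(1, TOTAL_NODES + 1)]  # hypothetical host names
runs = batches(nodes, MAX_PER_RUN)
print(len(runs), math.ceil(TOTAL_NODES / MAX_PER_RUN))
```

Any retry of a failed batch adds to that minimum, which is one reason repeating the operation by hand becomes costly.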

A

So we plan to implement automatic deployment. The ideal form is that we only need to manage the source code on Git. After we update the master branch, Screwdriver would help us finish everything, which can reduce maintenance cost.

A

Finally, we are planning to increase the number of Kong nodes to improve performance in the future.

A

Sorry, one more thing: I would also like to show you our metrics data on Grafana.

A

So here is the data for the last 24 hours. The peak requests per second is about 15 50,000.

A

So, is there anything you want to see?

A

Also, our senior manager and technical leader are here, so if you have any questions, we can discuss them now.

A

Yeah, Cassandra, yes.

A

Ten nodes. So, ten nodes for one Kong cluster.

A

Cool. Thank you.