15 Nov 2022
Walkthrough and retrospective of the work that the Scalability:Projections team did to migrate Redis rate limiting from running on VMs to running in Kubernetes.
- 3 participants
- 22 minutes
18 Mar 2022
As part of our investigation into a WAL archiving saturation incident (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6581), we held an ad hoc profiling session that doubled as a general introduction to CPU profiling.
Participants:
- Matt Smiley
- Igor Wiedler
- Alexander Sosna
- Biren Shah
- 4 participants
- 38 minutes
17 Mar 2022
Liam and Marin discuss the creation of a self-service platform in Infrastructure and how this aligns with the existing design platform.
- 2 participants
- 21 minutes
5 Oct 2021
Links:
- Documentation: https://docs.gitlab.com/ee/development/stage_group_dashboards.html
- All dashboards: https://dashboards.gitlab.net/dashboards/f/stage-groups/stage-groups (Private)
- Public (Limited to 7 days) GitLab SLA dashboard: https://dashboards.gitlab.com/d/general-slas/general-slas?orgId=1
- 1 participant
- 3 minutes
14 Jul 2021
Stan, Matt, Andrew, Jason, Marin and others discuss some corrective actions following on from a production incident: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5158
- 7 participants
- 42 minutes
22 Jun 2021
A 2-minute video showing budget attribution for the Purchase group.
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1114
- 1 participant
- 2 minutes
13 May 2021
APAC Scalability team demo: Quang-Minh shows Sidekiq routing rules in Omnibus and the Helm charts, and compression of Sidekiq payloads.
- 2 participants
- 20 minutes
12 May 2021
- 6 participants
- 30 minutes
5 Apr 2021
A quick run-through of the Redis Sidekiq scalability test harness from https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/956
It is crude, but has given us good numbers.
- 1 participant
- 13 minutes
1 Apr 2021
See also Craig's demo: https://www.youtube.com/watch?v=NuamleKHRDA
- 6 participants
- 51 minutes
23 Dec 2020
The group dashboards project we're currently working on:
- https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/360
Grafana folder where these dashboards are stored:
- https://dashboards.gitlab.net/dashboards/f/stage-groups/stage-groups
- 1 participant
- 3 minutes
16 Nov 2020
An introduction to the Infrastructure team division and main responsibilities at GitLab.
Official Team Structure documentation: https://about.gitlab.com/handbook/engineering/infrastructure/team/
More about Infrastructure at GitLab: https://about.gitlab.com/handbook/engineering/infrastructure/
- 1 participant
- 9 minutes
2 Oct 2020
Andrew shows Bob how we automatically generate recording rules from high-cardinality metrics, and how to include a new feature_category label in them.
- 2 participants
- 38 minutes
18 Aug 2020
A quick demo of https://gitlab.com/gitlab-com/runbooks/-/merge_requests/2684, which allows SREs to quickly navigate between different observability systems, such as Kibana, BigQuery, Stackdriver and Sentry. The aim is to reduce the MTTD for incidents, helping to drive up the availability of GitLab.com.
- 1 participant
- 5 minutes
12 Aug 2020
A chat about https://gitlab.com/groups/gitlab-org/-/epics/3980
- 3 participants
- 26 minutes
13 May 2020
A demo of the new connection pool metrics recorded in GitLab (https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/153), recorded because the setup needed to test this is quite involved.
- 1 participant
- 3 minutes
22 Apr 2020
Discussion follows on from the issue https://gitlab.com/gitlab-com/www-gitlab-com/-/issues/7201 "Prepare on OKR for improving SpeedIndex on a benchmark of URLs compared to similar URLs on GitHub"
-----------------------------------------------------------
@andr3 (the Scalability Team's frontend counterpart) and I had a great call about this topic today: https://youtu.be/e2iccdgrY5s
Some points from the call:
1. Optimisations to GitLab.com's SpeedIndex benchmark would mostly fall on frontend teams
1. Breaking our JavaScript bundles down into smaller components to reduce compile times
1. Ensuring that JavaScript bundles are effectively cached between releases (i.e., a production deploy doesn't invalidate the cache)
1. Is there more performance that we can squeeze out by taking advantage of our new Cloudflare setup?
1. Is a target SpeedIndex of 1000 a reasonable goal? @andr3 thinks it's possible
1. Server-side rendering for GitLab: https://gitlab.com/gitlab-org/gitlab/-/issues/215365
1. Measuring frontend performance and bringing it into Prometheus
- 2 participants
- 46 minutes
2 Mar 2020
- 4 participants
- 35 minutes
26 Feb 2020
Discussion between Grant & Jason, related to the self-managed scalability workgroup's design of reference architecture using the Cloud Native GitLab Helm charts.
We covered things like:
- Why *not* Omnibus in Kubernetes
- Separation of components by concern within the Helm charts
- Scaling workloads vertically and/or horizontally
- Pre-scaling to a minimum of 50% or more of expected load, and to a maximum of 110% (straight to 100% for tests)
- 2 participants
- 54 minutes
20 Feb 2020
A demo of https://gitlab.com/gitlab-com/runbooks/-/merge_requests/1930 which automatically generates Kibana Searches and Visualizations from Grafana, using Jsonnet and the Grafonnet library.
- 1 participant
- 4 minutes
1 Jan 2020
Speaker: Andrew Newdigate
GitLab.com’s monolithic Rails application experiences high week-on-week traffic growth. To ensure availability, GitLab’s Infrastructure team track and plan ahead in order to avoid hitting capacity limits in the application, whether these limits be CPU, database connection pools, memory, storage or any number of other finite resources. Hitting these limits could result in hours, or days, of degraded service while workarounds are put in place. With this in mind, the team set about building a set of tools on top of Prometheus recording rules and alerts to provide them with the information they need to be sufficiently forewarned, up to a month in advance, of potential resource saturation issues. If you’ve ever felt that you’re reactively responding to resource saturation issues, this session will provide practical examples of how we’re building a framework for resource planning into our SRE team workflow. We’ll be presenting our open-source solution and explaining how it works for us.
Slides: https://promcon.io/2019-munich/slides/practical-capacity-planning-using-prometheus.pdf
- 7 participants
- 28 minutes
4 Dec 2019
Hordur and Andrew discuss how AutoDevOps can be better monitored using the key metrics framework used for monitoring the components of GitLab.com.
This follows on from an outage in the feature: https://gitlab.com/gitlab-org/configure/general/issues/9
- 2 participants
- 50 minutes
25 Nov 2019
A really quick video that demonstrates how to use the Grafana Explore user interface to drill down into the visualisations in Grafana for deeper ad hoc analysis.
- 1 participant
- 1 minute
15 Nov 2019
Andrew takes Marin through GitLab.com's SLO framework.
Some topics covered include:
* Symptom-based Alerting vs. Cause-based Alerting, RED Method Monitoring, USE Method Monitoring
* How we calculate the SLI, SLA, SLO for each service
* How to use our Grafana graphs to visualise the SLA trend for each service
- 3 participants
- 34 minutes