Ceph Cephalocon Barcelona 2019, 24 May 2019

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Keynote: State of the Cephalopod - Sage Weil, Co-Creator, Chief Architect & Ceph Project Leader

Description

Keynote: State of the Cephalopod - Sage Weil, Co-Creator, Chief Architect & Ceph Project Leader, Red Hat

A welcome to Cephalocon Barcelona, and an update from the Ceph project leader on recent developments, current priorities, and other activity in the Ceph community.

About Sage Weil
Red Hat
Ceph Project Leader
Madison, WI, USA
Twitter Tweet
Sage helped build the initial prototype of Ceph at the University of California, Santa Cruz as part of his graduate thesis. Since then he has led the open source project with the goal of bringing a reliable, robust, scalable, and high performance storage system to the free software community.

A

Welcome everyone. Thank you for coming. Welcome to Barcelona welcome to cephalic on. We are delighted to have you here. My name is sage, while I work for Red Hat I'm, the Ceph project lead and I'm delighted to have everybody here today, I'm just like about staff.

A

So the first thing a little bit of bookkeeping I want to thank all of the sponsors that have made today's event possible and so I'd like to thank especially the platinum sponsors, Intel, Red, Hat, soft iron and Sousa, also, the gold sponsors which include my Super Micro, Western, Digital and ZTE. We also had two silver sponsors digitalocean and a boom, and also start-up sponsors, embedded and, um and so first I just want to thank all of the sponsors for making this event possible, give them a quick round. Thank you.

A

I also want to say thank the the Ceph foundation, which also put in a lot of effort into planning and organizing this event and also supporting it and the stuff event or the Ceph foundation is a new organization that we just launched about six months ago with thirty-one member organizations with the mission of supporting the Ceph project and the staff community and we're delighted to finally have sort of a cohesive set organization to support the project. The foundation was launched with 13 premiere members.

A

Many of them are sponsors here today, also 10 general members and also eight associate members, so the associate members being nonprofit or governmental institutions that are engaged in the Ceph community, that I've joined the foundation and also since the foundation launched six months ago, we've had three new members join as well, two more general members and also the University of Michigan as an associate member. So we're very excited to have the stuff foundation now to help bring the community together and to support all that its activities and supporting the cephalic on event.

A

I also want to say thank the Linux Foundation, both for their help in building and launching the Ceph foundation and also specifically in putting together this event to this wouldn't've been possible without the efforts of all the people involved with Linux Foundation in planning this event. So thank you to all of them as well.

A

And we're almost done with all I think use the program committee about a dozen people from five different companies that reviewed all the submissions for cephalic on to choose the talk, so you'll be hearing over the next few days. Thank you to them, and most of all, I just want to say, welcome, I'm, so happy that all of you come I'm gonna like to thank you for joining us here in Barcelona today. So, as most of you probably know, this is not the first steffel con.

A

This is a second cephalic on event that we've done the first one. The inaugural cephalic on event was last year in Beijing, and it was. It was amazing. It was an amazing event, especially for coming on and doing sort of the first big multi-day stuff event, so we're very excited they took this awesome picture during the keynote of the whole whole crowd. We'd like to continue that tradition and do a picture today as well of everybody here.

A

So this is just your heads-up warning at the end of my slides in about 20-30 minutes, we're gonna do a shot into the audience. So, for some reason you don't want to be part of that photo. You can sneak out when I stop talking we're very interested in your feedback. What you think about the conference, what you liked, what you didn't like!

A

So there's an ether pad that you can go to with the long and short URL or, if you've, any questions you can just flag down, one of the Linux Foundation event staff or any of us and give us feedback. You can you can text this number anonymously? You can email Mike and there are lots of different people, ways that you can.

A

Let us know what you think of the event and in particular we're planning in tomorrow morning's keynotes to do a sort of a town, hall, Q&A type event and we're very interested in what kinds of questions you guys have for for us for the for the course F developers and and so on. So if you want to put any of those questions on that etherpad, we can ask them in sort of a town hall, Town, Hall style format event tomorrow morning, as well.

A

They're also going to be some birds of a feather sessions this evening after the lightning talks. So there's a white board outside. If you want to sign up but down topics, we have the this space well into the evening, so as much time in space as we need to discuss whatever we think is interesting. I also want to point out that, and the cephalic on has been sort of a sort of a yearly event as the cadence for establishing, but we also have staff days that happen much more regularly. These are more regional events.

A

There's single day events and they're spread all over the world. The next few coming up are in the Netherlands in July, there's going to be one at CERN in September and there's gonna be one in Poland and October and others are being discussed in plans. So this isn't the only way to find out more about staff and connect with the community, and there are also set days and they're, also informal, meetups and so on. So anyway, that's sort of all the administrivia welcome.

A

So it feels like the the first thing in any Ceph event or staff talk is to answer the basic question. What is SEF- and that usually involves me dragging up this slide- that probably everybody in the audience has seen before stuff is a unified storage system.

A

It provides object, storage, that's s3, compatible and swift compatible via the rgw component, provides block storage for virtual disks, VR BD and a distributed file system, vs ffs, all of which are backed by a single storage platform called ray dos that handles all the reliability and data distribution, and so on so I've been showing this slide and various forms of it for a long time, but it when I dragged it into this presentation. I took a little bit of a trip down memory.

A

Lane got a bit nostalgic and went back and looked at all of my old presentations that I had done since the beginning and I thought thought I'd share a few earlier versions of this slide. This is the very first Ceph talk that I ever gave, which was at OSD I in 2004, when the original staff paper was published, no 2006 I.

A

Guess that's right there, and at that point we described SEF as a reliable high-performance distributed file system with excellent scalability, which is really reminder that SEF began not as this unified, all-purpose storage platform, but really as a distributed file system and the choices that we made were focused really on that that scalability, but I think we were lucky and that we made a series of design choices that made it a general-purpose and reusable platform that made it possible to also implement other protocols as well and over the next several years.

A

Sap was open source and we sort of built that flooding community. We did that in 2011 the slides looked something like this: they talked about the file system, RBD the block device, raitis gateway for object, storage and also liberate us. These sort of low-level access to that underlying object, store that that underpins at all, which is a real step forward.

A

At this point, sort of the overall story of stuff was complete, but the slides got better in 2012, I started stacking little blocks and simplifying the messaging in 2012 when we launched in tank sort of the first startup that really I think made Seth hit an upward trajectory and we had new branding.

A

We had our new color scheme, which we're still mostly using today that evolved a bit and sort of the corporate ink tank branding was a little bit different and slightly simplified and eventually got simplified a bit until we ended up with what we have today. What's less of a word salad, but that's really talking about what's F is technically. There are a lot of sort of key phrases that you would use to summarize what it is.

A

You know it's a it's: a unified storage system with file blocking object, it's a storage platform, software-defined storage, which is a big term a couple years ago. Unless so today, as part of the branding exercise, we always like to call it the future of storage and some people refer to it. As the Linux of storage, sort of a good metaphor, analogy, I guess I don't mind but I think a more interesting question to ask: isn't necessarily what stuff is technically what it does, but why it exists.

A

What's what's its purpose and what motivates the engineers and users in the community around it.

A

Excuse me and I think there are a couple key ideas that I want to get across. The first and foremost is that Ceph is free and open source software. That means that the software is free to download is free to use, and that means that it's available on the set of features that used to be enterprise features are available to any organization, even with modest means, and that in turn gives you freedom from vendor lock-in, because it's open source, you have the freedom to innovate. So you can look at the source code.

A

You can figure out how it works. What it's doing you can fix bugs make improvements, and you can share those improvements with everyone else and build that whole community around the project.

A

Excuse me, all right, second of all, stuff is reliable. It exists to create a reliable, durable storage service out of fundamentally unreliable components. That means that has no single point of failure and we distribute data using replication and racial coding so that the loss of a single storage device won't compromise your data.

A

But the other piece of that is that, as a rule, we generally favor consistency and correctness over performance when you're building a distributed system is really easy to cut corners and make things go really fast, but where we want to build something that organizations and people can rely on. That is not going to lose your data, and so we're always going to take the path. That's going to make sure that your data is safe and you're, going to get a correct result from the system and third Ceph is designed to be scalable.

A

It's built to create a bright and elastic infrastructure that allows you, your storage system to grow or to shrink, as your storage needs change over time. It allows you to add or remove hardware from the cluster while the system is online and being used, and it allows you to do online rolling software upgrades so that you can put the system in a production environment with real workloads and make those transitions to newer versions. As you move forward and we scale in several different ways, we can scale up to leverage bigger, faster hardware.

A

We can scale out within a single cluster or site by adding additional nodes, and we also introduce features that allow you to federated lester's together and replicate across clusters. So you can scale with the your organization and your multi datacenter footprint, so it's cephalic on in in Beijing I presented when I was doing the keynote they're presented the the four sort of key priorities that were motivating the developments within the Ceph community and those were usability and management performance, integrating with the container ecosystem and multi-site and hybrid cloud features.

A

So I'm gonna talk a little bit about our latest release, Nautilus and what's new and Nautilus in the context of those four priorities that we set out a year ago. So the sub-project releases new versions of Ceph every nine months. We have a sort of a new release scheduled cadence, we back port for two releases and you can upgrade up to two releases at a time. So Nautilus came out in February or actually March, because we were a couple weeks late, but we meant to come out in February and we're working towards our Octopus release.

A

Incidentally, this this nine month release schedule is something that we decided a couple years ago, but were is an active point of discussion within the community, whether we want to continue with that that nine month schedule, which means that on average, you have to upgrade every 18 months or where they want to expand that to a 12 month schedule one year, so you upgrade every two years, so we're very interested in everybody's feedback and thoughts about that here at this conference.

A

So first effort, one of those key priorities was around easy to use and management, and this is really sort of trying to dispel the the the reputation that Seth gained and it's early days about being a complicated and hard to use system. And this is largely your own fault. Seth was built by system administrators. It was built to be scalable and fast and performing and flexible and all that stuff, but it wasn't.

A

We we really neglected and easy to use and making stuff simple to use, and so a lot of effort has gone in to staff over the last two or three years to to change that. So the biggest thing that's happening in Seth in this vein is the the new set dashboard it's built into Seth and it's awesome and that a couple sort of key things have happened in the dashboard space. The first thing is that there's a community convergence on a single dashboard implementation.

A

So before this there were some different self-management progress projects that all existed outside of Ceph. They were sort of bolted. On the side, our efforts were fragmented across several different vendors and communities, and so for the first time everybody is collaborating on a single implementation, and then implementation is part of the core ACEF product itself, which means that every time you install SEF the dashboard is there. You just have to basically turn it on and it's much more tightly integrated with the system itself, so that dashboard provides things like metrics and monitoring.

A

It provides basic management, functions, being able to create tools and manage devices all those sort of basic functions, and additionally, we're working towards being able to take sort of that last step and manage the hardware itself. So you can manage the deployment, expand.

A

The cluster replace disks, that sort of thing via the GUI and as part of that effort, sort of the second exciting piece that came together in Nautilus is what's called the orchestrator API or something we're affectionately, referring to as the orchestrator sandwich, and the idea here is that we've defined an abstract interface in inside the Ceph manager. Daemon that lets SEF reach out to the deployment tool, orchestration framework that was used to deploy SEF itself, so, whether that's rooks, F, ansible, deep sea based on salt or sort of a bare-bones as a sage Orchestrator.

A

We can reach out through this API, to deploy and manage the actual set demons that are comprising the SEF system, and so those abstract deployment functions include things like fetching: the node inventory, what disks and devices are available which nodes are participating, the cluster creating and destroying daemon instances that are deployed on those nodes and blinking LEDs on your storage enclosures. The goal, then is.

A

We can then build a unified, CLI or GUI experience that has a consistent set of commands to do things like list devices deploy no Steve's replace disks, add demons that are the same regardless of whether you're using ansible salt, rook, SSH or anything else. So Nautilus includes the basic framework and a partial implementation of this, and one of the key focus areas for octopus is to push this forward so that we can sort of have a complete end-to-end management solution.

A

um So a good. A good example of what this allows us to do is another new feature in Davos, which is managed SEF of s and FS gateways, and the idea here is to take a set of n FS Ganesha demons, which are NFS gateways that Reax ports ffs via NFS, for they can be run in an active-active configuration I'm. A lot of time was invested during the novice cycle and before actually to make these demons essentially stateless, so they store all their configuration.

A

Rados they're super easy to deploy, there's not a lot of complexity there and to make the clustered failover semantics correct, so that the NFS grace periods are handled correctly, but did the nice thing. The key thing here is that these Ganesha daemons are fully managed via the orchestrator interface.

A

So when you issue the SEF commands via the CLI, our dashboard to create mice, ffs subvolume and set it set up an NFS export, the deployment of those demons in configuration is fully automated via the rook interface at least, and there's and there's a blog about this on set calm and in fact, a talk at this conference.

A

I believe with more more detail on that one of the other exciting features in Nala was PG, auto scaling, so the PG num, the starting factor for pools, has historically been one of the bits of black magic that you need to know in order to configure SEF, there's limited and confusing guidance on what values supposed to choose. I. Think half of this room probably knows exactly what I'm talking about and the other half. If you don't know what PG gnomes are I think that's a case in point, because you really.

A

This is something that users really shouldn't need to know about or worry about. If they're managing a storage cluster, so in Nautilus we can finally reduce the PG non value, music and Nautilus. We can finally reduce the PG and M value as opposed to only increase it. So if you make a mistake, you can adjust it later, but, more importantly, there's the ability to automate the the management of that peach enum choice by the cluster itself, based on what your actual utilization of the system is.

A

So the system can look at how much data you're storing in the different pools and how many devices are available and it can basically choose what it thinks. The best value is either based on that usage or hints that you provide about how much data you're going to store and then the system can their issue warnings if it thinks that your current PG none choice is wrong or it can actually just automatically implement those choices.

A

So you can just take it totally: hands off hands off you, so for people who've complained about stuff being complicated and having all these weird knobs. This is the sort of the biggest one in the basket, and we finally knocked it off the list as far as making it possible to just simplify the use and deployment of subsystems.

A

One of the other exciting features in Nautilus is the device Health metrics capability. So all the demons and stuff now that consume raw storage devices, now inspect the hardware to find out what the specific physical devices that they're consuming reaching through all the device, mapper and other layers, and so they can identify the device by vendor model and serial number report all that up through system. So you can simply inspect what devices are being used. But on top of that, we've added the ability to do failure prediction so exporting this.

A

The PowerPoint didn't quite work, I apologize for that using Google slides so there the system can now do failure. Prediction based on health metrics that it's automatically scraping from all the devices. So, whether it's smart or the envy me equivalent, we can predict whether the device is going to fail either using a local mode, which is a pre trained model.

A

That's embedded in the SEF cluster so now have no external dependencies and we can predict it, but we think a device is going to fail soon or there's also a cloud mode that will reach out to a SAS service, to have a user more accurate to get a more accurate prediction about whether your devices are gonna fail and then the cluster can either issue health warnings that it thinks devices are gonna fail and you should do something about it or again.

A

You can have it automatically mitigate those issues by automatically marking those devices out preemptively, improving your overall reliability and data durability.

A

One of the last sort of items in this usability and management category is simplifying configuration management, so there's actually a mimic that we introduced the ability to centrally store all of the stuff configuration files in the south monitor database. So you didn't have to deal with managing the SAP comp files that are distributed across all your different nodes, but if we improve that significantly Nautilus and you can now manage that be of both the CLI and the GUI in that we've also improved the documentation of those options.

A

That's built into SEF itself, so you can use the CLI commands and the GUI tool tips or whatever to just see what all the configuration options do, which allows. Essentially, all these up options configuration options to be adjusted easily in real time without restarting demons and distributing clapping files around and so on. So also in that vein, we've added the several options to SEF that just simplifies general operations.

A

For example, you can now just tell the OSD demons how much memory they should use and they'll figure out what size of what caches they need to adjust in order to reach that memory target envelope. For those of us who are trying to tune stuff clusters and Numa nodes, that's vastly simplified used to have to do all this weird fiddling with Newman control, to pin OS D's to particular nodes.

A

But now you can easily inspect what knows what the attachment is for the network device and the storage device that OS T's are consuming, and if you want to pin a particular daemon, you can do that, be a single configuration option and similarly, on our BD, there is a lot of simplification and streamlining of the image options you can set them on a purple basis and RBD mirror, which is a multi cluster multi-site synchronization ability is now stateless.

A

So all the configuration is stored in the cluster just much easier to deploy into use and finally I'm. So the last item in this management usability category is a often requested feature that finally landed in all this called our beady top. So there's a whole bunch of instrumentation that went into ratos that allows us to sample the request stream and identify which clients are you doing the most iOS. This is only surfaced in one sort of only in the our Beauty case so far at least, but now you can very easily see which are BD.

A

Images are using up most of the eye ops in the cluster and, as we progress towards octopus, we're going to be servicing similar capabilities for seth a fast and rgw if we're a dos itself as well. So what we're really excited about? Finally, making this this step forward and making stuff cluster is easier to introspect and to manage.

A

So that's so that's usability, which is sort of our first theme and the second theme priority for a nautilus was around the container ecosystem, and this is really all about kubernetes, as evidenced by the fact that we're here in Barcelona adjacent to keep calm and the Ceph community cares about kubernetes in two different ways. First, we want to make sure that staff can provide storage to kubernetes because any scale out software application infrastructure is also going to need scale out storage to go with it.

A

But we're also interested in running stuff clusters in kubernetes, because kubernetes is increasingly being used as an infrastructure layer and using kubernetes to manage stuff itself, which is in itself a complicated distributed system, can simplify, set deployment and management.

A

So the way that we're doing this is with a project called rook, so the Ceph community is all in on rook as a robust operator force, F and kubernetes rook mates. It makes it extremely easy to get stuff up and running and deployed, and rook essentially fills two distinct roles. I'm. The first is that it intelligently manage your manages your staff daemons, so it can deploy SEF.

A

It can make intelligent decisions about managing your SEF deployment by when adding removing monitors to make sure it preserves quorum, scheduling all your stateless daemons across different nodes, all the things that are served tedious as a human operator working, do for you and just simplify your overall operating experience, but that second thing that rip does is: it provides a kubernetes way of interacting with your storage resources, so it has declarative, CR DS for deploying clusters, creating pools to creating file systems and so on.

A

It manages ass, SCSI plugin, which does all your persistent volume attachments for you and coming soon. It's also going to be able to dynamically provision. Object, storage, buckets in a similar sort of way. In fact, there's a talk on that here at this conference, so rook has an enthusiastic user community.

A

It's a CN CF project and they're going to be several rook talks here at this conference and in fact, tomorrow, there's going to be a rook, hands-on tutorial that you should check out if you're interested and if you're staying for Q con there rook talks there as well version one point: I was just released and so we're very excited to be supporting and working with the RIC community. But the whole role does not Cooper Nettie's.

A

A lot of people are going to be using containers without using kubernetes, and we want to support those users as well. I mean there a couple reasons why you would actually want to do this so deploying stuff demons inside containers has a couple of advantages. It means you can do granular per diem and upgrades. You can simplify reasoning about the the variance in this various district dependencies, different libraries that we use even the problem. We've had past bugs related to versions of TC, malloc and strange combinations with different operating system distributions. So using container images.

A

That's a sort of picker dependencies and test a single, a single software stack, so the stuff container project exists for this purpose, to create a standard upstream SEF container that all these surfer projects can use, whether it's rook or SEF, ansible or something else. Stephan's will in fact has the ability to deploy step clusters on bare metal in the usual way using system D, but it can also run those same demons inside docker containers, I'm using system units to drive them, and we plan to teach other orchestration tools to do something similar.

A

For example, the SSH Orchestrator, which is sort of going to be the moral successor of the stuff deploy tool via the orchestrator api, is going to be able to deploy, set daemons.

A

This way, this way as well, and so the last item inserted this container ecosystem category there's another kubernetes related feature, so radius gateway has a new pub/sub capability, which we've integrated with the key native serverless framework for kubernetes, which means that when you do interactions with the rgw bucket you're doing gets inputs, rgw can generate an event stream and you can integrate that with key native.

A

So you can kick off a function as a service function to do some operation based on that event, there's going to be a talk about this also here at stuff look on, and so we encourage you to check that out.

A

Sort of a third priority for Nautilus was around multi-site and hybrid cloud capabilities. um This one probably got the the least amount of attention during this last cycle, but we did make some progress in the greatest gateway, multi-site capabilities category. There are a couple new features.

A

One is an archive zone which allows you to use the Rado Skateway multi-site replication to have create what we call an archive zone which essentially archives every copy of every object, whether it was overwritten, your foot for backup, archival type purposes, there's also a new cloud sync capability, which does one-way replication between rgw buckets to an external object, store, my guess, 3 and then there's a lot of ongoing, cleanup, refactoring and planning around Rados gateway in general.

A

That's leading up to our sort of next big push, which is a version 3 sort of a next next step forward, with our overall multi-site capabilities that you'll we'll talk a bit about a little bit more later tomorrow. I'm sort of in the same vein, not strictly multi cluster but related, is the s3 lifecycle policy API that raitis gateway now implements which allows you to set automated tearing policy the standard essary api that allows you to move objects between different tiers of storage and even archival, and eventually in the future. It'll.

A

Do things like steering out to external storage, which brings us to our last category, which is performance, so people always want their storage to go faster, and so this is something that we constantly need to be looking at as the hardware in the industry evolves and people are deploying stuff across new types of systems.

A

So a lot of work has gone into this area. A lot of it has been around blue store, which is the new storage back in force. F continually improving the performance. There I mean Nautilus. We have a new alligator implementation, with better fragmentation. Behavior, more predictable memory, utilization, more predictable, CPU utilization. There have also been some key improvements in rocks TV that we've gotten simply by updating to the latest version of rocks. Db, that's embedded in blue store, particularly around read ahead and oh a map, performance and iteration compaction.

A

These actually may be back ported to mimic and an luminous, but we're not quite sure yet and other optimizations on the radius gateway side. There's a new web front-end called beast: that's replacing civet web that gives better performance and efficiency. This is sort of one more step in the on growing. Our GW work to refactor to a more synchronous model as I sort of push push forward down that path.

A

But the last the last thing I want to talk about, I want to mention and the sepia and sort of cluster.

A

So this is a cluster of nodes that were donated by Intel 2016 that have been the basis for a lot of our performance testing within the Ceph community and they sit in the Ceph community test lab, which is a big shared environment where all the stuff developers across all the different organizations participating in the project are working, and this cluster has been instrumental in doing a lot of the work of the last couple of years. But it's it's dated it's three years old.

A

The cpu architecture is a couple revisions back and so we're very excited today to announce that we have a new test. Cluster called officinalis that has been donated by Intel. That is just going into the lab over the next few weeks. So this is going to be 10 new nodes of 1u servers, they're new but balanced, which is a sort of a rare commodity. These days, which means that both sockets have equal storage and network attached to them.

A

100 gig networking, eventually they're, going to be expanded with the OP team dims, and these are these- are donated by Intel we're working closely in collaboration with Red Hat to spec out the hardware and the nodes themselves are built by by QCT. So we're very excited and very grateful to have these these coming into the community lab some of the exciting specs they're gonna.

A

Have you know top shelf Xeon processors, two sockets they're gonna have high-end obtain SSDs for metadata a couple of those and then they're going to have a bunch of capacity or it oriented SSDs for storage, multiple NICs, the nodes themselves are built by QCT. These are one of the somewhat one of the few nodes you can get out there that are actually Numa balanced, so you can have so the network and the storage are balanced across both both sockets and gobs of ram.

A

So we're very excited to have these these joining us in the community lab and one of the key purposes for these notes is to support a new project. That's kicked off over the last year, or so your so-called project crimson, and the motivation here is a realization that an improving set performance. It's not just about how many ops we can get because stuff is a scale, a system we can set almost any target and we just sort of add more hardware, add more CPU, add more nodes and we can reach any performance target.

A

But what's actually important is figuring out how many ifs we can deliver per CPU core, because the CPU tends to be our limiting factor these days and it's really more about efficiency and cost and how cheaply we can deliver the same number of AI ops and the challenge we have is that the current staff code base is based on a traditional multi-threaded program model.

A

That's been standard over the past two decades, but we're finding that the context switching inherent in that model is simply too expensive when storage devices are becoming so fast and so Project crimson is essentially a complete rewrite of the I/o path and staff using a framework called C star.

A

The idea being that you essentially pre-allocate, which CPU cores you're going to use you create a single processing thread for each core, and then you explicitly shard all your data structures across those core so that each core can run lock-free independently and there are no locks, no blocking all IO is via polling. All communication is via asynchronous, lock. Post message.

A

Passing essentially makes you sort of restructure, rethink your software in a different way, but you can get much much better performance and you can also integrate with tools projects like D, P, D K and s P D K. So you can also bring the the network driver into user space and bypass all the kernel overhead and similarly bring in the storage driver as well, so that you've essentially have the entire stack of software running in user space with the the IO to the network to the processing in one all-in-one process.

A

So we're making great progress with this there's several talks here at cephalic on about project crimson. So if you're interested I encourage you to check those out.

A

So I mentioned that we had those four priorities that we set back in Beijing. A year ago we had a discussion with developers here yesterday and we had an open discussion about whether those were the right four priorities or whether whether they may said weather or something missing and I, think the key thing that came up that wasn't really on the list was quality and it's not because quality isn't important.

A

In fact, if you ask any user, the quality and stability of Ceph is probably one of the most important things more important than performance, more important than multi-site capabilities and so on, and so we've sort of reframed or thinking a little bit about what our priorities are. um But they're not really priorities anymore, because it's really like everything that we have to be doing right. We have to be thinking about usability.

A

We had to be thinking about quality performance, multi-site and an ecosystem in general, and so it sort of the beginning of a conversation changing the way that I talk about and think about stuff and so I'm going to talk a little bit about more about this tomorrow.

A

But I bring up quality because one of the exciting pieces that we did for Nautilus very clearly does fall into that category. And that's the ability to capture crash reports so used to be that when any SEF daemon hit a bug, it would dump a bunch of information into the log and then system do you would notice the crash and it would restart the daemon. And since F is a fault tolerant, resilient system.

A

Usually the human operator wouldn't even notice unless they happen to be go looking at the logs or or had some other way to notice that there was that minor blip. So now, whenever a SEF daemon crashes, it generates an explicit crash report that a lot of our locks F. Those are regularly scraped and reported back to the cluster. And we now have a sort of a central record within the cluster of all the different crashes, along with a bunch of metadata about them.

A

What version of the software was running and what the stack trace was, what node it was on what the operating system was all that sort of stuff.

A

Additionally, there is an telemetry feature that isn't actually new and Nautilus is new and mimicked, but it's sort of updated and revised and Nautilus and those telemetry that telemetry capability, if, as a user you opt in, will phone home basic information about your cluster to the stuff developers, we can tell what versions of sefar being deployed in the field, how big the clusters are, what features are being used and also if there, any crashes that are being experienced in the field will get automatic, Matic reports that give us critical metadata about where it crashed and what version of the software it was running.

A

So any set operators out there right who upgrade to Nautilus I strongly encourage you to. You know review what the telemetry is and make sure it conforms to your concerns about privacy and so forth, and then turn it on because I stuff developers. This is going to give us critical information. That's gonna tell us what's going well with Ceph and what's when things are going wrong, so that's that's. Basically it again we're very interested in hearing your questions and feedback.

A

You can note down this URL and enter any feedback on the pad and ask any questions that will then bring up tomorrow during the town hall. But thank you very much for coming and welcome to cephalic on.