Red Hat OpenShift The Level Up Hour | Red Hat Livestreaming, 16 Dec 2020

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: The Level Up Hour (S1E19): Containers, Data Science and Replication

Description

Matt Micene (@cleverbeard), of Red Hat Tech PM, will join us for a topic that is close to his heart. There are two aspects really: a) how to make data science easier; b) how to improve the Reproducibility of data science experiments. We both think containers are an answer.

How Red Hat Enterprise Linux (RHEL) users and admins can benefit their organizations and improve their careers by learning how to use containers, Kubernetes, and Red Hat OpenShift.
Learn more at https://red.ht/leveluphour

A

Good morning, good afternoon, good evening, wherever you're handling from welcome to another episode of the level up hour here on uh openshift tv, I am chris short host of open shift tv and I'm joined by two other people, which you can't see right now. Sadly they're in a little box off the side. So I'm gonna turn around and fix that. But I will let the illustrious langdon white uh introduce himself while I'm fixing that real, quick.

B

So hopefully y'all can hear me, uh even if you can't see me- and you know in the grand scheme of things- isn't that better, um so I'm langdon white uh we're doing the level up hour and uh we'll talk a little bit about the show in general um and but we have a special guest today, uh matt maisini uh matt. Do you want to introduce yourself real, quick.

C

Sure uh matt maisini, uh so I am, uh I think we settled on technologists in the real business unit. I I I've done a bunch of different things for the rel bu.

B

I struggled with your intro on the show last week of like. I think this is his job title.

C

It's the joy of not having an official job title since I started doing research uh for our market intelligence team, focusing on like technology things. So I think that makes me a technologist. I I don't actually know what other people think I do so yeah. I look at. I look at technologies and emerging technologies and you know how they might come to market and how they might impact linux things and other.

B

Stuff well title or no, I think of you as a technologist, so there's.

C

B

um All right so uh kind of getting right into it. Let me uh let me throw up the slides, uh because you know what it's not a online presentation of any kind without slides, um so.

A

It's funny there's like a like a no slide kind of preference for people these days. Oh.

B

A

We use them just as like props right, like totally like this should just kind of be floating somewhere, but well.

B

I can show you even when I'm giving like presentations like you, know, kind of formal talks, whatever I really do use the I think it's generally referred to as like the y combinator style of slides. Now.

A

Where it's basically.

B

Just like a big image and maybe a couple of keywords.

A

B

um And then talk to it and people are always like. Can I have the slides for the talk afterwards and I'm like sure why.

A

But they're worthless. It's just a bunch of pictures.

B

Right um so I have actually on occasion- and I actually do recommend this is like if you're really like into the topic or whatever do two sets, do a set of slides that you're gonna present with and a set of slides that actually can stand on their own.

A

And then, when.

C

A

B

For the slides give them the second set or give them both, you know, but uh yeah, because you know you don't want to present slides with 87 different bullets. um No, so on that note matt sorry, oh.

C

The the quick hint for that, because it sounds like a lot of work- is uh once you've made your kawasaki slides, not y combinator. He was around long.

B

C

He literally wrote the books on this. um Take your speaker, notes and.

B

Literally, make slides out of your speaker.

C

Notes, that's the easiest way to to get that second set of slides, because that's usually what you want people to they want to be able to refer back. To is what you said, and most of us are usually pretty good about that data being there. Even if it's not actually things, we say out loud right, yeah.

A

That's true, and generally speaking, I take those speaker notes and either they were a blog post before it became a talk or they turn into a blog post or some kind of article somewhere um that, like reusing, that content in multiple venues is totally totally acceptable and often like a great way to bring more eyes to it.

B

Yeah yeah. No, I definitely hear that um you know like there's something to be said for having kind of almost like a series around a particular topic.

B

If anybody wants to get into the speaking thing um so uh so hello to, uh let's see we have uh det conan kudo, uh who tends to change his name a lot um and uh someone refers to matt as a futurologist. uh I assume they're, referring to uh matt.

A

um Yes, they are referring to matt. There.

B

Yeah yeah, uh all right, so the level of power. This is what we do, uh okay, so about the show. um So you know this show is about um kind of like I don't. I hesitate to say like introduction to containers.

B

It's not that as much as um trying to show why containers are kind of useful all the time, uh particularly for people who it may not occur to them, that uh they might want to use containers so like containers, are very much focused or like all the advertising and everything else or marketing is around developers, developers developers right to throw back to microsoft a number of years ago, um but in fact, uh they're quite useful in lots of other situations, and so the things we've covered on the show are like making a tools container that you can use in your data center or talking about the toolbox container that was developed by some red hatters for fedora.

B

uh You know or like deploying like your own next cloud is what we've done: the past few episodes, um and today what we're going to talk about is um doing data science uh with containers or arguably science in general, because one of the things that you want in science is capital r reproducibility, um because you want your peers to be able to review your stuff. uh So much like you know, quality assurance in the software world.

B

You need to be able to replicate something somewhere else, so that you can prove that it wasn't some anomaly of your environment that uh that made your science turn out the way you wanted it to or the way it did even um so follow us on twitter, I'm langdon with a one and chris short, is chris short and matt. Mycenae is cleverbeard, even though I didn't put him on the slide, I probably should have.

B

um You can join us to chat on our discord um and uh you know we're we're kind of there all the time, uh but that's one channel also to participate in the live stream show as well uh with the magic of restream. Most of the time it's replicated across all the different channels.

A

Most of the time, yes, today, it's working very well.

B

Is it isn't it? Yes, awesome all right, so um I you know screwed up my intro and did the intro of today versus the intro of the show first, um but you know whatever so today is containers data science and replication. um Last week I hadn't finished the show notes uh in time. So here are the show notes for both episodes. uh Last week's episode right now and the episode before that. um So obviously, if you just go to episodes, you can find them all. uh I even updated. That's.

A

But I have bookmarked be honest with you.

B

Oh yeah right right um so there's that um and I think that's it for slides, because if we move on, we will be giving away internet points.

A

um I'm gonna need to wait on that for a minute.

B

Oh yeah yeah, I mean those are a big deal, so I'm going to stop sharing and kind of, not hand it over kind of hand it over to to matt and kind of say. Can you give us a little background on like how containers can kind of help us in this problem space or or even why you got into this problem space.

C

Yeah, I think you know your intro to replication is probably a good place. To start right is like what what's one of the fundamental problems in that big r replication, big, our reproducibility, that we hit in science that we don't necessarily hit in sort of typical application development.

C

um So, let's, let's start there right if, if we've got an app we've built and we're looking to replicate it like either we're typically looking at like kubernetes rebel cassettes like I need 47 copies of this particular binary and it's uh you know, configs to spin up so that I can handle load or something along those lines. um Or you know at the other end uh you're talking about minor variations that don't really matter like if we're talking about compiling code, minor differences in the environment.

C

As long as the application is the same like we don't really care that they're, not 100, binary, exact right, like yeah.

A

You don't need that shot of match. Most of the time yeah.

C

Exactly we're looking for outcomes and in the science world and I think a lot of people got exposure to this this year. In particular, there's been a lot of news about whether or not like study or watch study is, is reproducible and provable and is the data there and what does the math really say.

B

Say for like a vaccine, maybe yeah.

C

Exactly there's there's been a strange thing happening for the past nine months. Right, you know. Suddenly this is this is sort of mainstream news, quite literally, um and so it's.

B

C

Easier to talk about state today than it was uh three years ago at devconf, uh because this wasn't uh on everyone's talk about was this study? Reproducible um right right so typically would sound like you go back. I don't know 2010, probably if you people who publish scientific studies, they're, usually papers right literally pdfs with you know, you did math, you have tables, you probably write out. Your generic formula in you know actual mathematical symbols and there's descriptions, maybe of how they were used, and then your final, like here's, the fancy table.

C

That shows me my log graph of whatever was being studied and that sort of was it right. If somebody wanted to reproduce that they had to hopefully read all of your paper understand, it figure out your algorithm actually from your formula and then hopefully you may be lying, didn't put the data somewhere. That might be shareable, but hopefully it wasn't proprietary under some sort of nda from a big pharma company. So, like there's all sorts of issues with like once, you publish the paper, how do you actually go about fact? Checking it like?

C

How do you do it right? If I have your code, on the other hand, right there's a big step there in what I can do to check your assumptions and how you got from point a to point b. So I said I think it's 2010, maybe 2011. There starts this conversation about what they're calling executable papers.

B

Oh that's interesting. I haven't heard this before.

C

Yeah, so it sort of follows: along lies, you're, if you're familiar with uh jupiter notebooks.

A

C

Right so project jupiter is probably 15 years old, I think um yeah and that that notebook concept right it combines the ability to write, prose and then show code examples and have those examples execute and then have the results like in a nice little web page based on some of that work and some other things that were going on in science.

C

At the time uh this executable paper, executable research paper, starts popping up in in various different places, and the idea is we have a lot of science, that's backed by a lot of math that carries a lot of data, the best way to get to a state where it's easy to peer review. These things is publish it all together in one format, so that somebody can walk along and look at your code and go.

C

ah You know what you that algorithm looks like it replicates this formula, but it doesn't because you did this wrong or because we know that there's a flaw here and also your data is there. So you can just like, oh that table click yep that table actually that code. With that data reproduces that table. It wasn't. You know a publishing error.

B

So so, actually kind of, at least for me raises a question that I often have about this and you kind of alluded to it. A little bit.

C

B

Let's say the data is, you know available like it's not proprietary or whatever? um Is there any sort of best practices emerging about sharing that data without you know, basically mailing me a hard drive uh because it's you know, terabytes of content um and because that's that's often a problem both in this kind of science, replication area, but also in kind of just general software development, especially data science,.

C

Yeah, I don't know that. There's any singular answer to that. Yet because.

B

C

The other side of the issue right is: if we start talking about containers and go back to that example, most folks are only they only care about. You know the binary and the config files right. It's that push to. I want my six meg go container. That does one thing and that's all it does the ones built around data science and other science support without the data right, we're talking about fairly complex python environments, our environments, just from getting the libraries in place right.

C

So there's a lot of different ways that people have gone about like okay, we'll just jam it in the we'll jam it into the container. That way, we know it's versioned and replicated right because that's that's an important part. um Other folks have looked at that uh large file support in uh like online version control systems like using lfa and github, or something along those lines right right, there's even specialty sorts of hubs.

C

If you will that, like jupiter hub um and there's shoe others that I cannot think of the name of right now that are designed to sort of allow you to publish your things there and then let other people pull it down, um binder sort of does that with with jupyter notebooks and data and then allows people to to get access to them. So I don't think that one's solved yet because it's the typical data issue right the price problem with any data like no, it doesn't matter.

C

What you're talking about is inertia right, right, you've got.

A

This massive data that that data set and it's you know whole mass. If you will um has its own gravity- and this is a problem that I've tried to address in my past and had some success. But I mean really it's it's a hard problem to solve.

B

Well, this is um one of the things that I I find really interesting, so the former cto of joyant, who I can never remember his name he's now at brian kendrill,.

A

B

Yeah um has a bunch of interesting like articles about basically moving your compute to the data rather than moving data to compute. You know.

A

That makes more sense, yeah.

B

It does, but it's not how we do software.

A

B

um And it does work a lot better if you can just set that petabyte of data somewhere right and then move the compute to it uh and that's, uh I think, a really kind of much more interesting model and I'd really like to see you know kind of software development kind of adopt that in general, um but so kind of going back to uh you, know kind of this replication in the science space. You know. Typically, scientists are not programmers or not.

B

You know not trained as programmers, so is there infrastructure in place that helps them in a sense or provides frameworks, or you know whatever I mean r basically exists in some senses because of science, um but that's that's pretty that's pretty low level for for the kind of frameworks or support. We normally look for to help uh non-programmers program in a sense right.

C

Yeah, because the folks who are are doing this level of science, they will very well be our experts or python experts, but they're, not python module experts right and as an.

B

C

I am not going to ever call myself a professional uh in working with with natural language programming, which is like the things I'm currently looking at and poking at.

C

I probably have five or six projects that vary by one or two libraries and I'm trying out a different technique that I learned online or trying to thief something from someone's blog post to try to learn something so they're almost identical, but for you know one library, one set of modules that I need- uh and I probably have six or seven then different virtual environments in python- and this happened to me just this week.

C

I went to pick one up and move it and I'm using pip end, but I've got a piplock file and I dump it to a requirements.txt and I pull it over and like pip install like that, should work right. It's python doesn't work somehow something got version locked to a version that doesn't work with it. I'm using python36.

C

It picks up a version that wants python38 like how does that happen right right? I love python. I use it all the time like how does it happen to me? It happened with.

B

Some poor other developer.

C

Scientists who's trying to figure out, let alone do I use condo. Do I use upstream? Do I use uh you know the distribution.

B

Version yeah package.

C

There's so many different options and then you layer on well, do I use a notebook and then which notebook and then do I use r and which, like the distribution, issues of the languages alone yeah, so there are folks out there that are looking at like okay, like uh jupiter, does put out. Containers that have this is the kitchen sink for data science yeah and it's good. It works yeah.

B

I've, actually, I think I've actually used that before um yeah the um one of the things actually that's super useful about the python community in particular, is that um they're friendly in a sense with their downstream distributions as much as possible. So, unlike a lot of the languages, um you know when they are kind of designing the packaging format, they're, actually intentionally choosing making choices that make it easier to distribute in a linux distribution.

B

um So you know, unlike a lot of languages like this, is one of the things that people don't really understand is like. Why can't we get an rpm say from you know a python module or a ruby module or whatever, and a lot of it has to do with, um like you know, kind of just automatically, I mean, but a lot of it actually has to do with the metadata being missing and stupid stuff, like the license on that particular module.

B

An rpm requires the license information, um and, if it's not on that ruby module, which at least that was the classic example- I don't know if that's still true, but they don't distribute it in gems. So you can't, you have to manually, go and look it up um and say you know it's.

B

I think npm had the same problem, whereas python has been progressive like every time, there's a change to something like rpm or whatever they're, trying to make sure that all the metadata exists that you might require as a downstream distribution to automate that distribution as much as possible, which is super nice, um so yeah, so jp data actually brings up a really good point of you know, isn't pip how you install stuff with python akin to yum or dnf.

B

Well, that's the whole problem is that they're, not a kin they're, two different solutions to a similar or the same problem that do not interact with each other, and this is actually something that's been looked at for a long time in dnf, for example, can we make dnf essentially a wrapper around all the other packaging tools right? So apparently the only way that you can get your like. True, like uh my new programming, language, creds or whatever, is write a package manager. So uh you know so every single language has a different one.

B

uh Lots of you know now we have different ones on different platforms as well. Like I mean, there's a there's at least two or three now for windows, um there's at least two for mac os as well. You know, then we have all the bsd ones and we have our bsd has ones, um and then we have you know debian, and we have uh you know, kind of the rel and it's it's friends uh with rpm.

B

None of these work together. um None of these model work in the same way, they're dependency management, they're dependency solving, are all different, um even though this is actually going back to science. This is actually a well-understood problem that has been solved uh with math and proved, um but instead of just saying, hey.

A

C

B

Universal installer, uh apparently it's cool to write your own package.

C

Well, yeah, and that was actually uh the solution to my problem, was I I was actually switching between two different python package managers really because I was using pip ends in one environment which not only deals with pip, but also sets up the virtual environment, yep and raw pip in the container, because I was like well, I don't need pip-band for all this. It's just a container right uh and somehow uh in the pip file lock for pip end.

C

That's where it inserted this weird version string, which I don't even I haven't gone back to look at the the uh that actual virtual environment. But the solution was, let pip be smarter than me right and do the right thing and now it works. I was like okay, we'll just we'll just remove the version requirement that locked on that file in the requirements text and lo and behold. Pip is designed to solve for this and some sort correctly, and I have a working environment in the container. So yeah.

B

Well, I mean- and that's that's often the big problem right as long as like, if you're not mixing and matching it often works so you'll have python developers, for example, who will not install anything coming from anywhere besides upstream pip. The downside to that is that, if you um is that you know kind of going back to that version, locking problem um you know there's. This is one of those really arguable positions. It's like and one of the things actually, we were trying to solve.

B

Actually with modularity and app streams is I want guarantees that the version I worked with will still work. So in other words, so I go inversion lock everything um you can do this in basically every package manager, somehow you can do it with you know dnf. You can do with apt-get, um uh you know, but the problem with version locking is you get exactly what you version lock to well, that means no security updates.

B

That means no patches, for you know like security, bugs or whatever, or even just bugs in general, that won't necessarily or or shouldn't impact your your running application. um This was my classic example of this. Is php um back in the day all the linux distributions upgraded to 5.3?

B

I think it was yeah um and if you did that, wordpress and drupal wouldn't run um because they were, they only worked with 5.2 with them, and so you also have different uh uh like kind of rules depending on the language, depending on the environment, about what things will be backwards, compatible et cetera, et cetera. um So that was a bit of a digression into one of my pet peeves of package managers, um but.

C

It's actually an interesting way to bring it back, though too is because the replica, the replication that we talked about in reproducibility that we talked about earlier since we're not talking about long-running exposed attack, surface style, web applications.

C

It's exactly that level of version, locking that we need to make sure we have available so that, oh, I don't know if I'm trying to reproduce an experiment that chris built and he likes the open, blas linear, linear, algebra, set up for for scipy and numpy. But I prefer atlas that when I take his code and his data right, we can't just stop at code and data because we don't want version, mismatches or developer choice in and what's my implementation to actually impact that outcome right.

C

So these things really do need to be like super specifically version locked and then to complicate it further. We go back to like our title of data science, like let's look at machine learning and artificial intelligence packages right that the data science, one that we just talked about from jupiter hub. That's I'm going to call that, like traditional data science, that's it's mostly math libraries and scikit-learn and those sorts of things. um It's got some machine learning, but it's not pi flow or tensorflow, or some of these larger abstractions around deep learning and the rest.

C

um So once you start trying to coordinate 15 different sets of requirements and libraries and modules and they're accompanying data sets and corpuses and lexicons, and like not the experiment, data just the things, it needs to run. To know that ah I shouldn't count the word the every time it shows up as significant.

B

And what is a stop word right, yeah.

C

Yeah exactly yeah: this is stop words, I was. I was trying not to use the stop. Word jargon. Thank you.

B

But yes, um I I did a bunch of data science in this space, so yeah.

C

So a bunch of our data scientists, actually at red hat, had been working on this problem in our so we've got a center of excellence built around ai um and they've been working on this thing that is called the open data hub and its entire purpose in life is to solve all these issues for you by letting them take care of all those problems and giving you a button in an operator that you can just push the button and get a data science environment, and it's always going to be the same, and you can control how it upgrades and and what capabilities it has that are always going to be identical and no matter who runs it where they run it.

C

They just need to push the button, get their environment and then attach it to some data and start going right. So there's like the one end is there's one off onesie twosie people building, here's a catalog of containers, of six different options and then there's the here's. This all-singing, all-dancing, orchestrated massive beast of an operator that can do magic for you.

B

Well and uh well, replicated replicable magic in a sense.

C

Yes, identical magic.

B

Right right, um yeah interesting, I actually didn't realize that was one of the kind of goals of open data hub, um but that is that is definitely interesting. um One of the things I think uh you know kind of my own experience that we're we're also seeing is um universities kind of starting to recognize this problem and collaborating more in order to uh with other universities in order to try to resolve uh some of these challenges.

B

um So, like a project you know, matt, I think you're familiar with, but one that red hat's involved with is the massachusetts open cloud which is you know uh I just blanked on the word uh but association of group. You know.

A

B

um Oh, it's actually yeah, uh so collaboration of, like I think it's like boston university mit, I think harvard's involved. You know like a whole bunch of universities um and they basically put a whole bunch of hardware in a data center and they're starting there and what's interesting is the project itself is its own science, as well as supporting other people's science.

B

So what they're trying to do is make an open stack or some sort of basically public or cloud that you can share hardware into and get hardware out of when you need it, um and so you know, the problem with your typical cloud right is that you have to go, buy it kind of on demand and if you have existing assets, um you know especially, you know. Big iron assets there's no way to kind of have them participate or when they're slow share them with others.

B

um You know, and so that's one of the problems that organizations trying to solve, but while they're solving that they're also offering to you, know that primarily seem to be focused on uh like biology research for whatever reason, and so there's a bunch of biologists who are doing a lot of uh you, know kind of research work, big data type work on their platform, uh while they're they're moving bits around, uh which I think is kind of entertaining.

A

So there's a question here from nl hackem, as we've decided to call them. um So what's the difference. So what's the science of data science, if the outcome of the same algorithm can vary with the version of python you're using shouldn't two plus two always equal four.

C

Yes, two plus two should always equal four uh for large enough quantities of four um or small enough quantities of two, uh but so usually it's not it's.

A

C

Yeah, it's not so much that the math varies. It's that the implementation of the algorithm. So there's we get into the into the nitpicking details between the formula and the algorithm right. The formula is the math written on the whiteboard, and that is what it is, and the algorithm is how you implement it and so implementation details, depending on what you're doing and for data science, it's a little bit different uh because there we're looking for things like uh when we talk about some of these differences.

C

Folks are typically talking about like training, speeds and accuracy levels right. So how uh how accurate is a model over another model could be a variation in some of the math uh packs that they use right their implementations of certain kinds of algorithms.

C

So for data science, it's it's a little more, uh not a! I read your paper, I recoded your stuff and I got a different answer. So much as it is, I took your model. I took your training data and it wasn't as accurate as you claimed, or it took 47 times as long. So therefore, that's that's not good, because that's not what you said. You did in your uh in your paper.

B

Well- and one of the points I would make too is uh like michelle orlondi: maybe um it brings up in the chat uh when we do math on computers. We make trade-offs, especially when those numbers are really really big and rounding errors, and so their example right is four followed by a whole bunch of zeros and then one you know, rounding errors are a big deal where you take. The rounding error. Pane also has an impact. um The other another one that comes up, which is kind of weirdly interesting, is most random.

B

Number generators on a computer are not actually random, and so when, if you use a you know, uh what do they call pseudorandom uh totally blank if you steal random inputs uh that can actually have an effect on your outcome because you actually are because of the lack of randomness. You actually get some variation in the result, um so there's so like. If we were doing all this work by hand, it would be very replicable.

B

It would also take two or three hundred years.

B

um You know and that's the problem right and one of the reasons I would say that there's also a big push for quantum computing in this kind of space, because quantum computing can be significantly more accurate in a very, very odd sort of way, because it can actually model the true numbers involved, rather than our approximations on using a binary uh system kind of way down underneath um and if we, uh if we didn't, if we weren't getting too deep into computer science, um you know this is this: is one of the big pushes for why quantum computing is so appealing, especially in the science world, um so yeah.

C

Well and I'm gonna get the I'm gonna blame python, even though I don't specifically recall that this was the language, um but on the on the the narrow end of that rounding error issues, I I think it was python the switch from two three to from two to three.

C

They changed how rounding picked where it was going on simple math, so you could have something that two plus two was four in python two, but two plus two was five in python: three, because it was actually.

A

Two point: five: nine, six, seven right.

C

It was at the rounding point where in python two it would round down and then python 3 it would round up so yeah and again, I'm blaming python. But I don't actually remember that it was python, but it was. It was a pretty common language. It was on a major version boundary like that yeah I mean, even even on the short end, like simply a programmer's like yeah. We now have floats so I can round longer, like I've got better access to 64-bit hardware.

C

So I can, I can round differently and suddenly from a major version of a language to another. It's it's doing. Math different.

B

Yeah yeah yeah, I I don't like math math on computers is, is really really hard if you've never done kind of real math. When you're doing two plus two equals four, uh you know it's. It's not too big a deal. um It's funny right because you know when I was in college um and then shortly thereafter. I worked in r d. um You know I did you know the math in college and I did a lot of that stuff in computer science, land right and then I went and worked in an r d.

B

You know center for a while. I mean used a lot of that math or whatever from then on. My math is counting by one and knowing when to stop, um and you know uh and that's about it right uh and so it's kind of amusing.

B

uh You know you do all this kind of really high level math um and how to implement that. Even in software uh in computer science, you know classes or whatever. But if you go on to be a software developer, you actually don't, generally speaking, uh have to do very much of this. So the it's kind of roaring back with the push.

B

Excuse me with the push on data science, um and so a lot of that you know, math is being dredged up and all the problems with math uh on computers is is kind of coming back.

B

um Did we have any other questions in the chat.

A

um I mean the we've been having a general conversation about. You know, bioinformatics genomics data science, that kind of thing um we talked about protein folding a little bit and the recent advancement there kind of thing, which is awesome um for folks that haven't heard uh was deepmind, did, has made some quote unprecedented progress on protein folding, and it's like when you think about protein folding and how that can cure cancer or genetic diseases. It's you know having any kind of advancement.

A

There is a big gain, especially if deep mind can then take that and then put it into all the people that are using things like folding at.

C

A

That algorithm, that gets better and can solve more problems faster kind of deal so, which is pretty awesome a.

B

Protein folding is one of those.

A

B

That's super interesting for me because, um like do you remember, there was a kid a couple of years ago, like a kid like somewhere between 12 and 18 ish, who kind of won the prize for doing the most protein folding. um You know and.

A

B

Know is not a scientist per se, but has learned a ton about you, know, kind of genomics and that kind of stuff, uh from kind of working with the protein folding algorithms and tools, which also reminds me of another story.

B

When I was doing math, I did a lot of like math. That was not straight up computer science math, because I thought it was fun um and one of them one of the spaces. Is this space called tilings, uh which is where, like what kind of shapes? Can you put on a plane that will cover the plane completely yeah? Well right exactly- um and uh this has uh like this- is a problem space for that's really interesting when you're trying to like coat things in you know material.

B

For example, um you know like the shuttle with uh tiles, for you know, heat uh dissipation, um but this woman, who was like you, know, a housewife. You know in her like 30s or 40s, or something like that uh discovered something like seven of the 12 just doing them on her kitchen table when, in her like free time,.

A

B

You know uh which you know it's that kind of story where I really like, when, when the barrier to entry into science and math is so low uh where people can just participate, I think that's really awesome.

C

I think that's the uh that's the upside of some of these. You know things like kaggle and uh binder and putting these things out there in the public that are easy to grab and run on a on pretty much any laptop right. So we get back to the.

B

The other promise.

C

Of containers that transportability concept right right, it's written once I can run it anywhere, even though that's probably like trademarked by son- and I shouldn't say that out loud unless I'm talking about java.

B

That's um that's right.

A

Everywhere I had a java champion on yesterday we're good.

C

Most of my professional system, administration career, is all javascript, so I have. I have feelings, um but the uh you know the idea of being able to do for anyone to join in. Like I don't know, if anyone's participating in kaggle, it's actually a really. It's actually built a really good community.

C

There's a lot of people out there that uh you know as a newbie data scientists, there's a lot of other, very accomplished folks who are doing these challenges to to to put these to do this work and it's it's a very much a it's a competition, but at the same time it's hey. I put this out there, and this is my code like what do you think and you'll get comments from very, very experienced folks saying? Well, this is an interesting approach, but have you looked at xyz or this is off?

C

You should check these things so they're building communities around this sort of stuff. um Sorry.

B

On the other side, I'm not familiar with kaggle.

C

With a k, l, a g g, l, e dot com.

C

Yeah, I know red hat has actually put out a couple of challenges around, um uh I think was one of the ones I looked at was like sentiment, analysis and some uh you know in some tweets, so some of them are are uh just community driven challenges. Others are actually businesses putting out like bounties on hey. We need an approach to do this sort of analysis on this data and we don't have the resources that sort of thing.

C

um So it's a very interesting place to go, lurk and learn, um and then the other one is it's sort of the the idea that you find that you brought up about uh compute locality. I think that's sort of an old school thing in the high performance compute environment that I used to deal with from a customer standpoint and a long long time ago, um in a galaxy, far far away far far away, and that was one of the like.

C

I I have these big recollections like back when, when we were talking to these folks about building data lakes, big data- you know in the in the stone ages- um and you know the idea of having a sephora cluster cluster that you ran compute on like it was like.

C

That would be amazing if you could ever figure out how to do that, because the traditional model of install and application didn't work correctly, but in a modern world you know fire up a couple of containers on a data lake like this is a much different sort of uh sort of proposition and uh to be able to actually do those sorts of things in a controlled fashion without completely destroying data and nodes and all that sort of management right. So right, yeah.

B

Right, I mean management in general. Of of that entire kind of like platform is often the challenge.

B

You know how do you you know and performance too right, I mean you know with data lakes, for example, like you know, when you're doing software development against data lake you're insane, if you're actually hitting the data lake directly, what you're actually doing is usually building a data mart in front of the lake uh to basically initial do the initial data massage uh so that when it, when your application tries to read the content, uh it's been massaged in such a way that it's easier for the application to consume so that you actually get performance numbers that are in the milliseconds versus in the minutes.

B

um You know, because a well-designed.

A

B

Lake is that, like a.

A

Data cache I mean yeah. Actually, that's what I mean. That's really what.

B

A data mart is yeah or, or you know it's one of those things like lots of things in software. There's 87 different definitions, depending on who you ask, but, generally speaking, a data mark is considered a you know, sometimes a right through cache, but uh you know often just a read-only cache of a lake so because the way data is stored in a data lake generally speaking, is not a good design for that data for the applications that would consume that data.

B

um So yeah it's uh it's kind of an interesting problem like again something I I used to work in. uh You know a long time ago.

B

So um so we did have kind of the example that we were talking about. Matt you and I before the show. um Is it worth kind of going through that or or talking about um the kind of uh container tool chain or deal that you had been working on.

C

uh I mean we could it's a little bit uh old and dated at this point in time. It's probably it was a for a presentation I did in 2017 at devconf in brno. um That's.

A

Still pretty new.

C

Well, you would think so um until you go look at that, contain science and realize it's it's ancient and doesn't build anymore because well, containers have moved on since the first time.

C

uh Three years is a long time in the container space. The the idea, though, I think is, is along the lines of what we've been talking about. Is how do you? uh How do you build for for the people who uh who are the the scientists and doing the experiments? How do you build a set of tools that are a set of toolboxes that they can use consistently without having to be experts themselves in packaging and dependency management right?

C

um And it's the same sort of challenge that uh you know the jupiter hub images that we talked about? Are there I'm currently poking around with vs code, because they recently drop some support for podman into their um container remote execution, plugin module extension, that's what they call them, they call them extensions. um You pick a word.

B

C

So I think you know as an example as an exemplar of like what the challenge space is. It's it's still a valid repo um as an actual workable example, not so much anymore, uh but it it essentially set up a uh the literature. The linear algebra example that I brought up earlier is that you know somebody wants to see how the different implementations of that are available for python. For numpy to do linear, algebra. You can easily set up two different.

C

You know you set up your base container and then layer on two different uh new layers that set up that, and then the data scientists or the scientists can pick from those on top. And how do you maintain that stack underneath for them so that they don't have to.

A

Yeah narendra was asking what the traditional data mart would look like langdon. uh Would it be like something you would use sql to access nosql graphql, I mean you know. How would you like talk to this data mark.

B

So it really depends on the system, um as I kind of respond to the chat, it has a lot to do with the design or the goal of the of the tool or the the implementation rather than the product set. You would choose the product based on the goal. So um you know redis, is you know, name value pairs, um so very, very good if you have very disparate data, uh but it's it's kind of in that construct, so you know think like environment variables, but lots and lots of them.

B

um So if you're, if you have that kind of data, then redis is a good choice. um If um you know, if you take uh what we used to call document stores that are often referred to as nosql databases now like mongodb, for example, mongodb is really really good if you need a big block of data based on a small piece of data, so especially if that block of data, the structure of it is not that important.

B

um So this a great example- I was actually talking about this with my brother, who works for a museum museum websites, are good or can be a good place for something like a because you say hey, I want art, exhibit x right and you get back this whole block of images of it. You know curator's descriptions why it's important. You know the history of it all the stuff, the individual pieces. The structure of that data is not that important you're not going to have duplications across like the curators comments across pieces of art right.

B

So there's no deduplication there. So you just kind of pull that whole block back, whereas like a traditional sql database is really good when the data is structured and what I actually often recommend is people start their development of their application with something like a mongodb.

B

When you don't know what the structure is going to be or where the structure is going to fall and then basically migrate data into a sql database. As you understand the structure of what you're actually looking at, when you talk about a data mart, I would say in my experience: they are traditionally um sql databases, um because the the lake is well structured generally speaking, um but you want to do some manipulation. So, for example, um you know most, you know a true. This goes back to going to computer science.

B

When you talk about sql uh one of the things they tell, you is always put it in fourth, normal form which basically reduces the duplication of data across your data set by as much as humanly possible. But that means you have a ton of different tables. So, for example, your customer record might be made up of like 57 different tables.

B

All with you know, five columns each because that's the way the data is structured and that's the way you get the most deduplication, but for a front-end application who needs to display the customer profile, that's very expensive to go and query that entire 57 set of tables. For that one record really. So what you do in a data mart is actually create like a view right in the sql sense, except maybe not literally implemented this way.

B

Usually it's it's a there's, an old term for it's like an implemented view or something, but basically where you've already crashed the 57 tables together. Now you have a customer record in one table, but it has lots of duplicated data, which is fine because you're only pulling out a piece of your lake, uh so you know you're talking, you know a gig right. Instead of you know five petabytes, um and so that's that's kind of what the idea behind marting is.

B

um I would say you know, and then basically it's like you know going back to the museum example. Your lake might be all the different art that you have. You can actually think of a physical museum, uh usually something like 20 or less of the art is actually on display. Most of it is in their basement um and they rotate it through because they don't have enough physical space to show it all off at once same kind of idea.

B

So, like you know, your lake might be have all of the things that you've ever heard of any special exhibits. You've ever had things that were on loan from all these other places. You know et cetera, et cetera, whereas your mar in front of that, even if it's the mart is now mongodb, let's say, is just the stuff that is currently things that you're showing on the website. So that's another example, so that was, I don't know. Does that answer that question um that was kind of long.

B

It's a complicated subject.

A

I would assume yeah it's complicated, that's as good as we're gonna get. I think right.

B

It's it's a space where, if you play in it uh it comes like. Oh, like it gets really clear, very fast. um Like you know, like the first time, I tried to pull a query against a 57 table customer record. I discovered very very fast that I wanted my response times to be not minutes right um so yeah, so it makes it much more much easier once you actually participate in the space here, because it's uh like 10 of and say.

A

Yeah we got some points to give away.

B

Gotta, do some sweet, sweet internet points um and I will share the screen.

A

Let me play screen sharing music.

A

B

Kind of hilarious um really, I would think it's more similar to jeopardy music.

A

No, I mean it's copy written right, like that's true, yeah,.

B

um But so is happy birthday. We use that a lot.

A

And who was it that said, they discovered and tried to copyright some copyright troll, like they discovered supposedly every rhythm, that you could possibly do a while back anyways yeah, so I probably just violated somebody's copyright just by coming up with something in my.

B

A

B

Yeah well, and then there was the other person who wrote a new happy birthday. Yes, it wasn't proprietary.

A

But then the lawsuit was one to like make happy birthday normal again, but everybody had moved on.

A

Anyways points points.

B

All right, so sweet, sweet internet points uh for anybody who's new to the show uh we uh award points for coming to the show or for doing a pr on our um repo. uh You know- or uh you know, filing issues, there's actually like there's an activities page on the uh repo itself. um uh Let's see, can I copy pieces from anywhere now that I'm not allowed to actually open the slides independently of the uh presentation um but netherlands hackum 2900 points narendev with 2400 points.

B

Noah friction 2300 joe fuzz still holding on with 1800, um and then we have a couple of uh people with a few episodes under their belt.

A

B

And uh jp dade, we know you're here all the time.

B

So yeah, uh you know the only the only way you can get points is by collecting them. I'm uh trying to get the uh the cut and peaceable version.

A

Yeah I was about to, I got a bigger slide.

B

And so uh yeah, so you know collect the internet points. uh We think they're uh fun and entertaining way to kind of say you know hey. I came to the show I'm engaged with the show someday. Maybe we will have prizes for the internet points um at this point. We just like to keep them away and we like to say: hey thanks, netherland hackham thanks narendev thanks noah thanks joe fuzz thanks jp dade for coming to the show, and we really appreciate your questions.

B

We appreciate you being here because this would not be any fun with no audience. um The other thing I want to highlight uh again this week um was some brand new people with sweet, sweet internet points um and I'm gonna just.

A

B

These um but suck a lot, maybe um darko.

A

Dark owls, yeah yeah.

B

Yeah and then I was reading this as maddow, but maybe it's supposed to be meadow.

A

B

Shouldn't it shouldn't it be m a three d-o-w or.

A

M-E-A, it's, I think it's mea.

B

Okay, all right! So, oh that, however, uh just fyi that is the wrong code, um that is from a prior episode. I didn't realize.

A

I'd put the code on this.

B

One too, um so it is not updated, so don't use that one use the one in the chat. um That's from some other random episode, which you can go watch yourself and if you go and watch past episodes you can still submit for the points for future uh ah see. So somebody.

A

B

um So thanks to the new people, uh we hope you come back. We hope you uh apply for some inter some more internet points. um You know we have some people at the top who have been at the top for a long time. We really need them to. uh You know, have some competition. um You.

A

B

Are always dark courses, though, because if you fill out the form and put private in for your public name, I will not mention you in any way, um and so you never know if there is a dark horse out there.

A

Cool so matt. What do you think of the sweet, sweet internet points.

C

I mean you know: internet points are always good and hoarding is a is a very valid and uh and useful thing to do, especially in especially in the dark times of 2020. uh You know, I think, grabbing as many as you can and keeping them for uh for future, something like a squirrel, it's winter we should. We should do that. Yeah. Definitely.

B

Yeah, so to borrow from another organization, you know gotta catch them all. Is that what you're saying.

C

Sure I was thinking you know stuff him in a tree, bury him in a hole. You know yeah yeah, typical dragon and squirrel things. It's weird how dragons and squirrels do the same thing like I don't know anyone's noticed that before, but you know that is, squirrels are basically dragons without right.

B

I live in the city, so um you know. As far as I'm concerned, you know, uh squirrels are really climbing rats right. You know we have flying rats, which is another word for a pigeon. um You know et cetera and.

C

B

We have actual rats because I live in boston uh but yeah, so thanks everybody for coming, should we wrap it up there? Do we have any other closing comments? um Oh, I did have one uh sorry. I almost forgot go.

A

B

Which was, let me just find my notes, because I have a link, um someone in the show last time offered for or asked for open shift kind of starter friendly options, um and let me kill the sharing, because uh I like it better when we're not sharing, um and so recently we released um the I'm gonna, get the name wrong, openshift developer sandbox, I wanna call it um and you can go check it out, and this will give you an open shift cluster for some period of time to do with, as you will um so.

B

We already, as we've talked about on the show before we have a cool, a bunch of cool things um like the learn.openshift.com or try.openshift.com, which kind of give you more of a guided experience. um You can also contact a like whoever your red hat representative is, and they have even more access to kind of guided content if you want to play around with openshift, um but the uh we just launched this.

B

um This developer sandbox, which I think is super cool and you should go check it out um and because matt's here and is part of rel. We will also pitch the developer program, uh which gives you uh you know, free or sorry, no cost access to rel. For many many use cases, one of the things that we're actually working on is trying to update the like fac kind of not the t's and c's. The t's and c's are pretty good.

B

The problem is people read the terms and conditions around the usage of rel in the dock, and they think that there's lots of scenarios that they can't use it, uh which are not actually accurate. uh They, you can actually use it in a ridiculously large number of ways uh and still be licensed compliant, um so some free and like they said, starter and free options. I think- um and uh so I just wanted to kind of share that, because I think the dev summer box literally went live like this week.

A

It's the ninth.

B

A

Last, oh, it's right in the url.

C

B

So uh so check it out and you know definitely come back for the next show uh where we will be talking about what are we talking about? Oh we're, talking about um the docker um or sorry, the deprecation of docker in kubernetes. um That's.

A

The next show that's.

B

Supposed to be, the next show.

A

Yeah great all right, um that sounds like fun, yeah well,.

C

A

Mean to be honest, it was a cncf ambassador challenge uh that we had to like all all hands on board kind of thing on twitter. One day a couple weeks ago, so yeah that'll be fun to talk about.

B

Yeah well, so uh I basically want to do some kind of examples of why it probably doesn't matter. um So that's so I'm going to try to prep that for the next show, if I don't finish prepping it for the next show, then we will be doing something else, um but so watch for uh you know kind of my twitter feed, chris's, twitter feed, the openshift twitter feed, uh the red hat twitter feed.

B

Sometimes uh for what you know, the topic of the show will be um one of the other things I'm trying to tee up is uh trying to get some of the data scientists who work for red hat. To also do help us with the data science show um and yes, we are taking a couple week break our next.

A

B

January 7th.

A

Or something seventh, sixth, yeah.

B

um So yeah so we're taking a couple weeks off um because we got to get our writers to come up with more jokes. And you know all that stuff.

A

Wait, do you have a writer yeah, don't you no.

B

I've been getting content from from yeah the whole writing team. Like.

A

B

But we gotta give them a break once in a while uh yeah so uh and then the other episode that I want to do soon as well is um service mesh with nexcloud. So like yes, we got next cloud deployed into openshift.

B

I want to show a bunch of the cool things we can do with service mesh without actually doing anything like we're just going to check some boxes and we're going to have all kinds of cool things like uh secure traffic across the uh the application.

B

um You know things like being able to do: traffic shaping and traffic flow and canary deployments and weird stuff like that, so um that current my current plan basically, is that on the sixth we're going to talk about docker and kubernetes, then on the 13th we would talk about service mesh um and then you know in the maybe the 20th we'd be able to have the data science on, but you know obviously it'll be somewhat subject to when I can get interviewees available um so yeah.

B

So hopefully you all enjoyed this uh thanks so much to matt marcini for for coming uh to the show. We really appreciate it and uh chris do you want to pitch anything else? That's happening today.

A

Yeah so uh later today, we've got the open shift administrator office hours with the one and only andrew sullivan and then at noon. We're talking about ai, enabled protective.

B

A

For cost and performance, optimization, uh that is with uh profit stores, the name of that company that'll, be joining us for the openshift commons briefing. Then after uh yeah 1400 eastern 1900 utc we've got uh red hat enterprise. Linux presents security, so that'll be a good show to kind of finish the day. Talking about the all-important thing, security so.

B

A

Tomorrow, at a very special time, 11am eastern 1600 utc- we will have the one and only matt hicks here on the channel, so bring your questions to the uh head of product and technology. um Here at red hat.

B

um And that's the show is in the clouds right.

A

In the clouds with red hat leadership, yes,.

B

um So do you know if the admin show today is going to be about storage or not um andrew sullivan? That's we chatted about it yesterday, but I don't know if he decided which way to go.

A

I mean it can be, I think we said we were going to do storage just because of uh storage office hours. Last week. Oh.

B

A

Yeah so there's lots of questions around storage and kubernetes. Imagine that.

B

Cool well thanks everybody for coming and uh we'll see.

C

B

Time in about two weeks, ish.

A

Yeah have a great break everybody. Thank you all for joining and we'll see you next.

A

A