Scicloj Meetings, 30 Aug 2020

From YouTube: Lightning talks and discussions on Clojure data science - Scicloj meeting 12 - Aug 2020

Description

On Aug 30th, 2020, the Scicloj community had a public meeting with diverse lightning talks and discussions about data science in Clojure.

Background and agenda:
https://scicloj.github.io/posts/2020-08-26-public-meeting-2020-08-30/

Text chat:
https://tinyurl.com/yymuadpb

Some notes:
https://clojureverse.org/t/scicloj-needs-you/6467

A

Hello, everybody. This is the August Clojure data science meeting, the Scicloj meeting. Let us mute ourselves as much as possible while others are speaking, and maybe in a moment we will hear ourselves discussing some stuff. In the first part of the meeting we will have a few very short talks by several of us, and afterwards we will have a discussion.

A

The idea is mainly to catch up, to hear a little bit about the projects going on in the ecosystem we are building, and to see everybody here. It is heartwarming because, of course, it is an unusual time and people are having their private, local struggles, and being here matters, I think, for us.

A

We are looking for a way to somehow make sense of what is going on. I think, at least for this community, we can have good hopes for the coming weeks and months, and let us discuss that a little bit later.

A

So now what we are going to do is have a sequence of short lightning talks where people will tell us what they are about and what they have been feeling recently. I guess we will try a method where each person may speak for five minutes, or less if they wish (less is also good), and we will use a gong sound so that people know how time passes.

A

The gong will be heard at the beginning and then after three, four, and five minutes, so that you can tell when your time is starting to end. The first person to speak will be Anthony Khong, about the new Spark wrapper he has been building. Let us just hear the gong for a moment.

A

Is it okay? Could you hear it? Yeah, right. So that sound will happen at the beginning and after three, four, and five minutes. More people are joining. So, Anthony, are you ready to begin?

B

Yeah, except that I think you need to enable screen sharing. I've brought some slides.

A

Yeah, perfect. I am allowing that, and now you can share. Let us begin.

B

Okay, please let me know once you can see the screen.

B

All right, great. First of all, thank you so much for having me. Since we only have five minutes, I'll just cut to the chase. I'm talking about Geni, a Clojure dataframe library that runs on Spark.

B

There are two things I want to cover today: what it is, and some of the design goals that went into it. So, what is Geni? The first thing to say is that it's an idiomatic Clojure dataframe library. What I mean by idiomatic Clojure here is that it should be nice to read and nice to write. And, being a dataframe library,

B

it does what you'd expect it to do. You require it, you load data, and then you can do your count and look at your columns, for instance, and all of this runs on Spark.

B

This is an example of doing a group-by and aggregate, and when you show it you can see the results here. Being hosted on Spark, you get Spark ML for free, so there's a reasonably rich machine learning library that goes with it. This is an example of how you'd create a machine learning pipeline, and then it's just a matter of

B

ml/fit and ml/transform, just like what you'd expect in scikit-learn. There's also RDD support, which just means that you can do lower-level Spark if you like.

B

The next thing I want to say is that it's not just a library. It comes with a command-line interface as well, which gives you a standalone REPL or lets you submit a script. So it's a bit like spark-shell: when you fire it up, you don't need any requires. The dataframe and the library are required already.

B

So that's a brief tour of what Geni is. Now, some design goals. I think the overriding objective of the project is to provide an environment where you can get fast feedback from the data. This will be familiar to a lot of data scientists: you're constantly living in this cycle where you get some idea about your data, you run a query, you get the result, and then you build on top of that idea and get new ideas from it.

B

So the idea is that we want to optimize this feedback loop. Some important factors are kicking off the feedback cycle, so it must be fast to start; then translating your ideas into queries; and finally query speed. A fast and accessible REPL is one of the design goals we're trying to hit, and the use case for this is spontaneous queries. For instance, you're working on a dataset and you're thinking, okay, what's the most expensive region in Melbourne?

B

I think Python is really good at this because it starts very quickly, and then you just do import pandas as pd, you read your Parquet file or whatever file you stored it in, you do your query, and you get the answer. That's fine. Basically, we want to be able to do the same thing, but with Clojure.

B

That's where the Geni CLI really comes in: no need for lein new and lein run, no need for requiring this and that. It starts up and then you're good to go. It's only a little bit slower than Python in this case. The second thing is about translating your ideas into code, and one thing to say is that it has to be nice to write. There's nothing stopping you from writing interop code such as this one.

B

This is pure Scala interop, but it's not good, because it's quite verbose and a lot of these queries are short-lived: the lifespan is under one minute, and if you have to write a lot of throwaway code like this, it's really not nice. With Geni,

B

you can write this instead, which is a little bit cleaner and shorter. It has to be nice to compose as well, so instead of getting Scala data structures you want to get Clojure data structures that you can plug in somewhere else. With g/collect here, you get a sequence of nested maps. On query speed: this is a very quick and dirty benchmark that I did, resembling a project that I worked on earlier this year. But Geni is fast.

B

It's fast out of the box. It can be up to 70 times faster than pandas, and seeing 30x is quite typical. There are some other libraries there as well. I think that's my five minutes.

B

I hope that piqued some interest, and contributions are welcome. Thank you so much.

A

Thank you so much. That last benchmark was amazing, and we'll discuss it further.

A

Okay, so now I will briefly show some Clojure workflow.

A

So I will share my screen. What I am trying to show in this workflow is a couple of tools that are growing these days.

A

What we are seeing here is Notespace, which is an experimental way to connect the REPL, the browser, and the editor (Emacs in this case, but eventually any editor), and we will see how we can try to connect all these tools in some dynamic way.

A

What we have here is a certain Kaggle avocado dataset, which we are reading thanks to the Tablecloth library, a wrapper of tech.ml.dataset. Tablecloth allows us to do pandas-like processing of a dataframe nature. The visualizations we're having are by some part of the Pink Gorilla ecosystem, which offers a certain DSL for viewing data using certain JavaScript libraries that are wrapped by that DSL. What we are also having here is a UI element, some user interface that can control our REPL state.

A

So, for example, what we are doing here is using Tablecloth. We are taking all the avocado sales in Seattle, grouping them by type (you see conventional and organic), and looking at some statistics of price and volume. We can filter these: if we take a minimal volume which is big, we have only conventional, and if we change it a little bit, then we have organic too. All of that is happening in the REPL state.
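As a rough illustration of the filter, group, and summarize steps just described, here is a minimal pure-Python sketch. It is not the Tablecloth API and not the code shown on screen: the rows are made-up placeholder values rather than the real avocado data, and the field names (`region`, `type`, `price`, `volume`) and the function `summarize` are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean

# Made-up placeholder rows; these are NOT the real Kaggle avocado data.
SALES = [
    {"region": "Seattle", "type": "conventional", "price": 1.0, "volume": 900.0},
    {"region": "Seattle", "type": "conventional", "price": 1.5, "volume": 700.0},
    {"region": "Seattle", "type": "organic", "price": 2.0, "volume": 100.0},
    {"region": "Boston", "type": "organic", "price": 3.0, "volume": 50.0},
]


def summarize(rows, region, min_volume):
    """Keep one region's rows at or above a volume threshold, group them by
    type, and return the mean price and mean volume for each type."""
    groups = defaultdict(list)
    for row in rows:
        if row["region"] == region and row["volume"] >= min_volume:
            groups[row["type"]].append(row)
    return {
        kind: {
            "mean_price": mean(r["price"] for r in members),
            "mean_volume": mean(r["volume"] for r in members),
        }
        for kind, members in groups.items()
    }


# A high threshold leaves only one type; a lower one lets the other type in.
high = summarize(SALES, "Seattle", min_volume=500.0)
low = summarize(SALES, "Seattle", min_volume=50.0)
print(sorted(high))
print(sorted(low))
```

With these placeholder rows, raising the volume threshold drops the organic group entirely, which mirrors the effect described in the demo.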

A

So there is some thin layer connecting to the browser, and the magic behind the scenes is not only some parts of the Pink Gorilla ecosystem, but also the cljfx library, which offers a really neat way to manage state in JVM Clojure, and I encourage you to look into it. So we see here Tablecloth and Pink Gorilla and also, behind the scenes, the tech.viz library that generates Vega specs for visualization, and all of these compose as pure data transformations. That is what I wanted to show: how these few libraries can connect.

A

And the point is that it is now very easy to see how things can connect with very thin layers. So it is very easy to experiment, and let us talk about further experiments on how REPL state, browser state, editor state, and your minds can connect together. So thank you, and I'll stop sharing now.

A

So now our next speaker will be Jon Anthony, who will show some updates.

C

Let me see if I can get the screen share here. Yeah.

C

Oh, have you got it? Can you see that?

A

Yes, we can see.

C

Okay, so I'm just going to give a quick overview of what this is, then a little bit about Shiny, and then how to do Shiny-like things inside Saite. Saite is basically a self-contained, interactive environment for exploring data, creating and saving visualizations, and creating shareable interactive documents. Actually, this presentation is a Saite document.

C

Some features: there's a main editor, which is this stuff over here, and then you create the document, which is on the right. Those editors can create code, visualizations, Markdown, LaTeX, and reactive components. It automates boilerplate: it has a powerful template substitution system borrowed from Hanami. It's not cellular or modal or linear like typical notebooks. Doc body editors can be live and interactive.

C

You can directly mix code from Clojure, ClojureScript, and R; Python is in the works. There are multiple namespaces, so that within a single document you can have different data scenarios captured without having to reload everything, and you can organize documents along chapters and sections, which is what this tab stuff is. Shiny is an R system for visualizing R computations. It uses the web. It's a push-based design:

C

R code generates HTML and pushes it out to the client in the browser, and it's intended to be a simple and easy way for R users to visualize stuff they're working on. The key aspects involved here are that the processing is server-based, so you can leverage all the capabilities there; the front end provides for user interaction; and it has high-leverage components for widgets and charts, and I think this is probably a big one, this bit here. So, a couple of simple examples.

C

Here's an introductory lesson, a first kind of Shiny app thing, and here's the R code, which is fairly straightforward. You have these very highly abstracted components for the widgets and stuff. Okay.

C

Okay, and then here's an actual interactive one.

C

Or maybe this is going to take too long. Eventually this will come up. So, inside Saite, what do we need to do the similar kind of thing? Data computations on the server: that's not a problem, you can use Neanderthal, tech.ml. Visualizations are provided via Vega-Lite and Vega; it could be others. We're going to have very high leverage via templates,

C

the substitution system. Reactive components are currently layered on Re-com and hiccup (it could be other things), but we need higher leverage for these things, and we can abstract widgets via the template system as well, so that's kind of cool. So, let's just pull up an example.

C

We won't have time to do maybe any more of the rest, so this is running code, and here's...

C

If we come back over here quickly, maybe we can see. Here's the Shiny version of this, and we should be able to update this stuff a bit. Yeah, so that's going on, and every time you do these clicks here, you're going out to the server; it recomputes and then pushes a new version of the visualization. Here we're not doing that, we're pulling things, but we can move this around in exactly the same way.

C

These are again calling out to the server, and we can do the same thing here with this stuff. If we open this up and take a look at how we did it, we have these high-leverage components like slider input and text input. It creates this, and then another one that creates the exact same thing here.

C

I mean, we leverage ourselves on top of Vega and Vega-Lite to do this stuff. Is that the gong telling me I'm out of time? Say yes or no.

A

That was five minutes, Jon, so slowly you can move toward the ending, if you wish.

C

Okay, so I've got a little bit of time left.

C

Well, okay, so let me try and load this other one, which is a little more interesting in terms of its capabilities.

C

So again, these widgets are just... let's take a quick look at the widgets. Go over here. This thing is loading all kinds of stuff, but the definitions are in these.

C

Yeah, like here's a single drop-down, a selection list; we have a slider.

C

Give all this stuff a moment here. We can use this and change our states. We can update the computation for the sliding-window average. You can come over here and look at the similar kind of thing, where this is per state.

C

Let's go update this. Again, these things are all done in exactly the same kind of way as with that other one. So we have our slider input and the text input over here.

C

So this piece right here defines this entire thing. So it's a fairly high-leverage thing, and maybe that's it. Thanks.

A

Thank you so much, Jon. That was magnificent. Earlier we saw one way of building dashboards through the REPL state; now you showed us what may happen when one uses the browser state, and that was really magnificent. Thank you so much. The next speaker will be Mike. And to everybody who has joined us recently: thank you for your patience. We are now having this short sequence of lightning talks, and afterwards we'll have a short discussion. Mike, would you like to share your screen?

D

Yep, let me find the right window here. That one is PowerPoint.

D

Hang on a second.

E

Yeah, all right, I'm going to get started. I'm going to talk about Enflame, which is a visual query builder against a Datomic knowledge base. The domain I work in is cancer immunotherapy. Cancer immunotherapy is a new technique for treating cancer that leverages the body's own immune system, and the Parker Institute

E

does research into this process; we're trying to understand it and make it better. The only thing I can say here is that the immune system is really complicated, and there's just an incredible diversity of data and things we want to represent. So we built a knowledge base called CANDEL, the cancer data and evidence library, based on Datomic.

E

We ingest experimental and reference data, organize it into a common framework, and provide it as a tool that researchers can query. As a knowledge base, it's of kind of medium complexity.

E

This is a graph of all of the classes involved, and I don't know if you can read that. Some of the central classes are subjects, or patients; samples, which are tissue samples that we get from patients; and measurements, where you run some experiment on a sample and see what's in the genome, or take other measurements.

E

So there are about 30 or so classes in CANDEL. Genes are another important data type; they're sort of central. So we have genes and gene variants, and the samples are linked to those. So the question is: we have a lot of users of this database who aren't programmers. How can we make it possible for them to query the database?

E

Here are some typical queries a scientist might want to ask of this, such as: what are the outcomes of patients who have renal cell carcinoma and variants in PBRM1, which is a particular gene, when treated with anti-PD-1, which is a particular class of drug? How do responses vary by gender and body mass index across cancer types? Things like that, and they can get fairly complicated. So that's some background. The other background, and I'm kind of switching

E

gears fast, is Scratch. Scratch is a children's computer programming language that some of you may be familiar with. It's a visual tool designed for children that makes it possible to compose programs by snapping blocks together.

E

It came out of the MIT Media Lab. There's a bunch of similar systems, including Blockly, which comes out of Google, and my big idea was to hook up Blockly with CANDEL to make a visual query builder, and I'll show you what that looks like. The system is called Enflame, because everything we do is candle-related.

E

So this is what Enflame looks like. There's a construction space where you pull out blocks and snap them together to make a query. That query gets translated into Datalog, and then you can view and browse the results down here. So, for instance, here's a typical query: a scientist might want to find subjects with stage four cancer and variants in PBRM1.

E

To build a query, you identify the entities you're interested in and their classes, and then you pull out blocks that represent those categories and snap them together. So this is a block translation of that query.

E

You find subjects: there's a query block for finding the subjects, and inside of that there are various constraints on the subjects, and you can have subqueries in that. It's fairly straightforward.

E

Part of the block construction process looks like this. All of the 30 classes are here, and they each have a color. Because we have about 30 classes, we're sort of pushing the boundaries of what you can encode using color. There are blocks to query each of the classes and the properties something might have, and some of those blocks come with subqueries of their own.

E

And here's an example of a block query and its translation into Datalog.
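To make the block-to-Datalog idea concrete, here is a minimal Python sketch, assuming a query block can be modelled as a nested dictionary. It is not Enflame's actual translation and not the CANDEL schema: the block shape, the attribute names (`stage`, `variant`, `gene`), and the functions `block_to_clauses` and `block_to_datalog` are all hypothetical. The output is Datomic-style Datalog text in which every clause is implicitly ANDed.

```python
def block_to_clauses(block):
    """Return (variable, clauses) for one query block.

    A block is {"find": <class name>, "where": [(attribute, value), ...]}.
    A value that is itself a block is treated as a nested subquery.
    """
    var = f"?{block['find']}"
    clauses = []
    for attribute, value in block["where"]:
        attr = f":{block['find']}/{attribute}"
        if isinstance(value, dict):
            # Subquery: link this entity to the sub-block's variable,
            # then append the sub-block's own constraints.
            sub_var, sub_clauses = block_to_clauses(value)
            clauses.append(f"[{var} {attr} {sub_var}]")
            clauses.extend(sub_clauses)
        else:
            clauses.append(f'[{var} {attr} "{value}"]')
    return var, clauses


def block_to_datalog(block):
    """Render a block as a single Datomic-style query string."""
    var, clauses = block_to_clauses(block)
    return f"[:find {var} :where " + " ".join(clauses) + "]"


# Hypothetical block: subjects at stage IV that have a variant in gene PBRM1.
query = {
    "find": "subject",
    "where": [
        ("stage", "IV"),
        ("variant", {"find": "variant", "where": [("gene", "PBRM1")]}),
    ],
}
print(block_to_datalog(query))
```

Each class name doubles as its logic variable here, so this sketch would need unique variable naming before it could handle a query that mentions the same class twice.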

E

Here's a somewhat more complicated example. You can see you can also do things like counting, and control some of the data that's returned.

E

So, to describe some of the details: like I said, classes are mapped to color. Some blocks have a little output knob, which indicates that the block produces a set of objects, and query blocks have this gap. I'm sort of abusing some of the Scratch metaphors, but that's a way you can add multiple clauses: basically, you can put in query constraints and they get ANDed together in there. As you can see down here, we're querying for subjects.

E

How am I doing for time?

A

So that was six minutes, if you wish.

E

All right, I'm about done. These are some of the underlying components. It's ClojureScript, re-frame, and Datomic. We're using the Blockly library from Google, and part of this is a library called Blockoid, which we've open-sourced and which is a wrapper for Blockly. So now it's relatively easy to create block languages in ClojureScript.

E

I encourage anyone interested in that to make their own. Oops. And here are some of the other people who work with me at Parker on this. That's it.

A

Beautiful, thank you so much. We need to talk about it more slowly one day, really.

F

I'm sorry, is there time for questions at the end, or can I ask questions now?

A

Let us have questions later, if that is okay.

F

All right, wonderful.

A

So the next speaker will be Christopher Small.

G

Yeah, let me share my screen.

G

Okay, does everyone have that? Cool. Let me go ahead and start here. Hi everyone, thanks for being here. I'm excited to be showing you an environment I've been working on for analyzing scalable, open-ended feedback as produced by Polis, which is a digital democracy tool for engaging citizens in decision-making processes.

G

So the first thing I'll show you here is just the repository, up on GitHub under pol-is analysis. If you'd like to take a look and play around with some of this data, I welcome you to do that. It's got instructions for running everything here. Right now, I'm going to just go ahead and kick off the Docker image. Sorry, I missed a piece here. Really, what this has in

G

it is a Docker container which has a bunch of analyses that have been built upon tech.ml.dataset and libpython-clj, and visualizations built using Oz and Vega, and all this comes together in sort of two form factors: one, a custom Clojure kernel for folks who like the traditional Jupyter notebook style environment, and additionally a set of Oz-style analysis notebooks, which I'll be showing a little bit more in a second here. So, kicking this off with docker compose, down at the bottom...

A

Could you maybe zoom in a little bit?

G

Yeah, and actually I need to make it smaller before I make it bigger, because I need to click on this link and it won't work if it's on multiple lines. But down at the bottom here, once we run this, we'll get some URLs printed out for opening up the clojupyter, or the Jupyter, notebook, and now that I've got that, I can make this a little bigger.

G

So once we're in here we can go ahead and click through to notebooks, jupyter, and there's a clojupyter example here. A little bit more? Yeah, sure. I'm not going to go into anything in too great detail here, just that it has some basic examples of loading up tech.ml.dataset and libpython-clj, requiring some of those Python libraries (don't worry about that), and a little Oz example here.

G

So this piece is still a little bit bare-bones, and really what I'd like to show you more of right now is the Oz side of things. Coming back over here, you'll see just underneath the link that we clicked, it says nREPL server started on port 3850, so whatever tool you're using, you should be able to connect to that.

G

I'm using Vim here, and what I'm going to do is first evaluate the polis.math namespace in this project. I just want to point out that this is running in the Docker container, but with the nREPL port file, it's able to find the right connection into the Docker image, with all this stuff baked together. Once this polis.math namespace is loaded,

G

we can then go in and evaluate the user namespace, where we have some stubbed-out code for starting the Oz processes. Now, it should be the case that you can just immediately evaluate the user namespace directly and have it require math, but there's a little bug that's preventing that from happening right now, so just a heads-up if you're poking around at this. Once we get that running, we get a message here: web server running at localhost 3860.

G

So we can go ahead and plug that in here in the browser, and we get this little message saying that it's ready for a spec to load. The first thing I'll show is a basic Vega visualization with a couple of data points, just to kick the wheels here and make sure things are working. But really, what we want to take a look at

G

is this Oz build process here, which kicks off a live code reloading process that gives you a notebook-like environment, something between a notebook environment and a live-coding environment with, say, Reagent or Figwheel or shadow-cljs, if you're used to those tools. Once we've got this running, you'll see that it is looking at the directory notebooks/oz, and so down here we can

G

open up this file in that directory. This is just a regular Clojure file. The only interesting thing is that in this particular mode of evaluation, whenever it sees a literal vector form, it will interpret that as hiccup and render it to the page. So all I have to do to kick-start this is save the file. It sees that something's changed, and you can see over here on the right that it's reloading the file.

G

It prints out information about long-running forms that are being processed, as well as how long they take to run, and once that's done, we get a visualization and scientific document here on the right. It looks like I'm right at five minutes, so I'll try to speed through this.

G

This particular dataset is from a consultation done in Taiwan on how to regulate Uber in the nation. You see there are 1,200 participants here and 50,000 votes, an average of 40 per participant, and we can put all these votes together in a matrix of voter by comment, which is cool-looking but not necessarily very useful. Where this gets a little more useful is when we start applying dimensionality reduction to project this really high-dimensional dataset into a lower-dimensional space that we can visualize. So here is a PCA projection where we're coloring by groups that have been assigned using clustering algorithms, and we can interact with that using this hover over here. And here we're looking at another dimensionality reduction called UMAP, which has a few more degrees of freedom for finding finer-grained structure. There are some interactive features to demonstrate here as well.
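The voter-by-comment matrix mentioned above can be sketched in a few lines of plain Python. This is only an illustration, not the Polis code: the record layout `(participant, comment, vote)`, the vote encoding (1 for agree, -1 for disagree, 0 for pass), the sample records, and the function name `vote_matrix` are all assumptions.

```python
def vote_matrix(votes):
    """Pivot (participant, comment, vote) records into a matrix with one row
    per participant and one column per comment. Cells with no vote stay None."""
    participants = sorted({p for p, _, _ in votes})
    comments = sorted({c for _, c, _ in votes})
    row = {p: i for i, p in enumerate(participants)}
    col = {c: j for j, c in enumerate(comments)}
    matrix = [[None] * len(comments) for _ in participants]
    for p, c, v in votes:
        matrix[row[p]][col[c]] = v
    return participants, comments, matrix


# Made-up example records; assumed encoding: 1 agree, -1 disagree, 0 pass.
VOTES = [
    ("p1", "c1", 1),
    ("p1", "c2", -1),
    ("p2", "c1", 1),
    ("p3", "c2", 0),
]

participants, comments, matrix = vote_matrix(VOTES)
for name, cells in zip(participants, matrix):
    print(name, cells)
```

A matrix like this is sparse, since most participants vote on only some comments, so the missing cells would need some treatment before a projection such as PCA or UMAP is applied; that step is not shown here.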

G

Here, we can actually see how these two projections relate, which is just really cool. So I think I'll kind of wrap up here with some longer-term goals for the project. I'd like to eventually migrate the core Polis tool itself to this tech.ml.dataset and libpython-clj stack, as appropriate.

G

Right now it's just kind of an experimental place where we can build more custom analyses and tool around and investigate the data. The other thing I'm really excited to explore is finding a way to take this API and make it something that people can use from the Python ecosystem, so that data scientists who are more familiar with that can take advantage of this same kind of core logic in playing around with some of this civic data. So yeah, I think that's it.

G

Thanks, everyone, for listening, and I look forward to answering questions later. Thanks.

A

That was enlightening, to see your workflow and the UMAP versus PCA; so much to think about now. And now the next speaker will be Sivaram. I think... are you here? Oh yeah. Would you like to share your screen? Yeah. Thank you, Daniel.

I

Let's see, are you able to see my screen now? Yes? Oh, okay. Thank you, Daniel, and thank you all for having me here. This is my first time here, and I've been enjoying listening to the different talks and the tools. I'm a clinician, a physician by training, doing clinical ontology and knowledge engineering.

I

So what I'm talking about here, in terms of imprecise data, is coming from that ontological perspective. Imprecise means lacking exactness and accuracy of expression or detail, and so the question is: when it comes to data, why do we care whether we do things very precisely or not?

I

We do care because it has very serious consequences. For example, many years back, probably more than a decade now, we had the Hubble telescope, which was launched into space, and after launch they found out that the images were not sharp. Finally, they traced it back to an issue with one of the mirrors, which was made with about a one-millimeter difference, and that one-millimeter flaw led to a huge problem in terms of images being sharp or not.

I

Now, I come from the clinical space, so I want to give you an example from medicine. We do something called the eGFR, the estimated glomerular filtration rate, which is necessary for diagnosing or staging chronic kidney disease.

I

The eGFR is calculated based on serum creatinine levels; the normal values for creatinine run from around 0.84 to 1.21 milligrams per deciliter.

I

Now, there are different ways of measuring it, but the main ones are the Jaffe and enzymatic methods, and depending upon which method you use, the difference can be up to about 0.2 milligrams per deciliter.

I

Now, when you do calculations based on this, it can lead to a bigger difference in the results. The calculated eGFR can vary quite widely depending upon which method you use. So at an individual level, if you are calculating the eGFR and trying to see whether the patient has chronic kidney disease or not, you could be off by a little bit, and a patient who does not have chronic kidney disease could very well be diagnosed with kidney disease and started on treatment. Now, the source of imprecision in data can be

I

twofold. One is at the time the data is generated, because you're using different methods, different instruments with different calibrations, and things like that. But today what I want to talk about is more the storage, computation, and exchange of data, where imprecision can be far more insidious, in the sense that we don't even realize it.

I

This is an example, something that I found in Java about 15 years back when I was working on some data: simple calculations using a float can lead to an imprecise number. I tried to replicate that in Clojure, and 15 or 20 years later, it's giving me similar results. If you add 3.3 and 3.3, you get 6.6, but if you add it three times you get 9.89999 instead of 9.9.
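The same effect can be reproduced in Python, whose `float` is a 64-bit binary floating-point number (the talk's original examples were in Java and Clojure, so this is an analogous sketch rather than the code shown on the slide). The variable names are illustrative. The standard-library `decimal` module is included as one way to keep such sums exact in base ten.

```python
from decimal import Decimal

# 3.3 has no exact binary representation, so every sum carries a tiny error.
two = 3.3 + 3.3
three = 3.3 + 3.3 + 3.3

print(two == 6.6)    # True: doubling is exact in binary
print(three == 9.9)  # False: the third addition has to round
print(three)         # 9.899999999999999

# Decimal arithmetic on string literals stays exact in base ten.
exact = Decimal("3.3") + Decimal("3.3") + Decimal("3.3")
print(exact == Decimal("9.9"))  # True
```

Rounding the single result for display hides the discrepancy, but it does not remove it from later computations that reuse the unrounded value.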

I

The the problem with this is as by itself you round it you'll get the right answer, but once you start multiplying it, adding it to other things, subtracting and do all kinds of computations. The end result can be very uh problematic now when it comes to dates. uh I'm seeing that we have a huge issue- and this this course in in medicine, when we're dealing with a lot of events and trying to figure out what happened before and what happened later.

I

It's a big issue. Now, if you want to create a date instance in Java or in Clojure, this is what you get: a year, month and day, and then the time, up to the millisecond level.

I

The problem is that in real life we don't deal with events up to this precision. If you have a birthday, you're typically dealing with year, month and day. If somebody has died, you may be saying, okay, something like "in 2008", and if you have a diagnosis you could be saying, "Oh, that happened in January of 2011." And when it comes to things like procedures, it could be as vague as saying, "I had this procedure about a week before Christmas."

I

Now, how do you represent this when your precision goes up to the millisecond level? When you say something like "January 2011", I found that different systems, in different places, compromise in different ways. They store January 1st, 2011, or January 31st, or sometimes they go to the middle, January 15th. So when you start comparing dates, what happened before and what happened later, this causes problems.
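One way to avoid picking an arbitrary day is to keep only what is actually known and compare the resulting date ranges. The sketch below is an illustration of that idea, not something shown in the talk; the `PartialDate` type and `definitely_before` helper are hypothetical names, and it assumes a day is only ever given together with a month.

```python
import calendar
from datetime import date
from typing import NamedTuple, Optional, Tuple


class PartialDate(NamedTuple):
    """A date known only to year, year-month, or year-month-day precision."""
    year: int
    month: Optional[int] = None
    day: Optional[int] = None

    def bounds(self) -> Tuple[date, date]:
        """Earliest and latest calendar day consistent with what is known."""
        first_month = self.month or 1
        last_month = self.month or 12
        first_day = self.day or 1
        last_day = self.day or calendar.monthrange(self.year, last_month)[1]
        return (date(self.year, first_month, first_day),
                date(self.year, last_month, last_day))


def definitely_before(a: PartialDate, b: PartialDate) -> Optional[bool]:
    """True or False when the order is certain, None when the ranges overlap."""
    a_lo, a_hi = a.bounds()
    b_lo, b_hi = b.bounds()
    if a_hi < b_lo:
        return True
    if a_lo >= b_hi:
        return False
    return None


diagnosis = PartialDate(2011, 1)        # "January of 2011"
procedure = PartialDate(2011, 1, 20)    # a fully known date
death_year = PartialDate(2008)          # "something like in 2008"

print(definitely_before(death_year, diagnosis))  # True
print(definitely_before(diagnosis, procedure))   # None: cannot be decided
```

Returning `None` makes the uncertainty explicit instead of letting a made-up day of the month decide which event came first.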

I

So this is one of the biggest areas where I face issues: handling dates in medicine. Now, Mike earlier talked about querying for data, where he mentioned, you know, "find all subjects with stage four cancer" and things like that. Now, what does "subject" mean? What does "stage four cancer" mean? These are things that look very simple, but there's a lot of ambiguity when it comes to the exact meaning of these terms, and that's what ontology is all about.

I

So we deal with a lot of situations where a single term can have several different meanings. For example, the word "cold": if you find it in a sentence, it could mean feeling cold, or "I have a cold" (an infection), or it could mean chronic obstructive lung disease.

I

So here is something that we found when we were looking at actual clinical data. When we come across a sentence which says "no recent catheterizations", what does it mean? And here are all the different meanings for catheterization.

I

So precision is very important when it comes to dealing with data, and it's very important to have the exact meaning, whether it is numbers or dates or terms. They all follow the same pattern: it's very necessary to have the exact meaning for these things.

A

Thank you so much. It has opened so many questions, but at least conceptually, to me, it helped to see your way of conceptualizing this. And so the last talk before the discussion will be by Lukas. Oh, you're here. Hello, can you hear me? Yes.

J

All right, let me share my screen. Can you see my screen? Yes, that's great. I'd like to talk about differential privacy, or how you can protect private data with Clojure.

J

Unfortunately, anonymization does not work. Basically, you can cross-reference and de-anonymize data sets. It happens all the time, it's really horrible, and there are some horror stories you can read online. So we need something better. Fortunately, now we have differential privacy. Researchers now call it the gold standard for privacy, and the U.S. Census has already been using differentially private algorithms. Big corporations are doing it. There are more and more open source tools, but the idea is not widely known, which is why I'll try to explain what it is.

J

So the general idea is that your private data is safe if the query cannot even reveal whether your data is there in the data set or not. Here we have a data scientist. He's looking at some query results and he cannot even figure out whether they are coming from the full data set or from one where all your data is missing, and this holds for all individuals in the data set.

J

The way I just put it is a little too strong. That would mean we have perfect privacy, which does not exist, and it would also mean the queries produce results completely independent of the data sets, so that would be useless, right?

J

So this is what we really have: here's the definition of what it means for an algorithm to be epsilon-differentially private.

J

Basically, this parameter epsilon here expresses the trade-off between privacy and utility. For perfect privacy we would have epsilon equal to zero. This thing would then be one, and for all pairs of such data sets, we would not be able to distinguish between them. So we have to set this parameter to something higher than zero. This value should not be high, but it has to be higher than zero.
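The slide with the definition is not captured in the transcript. The standard textbook statement of epsilon-differential privacy, which is presumably what is being pointed at, says that a randomized algorithm M satisfies the following for every pair of data sets D and D' that differ only in one individual's data, and for every set S of possible outputs:

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]
```

With epsilon equal to zero the factor e^epsilon is one, and because the inequality holds in both directions, the two output distributions are identical: the results then carry no information about any individual, and none about the data either.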

J

We call it a privacy budget, and this is still a very strong property. So this is the definition and the desired property. To actually achieve this, we need to use random noise. We can either add random noise to the data itself, which is called local differential privacy, or we can add noise to the query results, which is called global differential privacy. In the latter case there would be a trusted curator.

J

We would send the query to the trusted curator, who would compute the query results from the data set, add the right amount of noise and send it back to us. So we would have this interactive mode of work here, and it seems this whole differential privacy field is moving more towards this model, where there's a trusted curator and the noise is added to the query results, not to the data itself.

J

We need enough noise to protect individuals' privacy, but not too much, so that analysis is still good. And even machine learning with differential privacy is possible. It's all fascinating, but I have to move fast, and before I show something that works, I need to talk about OpenMined.

J

We are an open source community working on all kinds of privacy-related technology. There will be a free online conference at the end of September. If you're interested, check out those links. I'm a member of the differential privacy team in OpenMined, and we have a little Clojure library for differential privacy.

J

It's actually a wrapper for a Java library from Google, and there's great value in having this Java library under the hood. Namely, in differential privacy we have to worry about attacks on the implementation, so it is a little similar to cryptography in that way. So having this audited and battle-tested library under the hood is a great benefit for us. All right, I'll show you a notebook real quick. Here's a bar chart showing counts of visits per hour in a restaurant.

J

The red bars show the true, unaltered numbers of visitors per hour in the restaurant, and the blue bars show the same values, but with a little bit of noise. The noise is generated randomly; it's Laplacian noise, if you're curious. So you can see the blue bars are slightly different from the red bars.

J

There's some distortion, but not too much, so you can still see the pattern. Even from the blue bars themselves you can see, for instance, when the restaurant is busier. To compute these differentially private values, that is, to add the right amount of noise, we need to use functions from the library. Here's a count function for counting visits. Basically it takes a collection, and then you need two extra parameters.

J

The first one is the privacy budget, and the other parameter, long story short, means how much an individual can contribute to the algorithm's result. In general, the more an individual can contribute to the algorithm's result, the more noise needs to be added. I hope it makes sense. It's all about hiding individuals' contributions. So you can compute a differentially private count, you can compute a differentially private sum of elements from a collection, also a mean, and some other functions will be available. It's all work in progress.
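The relationship between the two parameters and the amount of noise can be sketched with a textbook Laplace mechanism. This is not the API of the Clojure wrapper or of the underlying Google library; the function names and parameters below are hypothetical. The sampler is also deliberately naive: as mentioned in the talk, real implementations need hardening against attacks on the implementation, so this sketch should not be used on real private data.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(items, epsilon: float, max_contribution: int,
                rng: random.Random) -> float:
    """Count of items plus Laplace noise with scale max_contribution / epsilon.

    epsilon is the privacy budget; max_contribution is how many items a
    single individual can add to the count.
    """
    scale = max_contribution / epsilon
    return len(items) + laplace_noise(scale, rng)


rng = random.Random(0)
visits = ["visit"] * 42  # hypothetical hour with 42 true visits
print(noisy_count(visits, epsilon=1.0, max_contribution=1, rng=rng))
```

A smaller privacy budget or a larger per-person contribution both increase the noise scale, which matches the description above: the more an individual can affect the result, the more noise is needed to hide them.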

J

The Java library is a work in progress and the wrapper is also a work in progress, and that's it. Thanks for listening.

A

Wonderful, thank you so much, Lukas, and on that too, we will wish to have a longer talk. So these were the lightning talks, and now we have some time to discuss.

A

Now, we have 17 minutes to the official end, and for some people it is a late hour at night, but I guess after the official end some of us may wish to stay longer because of all the trouble in the beginning.

A

So anyhow, let us try to be short, and my advice for this discussion is maybe try to ask questions more than trying to.

A

That is the most suitable for this kind of meeting, and let us chat. Anybody?

F

Yes, I have a question regarding the Datomic queries. Is there any speed problem in using Datomic for the kind of queries that you are doing?

A

And that is to Mike, I guess, yeah.

E

I guess, yeah, Datomic's not the world's fastest. We actually are working with Cognitect, who are building a query optimizer engine for us, since they have something against putting that into Datomic itself. So we've been doing some work on that. So it's not the fastest graph database in the world.

H

So I have a question to Mike as well. If I understood correctly, the database has a graph model? Or which model is it, if it is not a graph?

E

Let's see, I'm not sure I understand the question. We have the...

H

I mean, actually my question is: why did you choose Datomic over another database, whatever it could be?

E

I think we chose a graph representation because it was necessary for the complexity of the data domain, and we chose Datomic over another graph database for more social reasons than anything else.

E

That was before my time on the project, actually. I mean, at the time, you know, there are other databases, like RDF databases, which from the representation and query perspective are pretty similar to Datomic.

E

So I don't know, I hope that answers your question.

H

Yeah, I mean my question was which model you use. So it's a graph.

E

Yeah, it's a graph. We've sort of imposed a schema over it, which was the subject of that graph diagram I showed. There are about 30 classes, and a class can have, in our case, sometimes many hundreds of thousands of entities associated with it. For instance, I think measurements are the biggest class in CANDEL, because that represents basically every piece of information you get from a sample.

E

So there are hundreds of subjects, but hundreds of thousands of measurements, and samples are somewhere in between. And one thing I didn't talk about: I wrote a system called Alzabo, which generated that diagram and manages the schema. It makes it a little bit nicer to create and manipulate schema-level information with Datomic, and that's something I'm working on and hoping to open source soon.

G

Just wanted to say, I love seeing that you're able to open source the ClojureScript, what is it, a Blockly wrapper? That seems like a really cool thing for being able to expose computational functionality to folks who aren't ready to dive into code yet.

E

Yeah, I'd be very happy to hear from people who are using that, or to help them get started.

E

Blockoid, by the way, is the name of that library.

G

Yeah, I think I have some kid projects in mind.

I

So I'm just curious whether people are doing any work with directed acyclic graphs, and whether there are any particular libraries or anything that you're using for that work.

I

Most of the ones like Vega-Lite don't have that functionality; you have to dip into either D3 or something else. I think Loom is another one that has some functionality, but otherwise there's not a whole lot of choice out there.

G

Yeah, I think you can use Vega for doing some of that. I mean, you can do some graph work directly from Vega-Lite, and some of the layouts have gotten a little bit better there recently. I'll just add, I'm always pitching Vega. One thing I love about the team is that they are just so responsive to requests and questions.

G

I was trying to build a phylogenetic tree visualization toolkit and didn't have the right layout to make things look right for a phylogenetic tree, and they added it. I don't remember exactly, it was like a few weeks later. It was pretty awesome. But yeah, I'd have to look, but I think if you can't do it with Vega-Lite, you'd be able to do some basic stuff like that with Vega.

C

I think you're right, you can do basic stuff, but I've talked to the IDL folks as well about some of this, and they realize there are a lot of limitations there, and it's just not high on their radar. It's not something that they're really going to get on the roadmap very soon.

C

The thing that I would suggest is probably Cytoscape. That's probably the best graph thing out there for visualization. It's just amazing what you can do with it.

I

Yeah, I've looked at Cytoscape, but then decided to stick to D3 and experiment with that.

K

All right, this sounds much more low level, but okay.

I

Now, I'm not a programmer or a developer, so I do most of the Clojure work from a hobby perspective. One of the challenges that I find is, you know, you guys are showing some amazing tools and all that stuff, but the learning curve to get set up with any of these tools is quite steep, and if you're not working with them constantly, then it's really difficult to wrap your mind around them.

I

You know, to see how it works and all that. So I think one of the things that will help, and I know a lot of this is open source, voluntary work, is if you can put together a small "hello world" kind of thing for each of the tools, and go from there: take a single example and make it more and more complex, rather than giving different examples.

I

Just my two cents.

C

Yeah, I think there are actually two pieces to that question or point. One is the idea of

C

getting the infrastructure up and working at all, and the second part is understanding the details of the particular system or library or whatever. So those are two different things in my mind. For something like Saite, you really just need Java, Java 8 right now. It still hasn't really been ported to 11, but that's all you need. It's a self-contained uberjar, and it will even install the MKL libraries for Neanderthal for you.

C

So you don't need... I mean, well, you still have to... If you know Emacs, the editors are all basically Emacs-like; they're not...

C

But they have all the CIDER-type bindings, and you can configure all of that stuff if you want. It has all those as default bindings, and you can use them if you're a Vim person or Sublime; that's available, but the default out-of-the-box thing is CIDER.

A

Yeah, maybe I will comment about that.

A

I think yes, one of the problems we have is that many of the tools and libraries are not yet accessible enough, and that is the stage where we are at and I think a lot is still missing in terms of functionality in the ecosystem.

A

And what we may imagine is that after several months, some of the missing pieces will be there, and then we hope we will be able to focus on writing tutorials and reaching out to more people.

A

Tutorials and some other people are writing tutorials, and we are kind of in the beginning of this stage, and I hope that in a few months things will be different.

G

Yeah, if I can add something to this too. I didn't have a lot of time to touch on this, but one of my goals with the Docker environment that I put together and demonstrated is this: as John Anthony was pointing out, Clojure, by virtue of being on the JVM, is pretty good about just running and being repeatable and not being different on different systems, et cetera. But now that we're bridging out into the space of creating better interfaces with languages like Python and R, some of those environmental complications, especially with Python, can be quite challenging.

G

I mean, if you've worked with Python for any period of time at all, you've probably had an issue with virtual environments, and this version of that not working with that, and it's still a mess there, unfortunately. So one of my goals with this was just to put together this package where it's all baked together, with a Clojupyter jar baked in.

G

So again, if you like using it from the notebook environment, which is easier for some folks to get started with, then you have these tools put together there. And if you want to branch out into working more directly from a REPL connection, as I was demonstrating and as John Anthony was pointing out, then that stuff is all there for you, too.

G

So I'd encourage people who are interested in building environments like that. I mean, I realize what I put together here is very specifically focused on

G

these kind of polish data sets, but I'd love to see people taking the guts of that Docker environment and repurposing them for other uses, because I think there's a lot that we can do now that we have this ability to stitch together these different ecosystems, but it becomes a problem

G

if you have to set up those environments yourself. I mean, it's hard to set up a Python environment, and setting it up to merge with Jupyter and with all this Clojure stuff can be a little bit daunting if you're not so oriented. So, just throwing that in there.

C

Yeah, I totally agree with that. On the other hand, I would say Python is hands down the worst. R isn't anywhere near as bad. Yeah, I agree.

B

But on that, may I ask Sivaram whether or not you have some sort of a reference point? When you say that maybe getting started is not as good in Clojure, or it's not as easy, what are you comparing that to? And what is it: is it the tutorials, is it the Docker container, that would make it easier?

I

So I actually love working in Clojure. I had actually given up on doing any programming work about 10 years back. Before that I used to do a lot of stuff using Java, mainly, and then I got tired of Java with all its verbosity and everything. It was painful to set up and take down every single thing, so I moved more into management. But of course the development itch was always there, and then I came across Clojure, and yeah.

I

It was a steep learning curve, but I stuck to it, and I actually enjoy working with Clojure, so pretty much everything I do now is in Clojure. So don't take me wrong when I say that there should be better tutorials or better "hello world" kinds of things. Even after five years I find that with some of the new libraries, when I want to bring them in and start using them.

I

It's not that easy to get up to speed with them, and many of them don't seem to work in the way that's described. Maybe it is because of version differences or whatnot, but there are still challenges. Now, a lot of them work very well. It's just that getting up and running with them can be made a little bit easier.

G

Yeah, I don't want to stray off this topic too much, because I think it's an important one. But there are some questions I wanted to throw out there around some of the interactive features, in particular what you were demonstrating, Daniel, as well as what you're working on, John, and some things that I've been thinking about for Oz. So I guess I'll put it out there that I'd like to ask questions about that at some point.

G

But I'd also like to leave more space if folks want to continue talking about this particular thread.

M

Well, I'm enthusiastic to hear the answer to your question, because my experience in data science is detecting the difference between comma-separated and tab-separated files.

M

So, for example, I think that I've seen three different notebook kinds of solutions demonstrated here, either as the main event or as some factor in a presentation. There was Clojupyter, and Oz, and Saite, and Tablecloth, and I'd be interested to learn more about what they're based on and how they differ, how you choose which one, and what's possible to accomplish: whether their strengths are mainly in the discovery mode, like a REPL, a rich REPL, or in a presentation mode, a supercharged PowerPoint.

A

Yeah, so I guess one way to address that is to have a sequence of small meetings where we will be discussing each and every one of these tools. I think both Saite and Oz have had some detailed presentations that are available in video, but they have evolved since then, and yes, we need to keep describing what we are having here. Let us have that in further meetings. We are now at the official end, and in a moment we will say goodbye.

A

Does anybody wish to say anything else before the goodbye?

N

I have a quick question for Lukas. Perhaps I don't understand it well enough. I actually have access to a large number of customer contact center transaction logs involving digital chats, phone calls, web self-service sessions, etc. I was wondering if his privacy package might be applicable to create larger and scalable data sets while sufficiently protecting individual privacy, or whether that's not the purpose of the library. I'm just curious about whether it could be applied that way.

J

So it's not production ready yet. Can you hear me? Yeah. So this is exactly it: financial data, medical data, data about crime, this type of stuff is exactly what we care about and why these tools are there. In general, there is this need to protect some data and, at the same time, somehow allow analysis or even machine learning, and there's so much.

J

We could do so much with machine learning on medical data, for instance, but it's all locked. It's somewhere there in hospitals, and for good reasons it's protected, it's private. And the same thing with financial data. Yes, that's exactly what these things are for. What I just showed you is something very small. It's all work in progress, not production ready at all, but I encourage you to go to openmined.org. We have all kinds of solutions there, and there are educational materials. We're all about that.

J

We want to lower the barrier to entry to all this privacy-preserving technology in AI and data science. We offer teaching materials and open source tools for free, also consulting, so go check out those links.

G

Yeah, and if I can ask my question of Daniel real quick. I think this is hopefully a short question. You had the thing with the slider where it updated the visualizations, and you made it sound like the slider was updating state on the server, in the JVM process, and that was triggering processing that flowed back through to the browser. So you said that's using JFX, is that right, or JavaFX?

A

Yes, yeah yeah.

G

Can you say a little bit more about (a) how that works and (b) what the API looks like for that, as far as when you're creating your notespace, how you actually wire those things?

A

Perfect perfect.

A

I will just say briefly, and afterwards I will make a video about that one day. I guess we have seen different dashboard-like systems today. One was by John, which was magnificent, and what John was showing us was how we can manage state in the browser, in the client, and use the back-end JVM just for computation and as a data source. And a similar library

A

that does this way of dashboard building is called Goldly, by Florian and Andreas, who are here today, and it is also worth looking into. And what Notespace does is different.

A

Notespace uses the state of the JVM as the dashboard state, and it is built for a different use case: not for building dashboards which can be served on the web to many users, but as a tool for one person with the REPL. The idea is to connect the browser and the REPL, and then indeed the challenge is to manage state in the JVM. Our friend Vlad has built a Clojure library called cljfx.

A

It is a wrapper of JavaFX for building desktop applications, but Notespace does not use the whole of cljfx. It only uses the core of cljfx, which is used for managing state in the JVM, and this core is similar to what you may see in ClojureScript libraries for client development, but kind of different and refreshing.

A

In a sense, it is all around pure data transformations, and it is really a beautiful model worth looking into. You can look into contexts and subscriptions in the README of cljfx, and everything is there, and we will discuss it further soon.

A

So thank you for that, and I guess maybe that is a good moment to say goodbye. Thank you so much to everybody here who has been here at unusual hours, at the end and beginning of their day, and I guess after the goodbye we can stay and keep chatting and keep recording if you wish.

A

But at this moment, thank you so much to everybody who needs to leave, and see you next time. We're only scratching the surface, of course, and we need to talk again more and more. And now let us keep chatting if you wish.

A

So, anybody? Any topic, any question or thought?

H

I have a question which has actually nothing to do with the data.

H

I started working with Clojure a few years ago, but I'm still a beginner because, as Sivaram said earlier on, when one loses their rhythm, you know, if you don't use all the tools every single day, you forget how to use them.

H

So my question would be: is there some kind of mentoring activity around that anyone knows about? I mean, I know there's ClojureFam, and I signed up to it, but unfortunately the problem is that there weren't enough mentors.

H

So it's a little bit disappointing, because it's nice to have this one person who knows more than the rest, so that person can lead and help the others. Yeah, that's mainly my question.

A

I think Theodore knows something about mentoring projects, right, Theodore? Or maybe I'm wrong.

O

Well, we did try to do some mentoring work on Athens.

O

I was involved in Athens development for a while, which is a tool to build a knowledge base, a clone of Roam, basically. Our strategy was to get more Clojure developers by teaching Clojure, and then we tried to match up mentors and mentees. It seemed really simple on the surface, because we wanted to make something that was scalable and we just needed to attract people.

O

But when you get into the weeds it gets difficult to organize, because you need really invested mentors and really invested mentees, and what do you do when some people sign up and don't really want to contribute, and other people have different motivations in mind? But that's the downside of the story. There are cool parts as well, because sometimes they just hit it off and it works fantastically for a while.

O

I was doing code reviews for a mentee who was working through Clojure for the Brave and True. He put his exercises in a gist and later on YouTube, and then I did code reviews, which was interesting. So, an interesting experience. I don't know whether that really answered your question. I don't think I understood perfectly what you were asking for.

H

Yeah, in fact, ClojureFam, I think, is organized by the same people from Athens, so yeah.

N

Yeah, on that line, I've tried to figure out where to go, whether it's Gitter or the Clojure Slack or some kind of Mastodon group or whatever. I've posted a number of different questions, even on Reddit, and I'm not able to figure out yet exactly where the right place is to ask questions.

O

About ClojureFam?

N

No, not about ClojureFam itself, but just specific things that I get blocked by, where I'm expecting somebody would have a quick answer. Some of those forums have dozens of people signed in, but nobody really active. So I'm just trying to figure out where the action is taking place.

H

I found many answers in the Clojurians Zulip channel.

N

Okay, Zulip, okay.

H

I asked in the beginners thread and I would normally get answers right away, actually very useful answers, by the way.

H

So that was very important, and it's very rewarding to know that if you ask, someone will answer.

N

I used to be a Lisper way back when. Interlisp was my favorite language, on a Xerox Dorado 36 machine, but in the intervening years I got tired of Java. It was very verbose, and the whole Rich message about that was right on. So just recently, five or six months ago, I re-engaged with Clojure, and I'm having more fun than you are, I suspect. So I'm really trying to get back into it myself.

A

Beautiful. And Leandro, I just wish to say that I remember your ideas about mentoring and the discussion we have had about it, and I think you have had a beautiful vision of how we could organize a mentoring system where we could help each other, so let us try it out. I think it will be challenging, as Theodore said, and the challenging thing would be to find some continuity, some people who can commit to a process. Let us keep trying.

H

Yeah, actually, the strongest way through which I got into Clojure was via a Clojure developer I met here. He was living here in Buenos Aires, and we got together and did programming, and it was really awesome. It was very important to have someone to ask, "How do you do this simple little thing in Emacs?" "Ah, like this." "Okay, then I'll keep on working."

H

So, things like that. Sometimes it's not about asking big questions about the language or the ideas or the paradigm, because for me that's not really very difficult. What really makes the difference, at least for me, is being able to have a fluid workflow.

H

For instance, a couple of days ago I was swamped because I wanted to create a new namespace, a new file, actually, and I was asking myself: isn't there a way to create it automatically, so that Emacs will take care of it, like Eclipse does when you want to create a new class? Something like that. And I ended up doing it by hand.

H

It was quite annoying, because I didn't know how to. Probably there is a way, but I didn't know it, and I still don't. So, things like that.

H

I know there is also very good documentation from Practicalli. That's a very good project, I like it very much. But of course it cannot cover everything, and those are the things that I could not easily find.

G

Yeah, if I could throw something out there, and this is kind of in response to your question, Sam: I think sometimes it's not just finding the right place, but the right space within the place, if that makes sense.

G

So, for example, if you're on the Clojurians Slack group, you might ask some kind of question in a general room and really not get much of a response, but then find that there's actually a more focused group somewhere, say talking about DataScript or Datomic, where you can get answers to questions really quickly. So I think it depends a little bit on the community and on what the particular question is.

G

I wish I had a better answer than "find the right place", but I think for the data science community, a lot of us are on Zulip now. People are still on the data science channel in the Clojurians Slack, but there's a lot of activity in Zulip as well.

G

So I think it kind of comes down to what the question is. Sometimes tooling questions may feel sort of ancillary or secondary to core questions about how things work in the language, but they're also really important, right? So much of my workflow being something that I can be fast and effective in comes down to just how I have my editor and my environment set up, and it's sometimes hard to know where to ask those questions.

G

But yes, I'd just say that poking around, and not being afraid to ask in multiple places sometimes, to figure out which questions get answered where and with what frequency, is maybe the best advice I could give.

G

Also, I guess, one further thing: asking questions on Stack Overflow. I feel like it maybe doesn't happen quite as much as it used to in the Clojure space, but it's great that when you do ask questions there, if it's the right kind of question, those things are then really easily searchable, which is nice, and the answers can be improved upon again.

C

Isn't there that new Ask Clojure thing that Cognitect put up or something? Yeah, I haven't used it at all, really. I mean, it sounds good, but I don't think I've ever even been on it.

M

Do check it out. It is especially useful if you have the suspicion that your question might lead to a Jira ticket against Clojure or ClojureScript.

M

Because it seems that the Clojure Jira is browsable publicly, but I don't think just anybody can submit a ticket.

M

So ask.clojure.org is a very nice way of raising a question which might indeed have a useful answer, but in case it doesn't, the right people will notice and echo it in a ticket in their Jira.

M

On the other hand, if you are looking for more conversational answers, there is a website called ClojureVerse, which seems to have taken up a lot of the activity that used to be on the Clojure Google group.

G

Yeah, the Clojure Google group used to be kind of the place to go, I think, and I've noticed way less activity there. So that makes sense, that's kind of where that's gone.

A

Wonderful. Does anybody else want to raise some topic, idea, question, or issue?

P

Is another meeting like this one planned for the future?

A

Could you repeat, please?

P

Do you have plans for the following meeting? Are these meetings regular?

A

Yeah, so um we are thinking about it.

A

First, let us say that all this community activity is open to anybody who wants to take part and influence where we're going and how we do that. So if you have any opinion or thought about what we need to do with our community, let us talk. And yes, there are plans to have public meetings of this kind every month or two. Next time will not be at this hour.

A

It will be at an hour which is more comfortable for the friends in East Asia, which is earlier. What we are planning to do more of is small meetings, like a small interview of one person telling their story and a few others listening and asking questions. This kind of format is something that we are going to do more, because we also want to have more focused discussions alongside the public ones. Does it make sense? Any thoughts about it?

P

Yes, sure, because it is wonderful to have the chance to have access. I'm living in Buenos Aires, Argentina, and here the Clojure community is almost non-existent, so having the chance to have direct access to larger communities all around the world is wonderful for us. That's why I'm interested in keeping in touch with you.

A

Yeah, perfect, let us have more of it.

P

Yeah yeah sure.

G

I'll add that I really like having more focused sessions, with one person talking about a thing that they're doing and being able to dive in a little bit deep on that.

G

But it is also really refreshing to have, every now and then, something closer to this format, with lightning talks and a little more open discussion, because you can take in a more bird's-eye view of what folks are working on, and it's been fun today to see that from folks.

I

Yeah, I agree with that sentiment too. And regarding the smaller meetings, what about something like doing workshops on a regular basis, where one person, like Chris or somebody, can go with the tools that they're working on, start from scratch, build up a project, and work through some small, well-defined problem in a couple of hours? I'd love that. Everybody else can follow along with that.

G

Yeah, that's great, and that seems to thread into some other stuff we were discussing about how we can onboard more people. Yeah, I like that idea.

A

Do you mean to have an open workshop of this kind, or something that is recorded and shown afterwards?

I

I think having it recorded is always useful, right? Because a lot of times you want to follow along, but practically it's not that easy, and you may want to go back and pause it, get your environment in order, and then continue from there.

I

So I think recording it would be really useful. And part of the reason is that when you have a workshop kind of thing, you have some discussion going on and people are asking different kinds of questions, and I think that adds to the value of the workshop itself, rather than somebody just giving a tutorial, which is one-sided.

N

I mean, I'll suggest one example. I don't know what I've been doing, maybe I've been trying to use Java 14 too soon, but I can't get Gorilla working. So maybe I've got to go back to, yeah...

I

Gorilla doesn't work on, you know, anything other than 8, so...

N

So I didn't read the fine print there.

C

Yeah, it's kind of a problem, because I

D

mean, like, Saite

C

doesn't quite work on 11. I guess, isn't 11, though, the largest version that Clojure, that Cognitect, actually claims to officially support?

G

Yeah, I think so. I think what happened was that there were some changes to which libraries came pre-packaged with Java, somewhere between 8 and 11.

G

It's subtle things, because while Clojure, I think, by itself is great at maintaining backwards compatibility and things just working years later, if something changes in the underlying Java system, then you're kind of screwed. I think there hasn't been too much of that, but it's funny how just the one or two things that have caused friction seem to grate again and again.

C

When they changed that whole, I guess they tried to call it a security model, the packaging and stuff, from 8 to 9, that was a complete mess, and really I think that was one of the dumbest things they've ever done. But I guess if you can get past that, then the difference between 10 and 11, or 11 and 13, or whatever, is not that bad. It was that move from 8 to 8-plus that killed everything. It's amazing how many things just don't even work on 9 or anything 9-plus.

C

So, I mean, Saite kind of works on 11, but not really, and it's mostly because they just haven't gone in and fixed all these things that the JVM broke. And as bad as it is, I guess from surveys it looks like something like 70 to 80 percent of all JVM shops still run on Java 8.

C

Mostly because of this: you know, all their code breaks if they move to something else.
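
One practical wrinkle behind the Java 8 versus 9-plus discussion is that the version string itself changed format: Java 8 and earlier report themselves as `1.8.0_...`, while Java 9 and later report `9`, `11.0.8`, and so on. What follows is a minimal sketch, in Python, of a check that a setup script could run before launching a tool. The function names and the `allowed` set are hypothetical; which versions a given tool really supports has to come from that tool's own documentation.

```python
import re


def java_major_version(version_string):
    """Return the major Java version as an int.

    Handles both the legacy scheme ("1.8.0_265" means Java 8) and the
    scheme used from Java 9 onwards ("11.0.8" means Java 11).
    """
    match = re.match(r"(\d+)(?:\.(\d+))?", version_string.strip())
    if match is None:
        raise ValueError("unrecognised Java version: %r" % version_string)
    first = int(match.group(1))
    second = match.group(2)
    # Legacy scheme: a leading "1." is followed by the real major version.
    if first == 1 and second is not None:
        return int(second)
    return first


def is_supported(version_string, allowed=(8,)):
    """Check a version string against a hypothetical set of supported majors."""
    return java_major_version(version_string) in allowed


print(java_major_version("1.8.0_265"))  # legacy format
print(java_major_version("11.0.8"))     # Java 9+ format
print(is_supported("14"))               # not in the default allowed set
```

The version string would typically come from running `java -version` or from the `java.version` system property; that part is left out here to keep the sketch self-contained.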

I

I think Datomic still runs on 8, right?

C

Yeah, I don't know, maybe it only

D

runs on 8, yeah.

C

It's a big possibility: yeah.

N

So from where I sit, if anybody wants ideas for workshops: one thing I mentioned earlier, connecting to SQL Server 16 or 19; or setting up a notebook or Notespace-style interactive environment; or any kind of integration with libpython-clj and just getting that basically working. I know there's some readme stuff there, but even little gotchas, like "I didn't get this version of something right", would help unblock me.

N

So those are three specific things. I'm not going to work through them this week, so anything that could be done there, I think, would help unblock me.

C

Yeah, there's definitely stuff that I put together, like

C

these documents. You can save these things and push them out to GitHub, and then just load them from any given running instance of the thing, and there are definitely notebooks out there that do JDBC, next.jdbc stuff. I've used it mostly on MySQL and Postgres, so I haven't actually used it on SQL Server, but I actually think next.jdbc is a really solid piece of work, and it's not very high level.

C

So it's unlikely that you'll need anything more than just the adapter for the database. That's probably the only thing.

N

That was the question, you know: did you get mssql-jdbc 7.3.1? It took a few days to figure out whether it was that or a version issue.

C

Oh well, yeah, okay, who knows. And, you know, I have documents that run with ClojisR, but, as Chris points out, these extra environment things having to do with other ecosystems are probably much more painful than anything associated with Clojure or the JVM.

C

You know, with R I've had pretty good luck, except on the Mac. The Mac... Apple has just become...

C

I actually think I hate Apple now, and I probably would be more willing to go with Windows, except that everybody in the labs uses Mac.

D

So I I have to.

C

Kind of stick with the Mac. But R on Linux is dead easy. It just works. You don't have to do anything; on Unix it's a complete no-brainer.

A

Yeah, maybe I'll comment about that. I'm part of an R team. We are maybe seven or eight people, and except for a couple of us, everybody uses Macs. The idea was to have a standardized environment so that we could avoid all these setup problems.

A

But we cannot avoid them. We use Docker as much as possible, but there is so much about setup. It takes a decent percentage of our day just to cope with installing R packages and making sure that things are unified and standardized.

A

It is never-ending, and I am amazed by how much easier it is on the JVM by default.

C

Oh yeah, I mean, the difference is night and day. It's just crazy, the difference. And Python is way worse than R. It's beyond bad.

G

Yeah, I think there are a lot of fundamental assumptions baked into Python and how it thinks about libraries. I think the original sin is just that

G

you cannot install multiple versions of a library in the same environment, which is so many decades ago now. Whether you look at Clojure, or Java, or npm and JavaScript, we just don't have these problems anymore, and they still haven't sorted it out. From that we get virtualenv and all these other challenges. So yeah, it's a challenge.

G

The other thing that has occurred to me: just recently we had this Docker environment working, and all of a sudden someone tried to run it and it stopped working. It all came down to one of these pip packages having updated, and now it was compiling against a different lower-level system library, and nothing was working because it couldn't bind to that.

G

So you can tell people to just include the versions of whatever pip packages you want to install, but there's something really nice about languages, or environments, I guess, that force you to include your versions when you make your declarations. It's something I've come to really appreciate with Clojure.

G

You just cannot specify your dependencies without saying what versions you want to point to, and it forces us all into the habit of, oh, now this thing's repeatable. It doesn't solve the problem once we start trying to interface with the rest of the world, but it's something nice that we have at least in our core ecosystem.
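
The habit described here, never declaring a dependency without a version, can be approximated on the pip side with a small check. The following is a minimal sketch in Python, assuming a plain requirements-style list where an exact pin is written with `==`; the function name is hypothetical, and the parsing is deliberately simplified (it does not handle URLs, extras, or environment markers).

```python
def unpinned_requirements(lines):
    """Return the requirement entries that are not pinned with '=='.

    Blank lines and '#' comments are ignored. Range specifiers such as
    '>=' count as unpinned, because they do not name one exact version.
    """
    unpinned = []
    for raw in lines:
        # Drop trailing comments and surrounding whitespace.
        entry = raw.split("#", 1)[0].strip()
        if not entry:
            continue
        if "==" not in entry:
            unpinned.append(entry)
    return unpinned


# Illustrative input; the package versions are examples only.
example = [
    "numpy==1.19.1",
    "pandas",
    "scipy>=1.5  # a range, not an exact pin",
    "",
    "# a comment line",
]
print(unpinned_requirements(example))
```

A check like this could run in continuous integration and fail the build when the returned list is non-empty, which gets closer to the "you cannot declare it without a version" behaviour.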

G

So I'm interested in thinking about this. I don't know if Docker is the only answer, and we just have to make a container for every possible configuration of environments we might want, or whether there's some possibility of us bringing a little bit of that sanity that we have in the Clojure world, maybe not to these other worlds by themselves, but at least to the way that we interface with them. It'd be sort of wonderful

G

if we had an extension of something like a deps.edn where you could specify, say, the Python packages you wanted, or the R packages you wanted, or whatever, and maybe it even forced you to specify the versions, so that we get a little bit more of that repeatability.

G

I don't know, then maybe you would have something that wrapped virtualenv, or constructed Docker images for you, or whatever, but it's interesting to think about how we could improve upon that situation.
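
To make the idea a little more concrete, here is a minimal sketch in Python of what "force you to specify the versions" could look like. It is purely illustrative and not an existing tool: the manifest shape (a mapping from package name to version), the function name, and the package versions are all assumptions made for the example.

```python
def pinned_lines(packages):
    """Turn a {name: version} mapping into sorted 'name==version' lines.

    Raises ValueError if any package has no version, mirroring the idea
    that a dependency cannot be declared without one.
    """
    missing = sorted(name for name, version in packages.items() if not version)
    if missing:
        raise ValueError("missing versions for: " + ", ".join(missing))
    return ["%s==%s" % (name, packages[name]) for name in sorted(packages)]


# Hypothetical manifest; versions are illustrative only.
manifest = {"umap-learn": "0.4.6", "numpy": "1.19.1"}
for line in pinned_lines(manifest):
    print(line)
```

The resulting lines could then be handed to pip inside a virtualenv or a Docker build, which are the two wrapping options mentioned above.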

C

Another thing that I've looked into a little bit recently, and that I've talked to some other people about, for data science stuff in particular, is Singularity instead of Docker. It is geared more towards just a big blob that will run your stuff, with everything in it, whereas Docker is a little bit more of a DevOps kind of a thing.

G

Yeah. When I was last working at the Hutch, some folks there were starting to use it in the bioinformatics space, and they seemed to find that it was a good fit for that.

C

That singularity.

C

Yeah, because, especially in that space, you know, I have this thing that one of the PIs wants to start pushing out to other labs, and it was like, wow, it's an environment problem, because there are zillions of these, typically C++, packages, and you're trying to build all of these. You can't run it across operating systems. You can build all the binaries, and it might work, you know, if you build it on Ubuntu, across different versions of Ubuntu, but it won't run on CentOS.

O

Can I jump into that question about...

C

Building this one bundle that has everything as a binary: that's what the Singularity thing seems to be about.

O

Can I ask the stupid question of what Singularity is and how it's different from Docker?

C

Maybe I can get a link and put it in the chat. It's a container thing, and so in essence it's like Docker, but its emphasis is different.

G

Yeah, I wish I had a better answer. I never dug too deep into it, but I guess it seems more flexible, I think, and I'm trying to remember what it was that was making folks switch to it at the Hutch.

G

It might have had something to do with ease of access to, say, local data files, or something like that. So crossing the file system boundary was something that was a little bit easier. That might have been one of the main appeals of it.

C

It's definitely one, that's for sure. What is...

I

So it's a government organization? It says lbl.gov for Singularity.

C

I think it started out as that; now it's more of an org thing or something. I've got it here, if I could just get this stupid Mac to do the right...

A

Yeah, so I guess let us continue for a few more minutes, if that is okay. I will have to leave soon, and everybody is invited to stay. Is there any other topic that we wish to discuss today?

G

Can I add one thing for Taylor's question? One thing I remember also being an issue is that if you're on a shared compute system where you don't have sudo access, it can sometimes be difficult to get Docker set up to work in those environments. I was just skimming through the docs here, and that's one of the other things I think was appealing to folks about Singularity: you can set it up with lower permissions.

C

Thanks. I put a comparison between Docker and Singularity in there, I think. Let's see if it works. Yeah, it looks like it's going through. There's a bunch of other stuff on it too, but it should give you a bit of an overview.

A

Maybe it is an opportunity to mention that there are some smaller alternative systems for creating reproducible systems. One of them is Guix, which we have discussed in one of our previous meetings.

A

Guix is a package manager written in Guile Scheme. It is based on functional programming principles and allows for very flexible ways to build systems. There is also a new project called Hermes, which is like Guix but written in Janet, a lightweight, Clojure-like language. All these, I guess, are worth looking into, even though they are maybe less popular. So I'll share the link; I'll just send a link to the chat. One project I'm sharing is Guix. Oh, yeah.

A

Thank you. Another is Hermes, which I'll find in a moment. It is something really new, and it is like Guix but written in a Clojure-like language called Janet.

A

That is Hermes, and I guess at least it could be interesting to learn and see how people are building this kind of system based on the principle of immutable data structures in Lisp.

H

Daniel, I cannot see the link in the chat. Oh.

A

Oh, that is interesting.

G

No, press harder. Press enter.

R

It is, thank you. Yes, that worked.

A

Okay, is there anything else that we wish to discuss today?

I

One quick question: what are people using for machine learning and deep learning work these days? The ones that I found were things like DL4J or Apache MXNet, but again, the Clojure versions of those are either not available or broken.

I

I think some of the older versions are available. So what are people using for this kind of work?

A

Are you talking specifically about deep learning?

I

Yeah, deep learning- and uh you know mlb or any of those kind of things.

G

Yeah, so there's the Smile library, I forget what it stands for, but it's one of the things that's backing tech.ml.dataset and, well, the entire tech.ml stack, and it has some great stuff in there. I was really surprised to find out that he had actually implemented UMAP, which was implemented in Python, so I thought it would be forever until we had something native in Java, and I was sort of stunned to find that he had done the work there.

G

um There's also um yeah. uh There's.

C

I agree with you, Chris, that the Smile thing via tech.ml was really nice.

G

Yeah, there's also, of course, libpython-clj. Now we have access to anything that we'd have in Python, and we've seen some great examples from Carin Meier, who was on earlier, of connecting to the original UMAP and, I think, Keras and some other Python libraries. She's also working on the Clojure bindings for MXNet.

I

What I find about that area is that it seems to be there, but when you actually go to the websites and check them out, the Clojure versions are not there.

G

Hmm, yeah, she's dropped off now, so...

G

To my knowledge, she's still working on that, and at some point they ended up featuring Clojure on the main MXNet page as a supported language. I think one of the arguments some folks, at least, have been making about the value proposition there with MXNet is that, because it's an Amazon-sponsored library,

G

you know that it's going to run well on Amazon Web Services. So this deployment path issue, getting from your development environment, or your data science exploration environment, to something that can be productionized, is going to be a well-trodden path. That's one argument that I've heard folks make for the MXNet route.

G

I think it kind of depends on what specifically you want to do. I think there's an extent to which the deep learning stuff has taken a lot of air out of the room from other really interesting stuff that's going on in the machine learning space.

G

So, you know, there's stuff like, again, UMAP, and there's this other interesting dimensionality reduction algorithm called TriMap, which, again, Carin Meier pointed out to us. There's a lot of stuff out there. It's the second of the dimensionality reductions I showed off in the presentation.

G

Let me see if I can get the Python thing for it. I'll share the Read the Docs page, and there's a link there to the academic publication, which I highly recommend anyone read who's interested in math, because it's a fascinating paper.

G

It gets into a ton of really cool category theory and topology and stuff, geometric algebra, or, what is it, algebraic geometry. It was sort of refreshing. I feel like so much of machine learning ends up being just very matrixy, and it's really cool to read a paper that's got a bunch of really interesting category theory and stuff.

A

Yeah, and maybe one more link worth sharing, which I'm sharing now, is to the Deep Diamond project, which is a work-in-progress Clojure library for deep learning by our friend Dragan.

G

Yeah, I'm really excited to see what he's doing with that.

C

Yeah, I totally agree. It's kind of amazing how it's come along.

I

Yup. Is anybody using it?

C

I am not directly using that, but I do use Neanderthal, and that's pretty low-level stuff, right?

C

It is matrix stuff, but there is a fair amount of high-level stuff in it.

C

Yeah, in general, though, if you're going to try to use Neanderthal to build a deep neural net, or just a neural net, then you have to do what Dragan is doing in Deep Diamond. So it's more of the low-level stuff. But I used it in doing some large-scale entropy calculations across a bunch of genomes.

C

And I doubt that there's anything else out there, other than maybe TVM, that's faster. Forget about, you know, NumPy or any of this stuff.

G

Yeah, he's put out some really impressive benchmarks. It's been really cool to see what wizardry he's been cooking up.

C

Yeah, it really is kind of amazing, and it's so nice in the sense of how you can do this. You can even do GPU in the REPL, just sort of

C

Out, it's really impressive stuff.

M

Also impressive is that Dragan has made his work function not only with Intel hardware but also with Nvidia and AMD hardware, so that if you're not aiming at Amazon Web Services, if you're aiming at desktop or laptop computers, just about any desktop or laptop computer might work nowadays, that is, with a certain level of computing functionality in the GPU.

C

Yeah, I think that's right. Actually, we use it, though, where some of our servers are GPU servers, and it works there.

I

So how about doing some kind of workshop where you take what Dragan has done, his libraries and things like that, and do something fairly simple, like word2vec or doc2vec or NLP kind of stuff? That would help a lot of us who have not used these things to get up to speed with them, and it would also be a good tutorial on the whole data science pipeline

I

as well. Does that sound reasonable?

K

It sounds like a great idea.

I

And maybe, you know, Chris, you mentioned Smile, right? So, something like that too. I think the challenge that any Clojure developer faces is which library to use, right? Even with dates and times:

I

do you want to use Java time, or which one? So, something just using Smile and Neanderthal or any of those other libraries. You know, maybe one week do this one, one week do the other one, on the same data set and for the same outcomes, and see how they work out.

C

Yeah, I think one thing is that, I mean, Chris just mentioned the tech.ml stuff. This is a great umbrella. Actually, I haven't tried this myself just yet, but he announced how he's folded Neanderthal into tech.ml as well. That, I think, is going to be something that I'm really going to use. It's just a great idea, and it's super cool that tech.ml sits over the top of Apache Arrow now, and Smile, and all that. It's a very nice thing, and if you put that together with Tablecloth, you get this dplyr-like API that you might be familiar with from R.

G

Yeah, and to add to that, it also has really good interlinking with the libpython-clj stack, which he also wrote, so not entirely surprising there. But yeah, 100%, I echo everything John just said, and in fact that's the reason why, in this new sort of analysis notebook stuff that I was showing off,

G

we're going the tech.ml route: we recognize that if we go that route, we have quick access to all of the libpython-clj stuff, as well as Smile, because it's built on top of Smile, and again the connection to Neanderthal. So those three, and the Parquet support and so on, all those things collectively together,

G

I think, are rapidly becoming a sort of go-to place for, you know, getting your data in and being able to do a lot of stuff with it pretty quickly.

C

Yeah, I'm moving everything that I can over to it. With the data sets that don't even fit in memory, the capability it has, I mean, it's a game changer. And is Anthony still here? Maybe not. But, you know, with Geni, that was really cool, because he included the tech.ml stuff in that benchmark, and the thing that just floored me was that even the very idiomatic Tablecloth, you know, dplyr-like API version in tech.ml was within a tick or so of data.table in terms of raw speed. And data.table, for me, has always been sort of the gold standard: there's not going to be much out there that's faster than that.

C

And it's just a lot nicer to use.

A

Is there any other topic that we wish to discuss today?

C

I think I'm probably done.

N

How are you going to decide the date and the time of day for the next meeting?

A

Yeah, so I guess for next time we will have to do some kind of survey among the people who are not here, naturally, so that we see what would be comfortable for the people who couldn't make it at these times. Certainly we need to diversify the times across the day, and we just have to have enough meetings so that everybody can attend sometimes.

A

So does it seem like a good time to say goodbye?

A

Yeah, so thank you, everybody. It is really amazing, so many people here, well into the middle of the night or early in the morning, and...

N

The returning Lisper.

A

Yeah and- and thank you so much for this meeting- it is really heart.

L

A

And see you again.

L

A

And let us talk.

L

Absolutely thanks everyone. Thank you, bye-bye. Thank you very.

L