National Energy Research Scientific Computing Center (NERSC) Jupyter Community Workshop June 11-13, 2019, 12 Jun 2019

From YouTube: 15. Jupyter Dataset Registry Discussion

Description

June 12, 2019 Jupyter Community Workshop talk by Brian Granger, Cal Poly State University

A

Hi, everyone. Hey, [unclear], how long do you want us to take up?

C

Oh, I think we have five to ten minutes for [unclear].

B

Oh, you're standing between us and lunch.

A

You're standing between me and lunch, so make it quick. So, Saul, I'm going to present a slide just to introduce it, and then, if you want to hop on and do the demo, that way we can keep it really short. I'm going to share a slide. Yeah, sure. All right, can everyone see that? So where this came from is, obviously, data sets are a first-class entity in scientific computing, data science, and AI.

A

However, until now, data sets have not been known to Jupyter broadly. They're known to the Python, R, Scala, or C code running in a Jupyter kernel, but the Jupyter system itself knows nothing about data sets, and this has been creating a lot of challenges for us as people build extensions, for JupyterLab in particular, that work with data sets. So, for example, Saul, Grant Nestor, and a master's student of ours have created tools that can do data visualization using Voyager or Plotly in JupyterLab, and those tools need to get tabular data sets into them.

A

Notebooks have tabular data sets in the form of data frames. There may be tabular data sets on the file system. There may be one in a SQL database. And we were having to start to write a lot of really brittle, basically N-squared type of code, where this data visualization tool knows how to pull tabular data sets out of notebooks, and this one knows how to pull them out of CSV. So to address this, we wrote a grant.

A

This is funded by the Schmidt Foundation, and it is in collaboration with NYU and also Saul and others at Quansight. So we're building a dataset registry API for Jupyter, and in particular JupyterLab.

A

So this is in the JavaScript front-end side of things, and here a dataset is basically a URL, a MIME type that says how that URL should be interpreted in terms of the type of data, and then the JavaScript equivalent of a void star. We're also extending it to include notions of hierarchy and search, and this will all be extensible in a similar way to the Jupyter output system.

A

That is also MIME-type based, and on top of this there's a set of conversion APIs that basically know how to map between different MIME types in an efficient manner. The idea here is that someone may write an extension that knows how to work with tabular data sets, but there are dozens of ways that tabular data sets can be encoded. It can be a CSV file, a URL that points to a CSV file, a table in a SQL database, or a JSON file.

A

So JupyterLab extensions can register converters between different MIME types, and then the overall system knows how to traverse that graph and get you from a source to a target MIME type, so that individual extensions only need to say, "Hey, I know how to consume this MIME type," and then the data registry and the converter APIs can be responsible for figuring out whether we can get this data set into that needed MIME type.

A

If you're familiar with odo on the Python side, there are lots of similar ideas in this. To emphasize, our notion of data here is entirely abstract and would include any possible notion of data, ranging from files to remote endpoints to data APIs that expose larger-than-memory datasets: essentially, anything that you could possibly imagine could be a data set in this context. A key point is that this is not a data catalog.
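[Editor's sketch] The model described above (a data set as a URL, a MIME type, and an untyped payload, plus converters registered between MIME types and a search over the resulting graph) might look roughly like the following. The actual registry lives in the JupyterLab JavaScript front end; this is written in Python only for brevity, and every name here (`Dataset`, `ConverterRegistry`, `find_path`, the `application/x.rows` MIME type) is a hypothetical chosen for illustration, not the project's API.

```python
from __future__ import annotations

import csv
import io
import json
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical names throughout; this is NOT the JupyterLab data registry API.


@dataclass(frozen=True)
class Dataset:
    url: str        # identifies where the data lives
    mime_type: str  # says how `data` should be interpreted
    data: Any       # untyped payload (the "void star" from the talk)


Converter = Callable[[Any], Any]


class ConverterRegistry:
    def __init__(self) -> None:
        # source MIME type -> {target MIME type -> converter function}
        self._edges: dict[str, dict[str, Converter]] = {}

    def register(self, source: str, target: str, fn: Converter) -> None:
        """Add one directed edge to the conversion graph."""
        self._edges.setdefault(source, {})[target] = fn

    def find_path(self, source: str, target: str) -> list[str] | None:
        """Breadth-first search for the shortest chain of MIME types."""
        queue = deque([[source]])
        seen = {source}
        while queue:
            path = queue.popleft()
            if path[-1] == target:
                return path
            for nxt in self._edges.get(path[-1], {}):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None

    def convert(self, dataset: Dataset, target: str) -> Dataset:
        """Apply each converter along the path from the source to the target type."""
        path = self.find_path(dataset.mime_type, target)
        if path is None:
            raise LookupError(f"no conversion from {dataset.mime_type} to {target}")
        data = dataset.data
        for src, dst in zip(path, path[1:]):
            data = self._edges[src][dst](data)
        return Dataset(dataset.url, target, data)


# Made-up MIME type standing for "a list of row dicts held in memory".
ROWS = "application/x.rows"

registry = ConverterRegistry()
registry.register("text/csv", ROWS, lambda text: list(csv.DictReader(io.StringIO(text))))
registry.register(ROWS, "application/json", json.dumps)

source = Dataset("file:///data.csv", "text/csv", "x,y\n1,2\n")
result = registry.convert(source, "application/json")
print(registry.find_path("text/csv", "application/json"))
print(result.mime_type, result.data)
```

With this shape, a consumer only declares the MIME type it accepts; no single extension has to know the CSV-to-JSON route, because the registry composes the two registered converters.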

A

This is a system that existing data catalogs can use to get data into Jupyter in a meaningful way. A data catalog is not required; there are other routes for getting data into the system. So the goal here is to enable this deep integration across different components within JupyterLab as concerns data. So with that, I will stop sharing, and Saul can come in from there.

B

This is a great introduction. You guys see this? Yeah, okay. So the data registry provides a bunch of hooks for extension authors, like Brian was saying, but it also...

B

Am I loud, or am I cutting out? Okay.

C

Yes, no, that's just Saul, so just speak.

B

So it also comes with a built-in UI, the data explorer, so that we can see what data sets we have registered. One thing that we've added recently is the ability to have nesting. So here we're showing the local file system as a nested data set, and we can find a data set inside of it, like this CSV file, and view it in our built-in grid viewer.
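[Editor's sketch] The nesting described above, where a container such as the local file system is itself a data set that holds other data sets, can be pictured as a small tree. This is an assumption-laden illustration, not the data explorer's real data structures: `Node`, `walk`, `find_by_mime`, and the `application/x.folder` MIME type are all hypothetical, and the URLs are invented examples.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator

# Hypothetical model; the real data explorer may represent nesting differently.
FOLDER = "application/x.folder"  # made-up MIME type for a container of data sets


@dataclass
class Node:
    url: str
    mime_type: str
    children: list[Node] = field(default_factory=list)


def walk(node: Node, depth: int = 0) -> Iterator[tuple[int, Node]]:
    """Yield (depth, node) pairs in depth-first order, parents before children."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


def find_by_mime(root: Node, mime_type: str) -> list[str]:
    """Return the URLs of all nested data sets with the given MIME type."""
    return [node.url for _, node in walk(root) if node.mime_type == mime_type]


# Invented example tree: a file system root containing two folders.
root = Node("file:///", FOLDER, [
    Node("file:///notebooks", FOLDER, [
        Node("file:///notebooks/analysis.ipynb", "application/x-ipynb+json"),
    ]),
    Node("file:///data", FOLDER, [
        Node("file:///data/iris.csv", "text/csv"),
    ]),
])

# Print an indented listing, similar in spirit to browsing a nested data set.
for depth, node in walk(root):
    print("  " * depth + node.url)

print(find_by_mime(root, "text/csv"))
```

A viewer such as a grid would then only need to ask for the `text/csv` leaves, regardless of how deeply they sit inside the container.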

B

So the idea here is that if you had a database or, for example, some other kind of data catalog, then it would show up in the data sets available to you, and users can then browse that and find actions to do with their data sets, and all of this is so that it composes nicely together. And just to add, we're working on this very actively.

B

So, if you'd like to collaborate or have more questions, we have a GitHub repo in the JupyterLab namespace, the data explorer, and we'd love to chat more in depth about these ideas.

C

I think we actually may be talking about a breakout on very similar topics, so we should coordinate on a good time, but it would be great if one or both of you could join that during either today's or tomorrow's afternoon session.

B

So tomorrow would be better for me, at least.

C

That's it. That's why I was hoping Rick from [unclear] could also be there. So maybe tomorrow would be good for that. Yeah.

A

Let's coordinate over email about a time.

A

We would love to work with others on this, because this is the type of thing where, if we can get a broad consensus in the community that this type of approach makes sense, I think it'll really unlock a lot of different groups to begin building things that interoperate with very minimal friction.

A

Yeah, great, let's coordinate on email, then. Thank you. Thanks.
