From YouTube: CI WG demo: Arvados: A platform for storing, organizing, processing & sharing genomic big data
Description
Arvados: An open source platform for storing, organizing, processing, and sharing genomic and other big data.
Date: 11/1/2019
Presenter: Tom Morris
Institution: Veritas Genetics
South Big Data Hub
A
Without further ado, Tom Morris from Veritas Genetics is here to talk to us about the Arvados platform, and I'm really excited to hear more about this, as I think many of us on our respective campuses have heard about this platform and would be interested to know more about it. So without further ado, I'll turn it over to you, Tom; you'll be able to share your screen and then really take over from that point.

B
Cool. It's a modern architecture, which I'll go into in a little bit more detail. It's built from the ground up for federation, which is something that we think is very important for these types of applications; one of the things about big data is that it tends not to be very feasible to be shuffling the data around, although I know you guys have a very fancy system for being able to do that.
B
It supports all three native cloud platforms, as well as on-premise HPC clusters, and can be used in a combination of those: we have customers that have migrated from one to the other, or that use both, as well as customers using multiple cloud vendors. There's workflow portability across all those platforms, and a uniform API across all of them, so everything at the layer above Arvados hides all of the differences underneath. As I mentioned, it was designed to deal with genomic data.
B
There's a query engine called Lightning, which is a combination of a query engine and compression technology that compresses human genomes, or any genome actually, down to a very compact representation that's quick to query and is in a format that's amenable to machine learning.
B
Those APIs are offered through a variety of SDKs supporting Python, Go, Java, Ruby, and R, and there's also a set of command line utilities that can be used from shell scripts. Then on the right you see the web interface that we have as well. That's the current generation; we've got a next-generation web interface that's in beta now, which I will show you a shot of later. The whole system is designed to be easily extensible, and the APIs can be used either to extend it or, as I mentioned, to integrate with it.
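To make that concrete, here is a minimal sketch of a client session through the Python SDK. The calls follow the SDK's documented client pattern, but check the current docs for your version; credentials are read from the standard ARVADOS_API_HOST and ARVADOS_API_TOKEN environment variables.

    import arvados

    # Connects using ARVADOS_API_HOST / ARVADOS_API_TOKEN from the environment.
    api = arvados.api('v1')

    # Identify the current user.
    me = api.users().current().execute()
    print(me['uuid'], me['email'])

    # List a few collections visible to this user.
    page = api.collections().list(limit=5).execute()
    for c in page['items']:
        print(c['uuid'], c.get('name'))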
B
So, one of the things that, for instance, Veritas does: Veritas' main business is sequencing for direct-to-consumer genetic testing, and as such we have very large-scale sequencing operations, and it's important for us to be able to track customer orders as they go through the system. The APIs can be used to integrate with our various operational dashboards, to have back-end workflow processes kicked off when a new order comes in, to deliver the data when it's done, and so on.
B
The core definition language for describing workflows is called the Common Workflow Language. This is an industry-standard language that came out of one of the open-source conferences about four or five years ago, and it's something that the Arvados engineering team at Curoverse has been involved with helping standardize, both writing the specification and contributing to the reference implementation.
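For a sense of what that looks like in practice: a CWL workflow is just a text file you hand to a runner. Below is a hedged sketch of invoking one from Python; `cwltool` is the real CWL reference runner and `arvados-cwl-runner` is the Arvados runner, but the workflow and input file names are invented placeholders.

    import subprocess

    # `cwltool` is the CWL reference runner; `arvados-cwl-runner` submits the
    # same workflow to an Arvados cluster instead. File names are placeholders.
    subprocess.run(
        ["cwltool", "variant-calling.cwl", "inputs.yml"],
        check=True,  # raise CalledProcessError if the workflow fails
    )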
B
That's used as the core of a lot of the other implementations. As you can see from the list of participating organizations, it spans both commercial organizations, including some of our competitors, as well as academic organizations and research institutions, so it's got very broad support, and the ecosystem is continuing to grow. One of the things that I think is really powerful about that is it provides for a community where you can share workflow definitions and share tool wrappers. I just saw something pop up here, someone chatting to me.
B
So, to go through the components I showed you before in a little bit more detail: the storage layer is called Keep. Everything in the system is content-addressed, which means that all of the data is run through a cryptographic hash, which produces a small, unique value that can be used to address it; all addressing in the system is done using that. One of the things that provides is automatic deduplication: you can never have two copies of any piece of data.
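A toy sketch of that idea in Python. This illustrates content addressing and deduplication in general, not Keep's actual format: Keep addresses fixed-size blocks by MD5 and describes files with manifests of block locators.

    import hashlib

    # Toy content-addressed store: a blob's address is its hash, so writing
    # the same bytes twice is a no-op (automatic deduplication).
    store = {}

    def put(data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        store.setdefault(address, data)  # a second copy costs nothing
        return address

    a = put(b"ACGTACGT")
    b = put(b"ACGTACGT")               # "new copy" of the same data
    assert a == b and len(store) == 1  # one stored copy, two references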
B
If someone tries to create a new copy, the system will recognize that it's already got that data and will just use the existing copy. It also means that creating copies is a very efficient process, because it's just a matter of moving pointers and incrementing counters. That content addressing is also used to support some of the provenance features that I'll talk about on the next slide. Keep can be backed by either cloud object storage or a traditional file system on an HPC cluster, including cluster file systems.
B
So it can be backed by, you know, either a single file system or a cluster file system, and it scales up to petabytes. There are kind of two levels of abstraction here. One is a hierarchy of projects and sub-projects, which can be nested arbitrarily deeply; those are the basic unit for sharing data.
B
So all data is private by default, but you can share a project and its children with individuals or groups of individuals, and they can be granted either read access, read/write access, or manage access, which allows them to share it onward with other people. So there's a very fine-grained permission system there, which is useful for managing access to data. The other basic abstraction is the collection, which is basically a virtual folder that contains a bunch of files.
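For illustration, granting access programmatically looks roughly like this through the Python SDK. Arvados models grants as "permission" links (can_read, can_write, can_manage) between the grantee and the shared object; the UUIDs below are placeholders and details may vary by version.

    import arvados

    api = arvados.api('v1')

    # Grant read access by creating a permission link: tail_uuid is who gets
    # access, head_uuid is the shared object. UUIDs here are placeholders.
    api.links().create(body={'link': {
        'link_class': 'permission',
        'name': 'can_read',  # or 'can_write' / 'can_manage'
        'tail_uuid': 'zzzzz-tpzed-000000000000000',  # a user or group
        'head_uuid': 'zzzzz-j7d0g-000000000000000',  # a project
    }}).execute()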
So the workflow manager, Crunch, is built for reproducibility.
B
It uses the content addresses that are maintained by the storage layer, as well as the content addresses that are maintained by Docker, to provide very strong provenance. All of the software is containerized, and all jobs run inside Docker, so by looking at the content hashes for those containers and the content hashes for the inputs, you can easily tell the exact constituents of any output that you got; you're able to trace from the outputs all the way back to the various levels of input.
B
The other thing you can do is use that for smart job reuse. So if you're on step 37 of a 40-step process and the system fails, whether due to a bug in your workflow, a glitch in the cluster, or a glitch in the cloud, then rather than restarting from the beginning, or having to manually slice up your workflow and run it in pieces, the system automatically knows that it can start at the failed step and skip all the previous steps.
B
By looking at the content hashes of the inputs and the content hashes of the Docker containers, and seeing that they're the same as a previous run, it knows that, as long as the computation is deterministic, it can reuse the outputs without having to recompute them. That's very useful in a development environment, where you want to be able to iterate very quickly: fix a bug, restart, fix the next bug, and keep going.
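Conceptually this is memoization keyed on content hashes. A toy sketch of the idea, not Arvados' actual scheduler code:

    import hashlib
    import json

    # Toy job reuse: a step's identity is the hash of its container image
    # plus its inputs. Repeating the exact same combination returns the
    # cached output instead of recomputing (valid only if the step is
    # deterministic).
    _cache = {}

    def run_step(image_hash, inputs, compute):
        key = hashlib.sha256(
            json.dumps({'image': image_hash, 'inputs': inputs},
                       sort_keys=True).encode()
        ).hexdigest()
        if key not in _cache:
            _cache[key] = compute(inputs)  # only runs on a cache miss
        return _cache[key]

    first = run_step('sha256:abc', {'reads': 'hash1'}, lambda i: 'vcf-out')
    again = run_step('sha256:abc', {'reads': 'hash1'}, lambda i: 'vcf-out')
    assert first == again  # the second call reused the stored result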
B
The other thing this does, for cloud installations, is dynamically scale the compute capacity up and down. If you have a workflow that's parallelized across all of your 23 chromosomes, you can spin up the pieces of it; if it's parallelized across 100 samples, you can spin up a hundred computers, and basically get as much compute capacity as you need. One of the nice things about cloud pricing is that it's basically linear, so you can get lots of little computers.
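The scatter pattern described here looks schematically like the following toy sketch, with one local worker standing in for each per-chromosome cloud node; `call_variants` is a made-up placeholder for a real pipeline step.

    from concurrent.futures import ProcessPoolExecutor

    # One worker per shard, the way a cloud cluster would spin up one node
    # per chromosome (or per sample).
    CHROMOSOMES = [str(n) for n in range(1, 23)] + ['X']  # 23 shards

    def call_variants(chrom):
        return f'variants for chr{chrom}'  # placeholder for real work

    if __name__ == '__main__':
        with ProcessPoolExecutor(max_workers=len(CHROMOSOMES)) as pool:
            results = list(pool.map(call_variants, CHROMOSOMES))
        print(len(results), 'chromosome shards processed')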
B
I talked a little bit about provenance; this is kind of a graphical view of what a provenance graph looks like. The rectangles here are data collections and the ovals are compute processes, so you can see how you can trace from any output back through the computation that produced it and back to the original inputs, including not only your sample data but any reference data that was used, as well as the actual software that was used.
B
Genomic data is information which is privacy sensitive, so there are often trans-border data laws that prevent you from moving things across national boundaries, or just organizational issues where you can't move things across organizational boundaries. So one of the things Arvados allows you to do is push workflows out to remote clusters and, if you have the appropriate privileges, run the computation there.
B
Some of the recent features I just wanted to highlight, for people who are familiar with earlier incarnations of Arvados: support for storage tiers, so you can roll things off to cool and cold storage tiers to save money; support for spot instances, which have much cheaper pricing; and distributed workflows, so in addition to being able to have a single workflow that you push to a remote cluster, you can have a distributed workflow that does some of its work remotely and some of its work locally and stitches all of that together. All of that is supported by federated identity across all the clusters, which allows you to easily manage your sharing controls and what access you want to grant people. Having a federated identity doesn't give you any additional rights per se, but it does give you a common identity across all of the clusters, so you can use it for setting up your various roles and sharing.
B
There's support for versioning of collections, so if you change the metadata associated with them or change the contents, you can go back and look at previous versions and see what was there. There's the new version of the Arvados web interface, which we call Workbench 2, and also Python 3 support.
B
So this is actually what the Workbench 2 beta looks like. If you're familiar with Google Drive at all, you can see it's kind of similar to that. It's a much more modern implementation: the original Workbench is a Ruby on Rails app, while this is a single-page React app, designed from the ground up with a cohesive user experience, and it's much more performant and responsive.
B
So, just in terms of the experience that Veritas has had (and the engineering team here has been working on this since 2006): we have clusters across a number of different continents, with more coming online, and petabytes of data under management. As I mentioned, Veritas uses it for all of its production work, and there are a number of large companies that have multiple clusters spread across multiple continents and use it to support their day-to-day operations. This is kind of a recapitulation of the stuff that I talked about before.
A
Excellent, thanks, Tom. I've got a question, maybe to lead things off. Well, first of all, thank you very much; that was a great overview. I'm interested in this notion of federation, and you know, it certainly addresses some of the issues that I would say a number of domains experience, not just genomics. In the case of the federation implementation you have...
B
Absolutely. So in the distributed workflow case, if the data is remote and you have access to it, it'll actually be fetched from the remote system. The current implementation is that it's fetched during processing and then cached locally, but you could certainly imagine scenarios where you do more sophisticated things. We've kind of resisted going crazy with pre-staging data and doing fancy optimization until we see how customers are using this in earnest; some of these features were added to the system relatively recently.
B
So, generally, you would want to run your workflow where the data is positioned, but in some cases, like reference data, you probably want multiple copies of it, scattered everywhere. And one of the things the system knows, because the workflows are completely self-contained in terms of what scripts are being run and what reference data is being used, is how to copy all of the reference data along with a workflow when you move it from place to place.
B
So everything's done over SSL, so it's encrypted in flight, and the permission system requires you to have read access to the source and write access to the destination. That's one of the things about having federated identity: the remote system could grant you write access to a particular project hierarchy and you could put your stuff there, but without that you're not going to be able to write anything there.
A
Gotcha. And have you found, are there many people using this federated component of the system, just dealing with that federated identity side? I'm just curious, because this is something at universities we often run into, and NSF has done a lot to try to help us, you know, by promoting things like InCommon and so forth. But do you run into issues across the identity management sphere?
B
So most of the federation setups that we have now are owned by a single organization, so the federation is used to deal with geographic diversity as opposed to organizational diversity, so we don't have a ton of experience with that. The other thing I guess I should say on this front is that there's a bunch of standards work going on in the GA4GH, the Global Alliance for Genomics and Health.
B
They're looking closely at federated identity issues as well, so we're tracking that work very closely to be able to fit in with it. They have concepts like research passports that can be used to help support data access committees' decisions on whether or not access should be granted, and things like that. So there's a lot of policy machinery, in addition to the technical machinery, that's needed for these things.
C
A question in a different direction. You said at the beginning that the platform was being used kind of outside of life sciences, but clearly your origins are life sciences, and all the examples that you gave were life sciences. There's always a tension between the desire to serve everyone and the desire to do one thing well. I'm wondering if you could comment on how well the platform has generalized, but also kind of the snags that you've run up against as people have attempted to use it outside of its originally intended domain.
B
No, it doesn't use Hadoop or MapReduce-style computation currently. Most of the genomics tools are designed to run standalone; often they're multi-threaded. There is some movement toward Spark-based versions of some of the tooling, and we're tracking that, but the field is a little bit old-school in that respect, and the MapReduce style of computation doesn't apply as readily to those tools.
A
Other questions? Jump in; otherwise I'm going to ask one about the analysis interface. Some of the layout actually looked a bit Jupyter-like. Have you had any interest from people in actually using this environment to support, say, Jupyter-notebook-like interfaces for developing notebooks and other shareable products?
B
There is, there has been a bunch of interest in that, and some people have done some work on it; I haven't looked at it recently. It's an area where we're interested in doing tighter integration. In a similar vein, Workbench 2 has much better support for pluggable viewers for different data types, which also helps: you can wire up different viewers for different data types in the system and be able to use those easily.
B
So in that case they would need a BAA with their cloud vendor, which the vendors offer, and we will also work with them. If they're doing drug development and, you know, need higher levels of certification and so on, that doesn't apply directly to us, but we would support them in that, because those certifications typically involve a bunch of training, procedural, and other requirements. So we can...