From YouTube: CI WG demo: Arvados: A platform for storing, organizing, processing & sharing genomic big data
Description
Arvados: An open source platform for storing, organizing, processing, and sharing genomic and other big data.
Date: 11/1/2019
Presenter: Tom Morris
Institution: Veritas Genetics
South Big Data Hub
A
Without further ado, Tom Morris from Veritas Genetics is here to talk to us about the Arvados platform, and I'm really excited to hear more about this, as I think many of us on our respective campuses have heard about this platform and would be interested to know more about it. So without further ado, I'll turn it over to you, Tom; you'll be able to share your screen and then really take over from that point.

B
Cool. It's a modern architecture, which I'll go into in a little bit more detail. It's built from the ground up for federation, which is something that we think is very important for these types of applications; one of the things about big data is that it tends not to be very feasible to be shuffling the data around, although I know you guys have a very fancy system for being able to do that.
B
It supports all three native cloud platforms, as well as on-premise HPC clusters, and can be used in a combination of those: we have customers that have migrated from one to the other, or that use both, as well as customers using multiple cloud vendors. There's workflow portability across all those platforms, and a uniform API across all of them, so everything at the layer above Arvados hides all of the differences underneath. As I mentioned, it was designed to deal with genomic data.
B
There's a query engine called Lightning, which is a combination of a query engine and compression technology that compresses human genomes, or any genome actually, down to a very compact representation that's quick to query and is in a format that's amenable to machine learning.
B
Those APIs are offered through a variety of SDKs supporting Python, Go, Java, Ruby, and R, and there's also a set of command line utilities that can be used from shell scripts. Then on the right you see the web interface that we have as well. That's the current generation; we've got a next-generation web interface that's in beta now, which I will show you a shot of later. The whole system is designed to be easily extensible, and the APIs can be used either to extend it or, as I mentioned, to integrate with it.
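To make that concrete, here is a minimal sketch of a client session through the Python SDK. The calls follow the SDK's documented client pattern, but check the current docs for your version; credentials are read from the standard ARVADOS_API_HOST and ARVADOS_API_TOKEN environment variables.

    import arvados

    # Connects using ARVADOS_API_HOST / ARVADOS_API_TOKEN from the environment.
    api = arvados.api('v1')

    # Identify the current user.
    me = api.users().current().execute()
    print(me['uuid'], me['email'])

    # List a few collections visible to this user.
    page = api.collections().list(limit=5).execute()
    for c in page['items']:
        print(c['uuid'], c.get('name'))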
B
So, one of the things that, for instance, Veritas does: Veritas' main business is sequencing for direct-to-consumer genetic testing, and as such we have very large-scale sequencing operations, and it's important for us to be able to track customer orders as they go through the system. The APIs can be used to integrate with our various operational dashboards, to have back-end workflow processes kicked off when a new order comes in, to deliver the data when it's done, and so on.
B
The core definition language for describing workflows is called the Common Workflow Language. This is an industry-standard language that came out of one of the open-source conferences about four or five years ago, and it's something that the Arvados engineering team at Curoverse has been involved with helping standardize, both writing the specification and contributing to the reference implementation.
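For a sense of what that looks like in practice: a CWL workflow is just a text file you hand to a runner. Below is a hedged sketch of invoking one from Python; `cwltool` is the real CWL reference runner and `arvados-cwl-runner` is the Arvados runner, but the workflow and input file names are invented placeholders.

    import subprocess

    # `cwltool` is the CWL reference runner; `arvados-cwl-runner` submits the
    # same workflow to an Arvados cluster instead. File names are placeholders.
    subprocess.run(
        ["cwltool", "variant-calling.cwl", "inputs.yml"],
        check=True,  # raise CalledProcessError if the workflow fails
    )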
B
That's used as the core of a lot of the other implementations. As you can see from the list of participating organizations, it spans both commercial organizations, including some of our competitors, as well as academic organizations and research institutions, so it's got very broad support, and the ecosystem is continuing to grow. One of the things that I think is really powerful about that is it provides for a community where you can share workflow definitions and share tool wrappers. I just saw something pop up here, someone chatting to me.
B
So, to go through the components I showed you before in a little bit more detail: the storage layer is called Keep. Everything in the system is content-addressed, which means that all of the data is run through a cryptographic hash, which produces a small, unique value that can be used to address it; all addressing in the system is done using that. One of the things that provides is automatic deduplication: you can never have two copies of any piece of data.
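A toy sketch of that idea in Python. This illustrates content addressing and deduplication in general, not Keep's actual format: Keep addresses fixed-size blocks by MD5 and describes files with manifests of block locators.

    import hashlib

    # Toy content-addressed store: a blob's address is its hash, so writing
    # the same bytes twice is a no-op (automatic deduplication).
    store = {}

    def put(data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        store.setdefault(address, data)  # a second copy costs nothing
        return address

    a = put(b"ACGTACGT")
    b = put(b"ACGTACGT")               # "new copy" of the same data
    assert a == b and len(store) == 1  # one stored copy, two references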
B
If someone tries to create a new copy, the system will recognize that it's already got that data and will just use the existing copy. It also means that creating copies is a very efficient process, because it's just a matter of moving pointers and incrementing counters. That content addressing is also used to support some of the provenance features that I'll talk about on the next slide. Keep can be backed by either cloud object storage or a traditional file system on an HPC cluster, including cluster file systems.
B
So it can be backed by, you know, either a single file system or a cluster file system, and it scales up to petabytes. There are kind of two levels of abstraction here. One is a hierarchy of projects and sub-projects, which can be nested arbitrarily deeply; those are the basic unit for sharing data.
B
So all data is private by default, but you can share a project and its children with individuals or groups of individuals, and they can be granted either read access, read/write access, or manage access, which allows them to share it onward with other people. So there's a very fine-grained permission system there, which is useful for managing access to data. The other basic abstraction is the collection, which is basically a virtual folder that contains a bunch of files.
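For illustration, granting access programmatically looks roughly like this through the Python SDK. Arvados models grants as "permission" links (can_read, can_write, can_manage) between the grantee and the shared object; the UUIDs below are placeholders and details may vary by version.

    import arvados

    api = arvados.api('v1')

    # Grant read access by creating a permission link: tail_uuid is who gets
    # access, head_uuid is the shared object. UUIDs here are placeholders.
    api.links().create(body={'link': {
        'link_class': 'permission',
        'name': 'can_read',  # or 'can_write' / 'can_manage'
        'tail_uuid': 'zzzzz-tpzed-000000000000000',  # a user or group
        'head_uuid': 'zzzzz-j7d0g-000000000000000',  # a project
    }}).execute()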
So the workflow manager, Crunch, is built for reproducibility.
B
It uses the content addresses that are maintained by the storage layer, as well as the content addresses that are maintained by Docker, to provide very strong provenance. All of the software is containerized, and all jobs run inside Docker, so by looking at the content hashes for those containers and the content hashes for the inputs, you can easily tell the exact constituents of any output that you got; you're able to trace from the outputs all the way back to the various levels of input.
B
The other thing you can do is use that for smart job reuse. So if you're on step 37 of a 40-step process and the system fails, whether due to a bug in your workflow, a glitch in the cluster, or a glitch in the cloud, then rather than restarting from the beginning, or having to manually slice up your workflow and run it in pieces, the system automatically knows that it can start at the failed step and skip all the previous steps.
B
By looking at the content hashes of the inputs and the content hashes of the Docker containers, and seeing that they're the same as a previous run, it knows that, as long as the computation is deterministic, it can reuse the outputs without having to recompute them. That's very useful in a development environment, where you want to be able to iterate very quickly: fix a bug, restart, fix the next bug, and keep going.
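Conceptually this is memoization keyed on content hashes. A toy sketch of the idea, not Arvados' actual scheduler code:

    import hashlib
    import json

    # Toy job reuse: a step's identity is the hash of its container image
    # plus its inputs. Repeating the exact same combination returns the
    # cached output instead of recomputing (valid only if the step is
    # deterministic).
    _cache = {}

    def run_step(image_hash, inputs, compute):
        key = hashlib.sha256(
            json.dumps({'image': image_hash, 'inputs': inputs},
                       sort_keys=True).encode()
        ).hexdigest()
        if key not in _cache:
            _cache[key] = compute(inputs)  # only runs on a cache miss
        return _cache[key]

    first = run_step('sha256:abc', {'reads': 'hash1'}, lambda i: 'vcf-out')
    again = run_step('sha256:abc', {'reads': 'hash1'}, lambda i: 'vcf-out')
    assert first == again  # the second call reused the stored result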
B
The other thing this does, for cloud installations, is dynamically scale the compute capacity up and down. If you have a workflow that's parallelized across all of your 23 chromosomes, you can spin up the pieces of it; if it's parallelized across 100 samples, you can spin up a hundred computers, and basically get as much compute capacity as you need. One of the nice things about cloud pricing is that it's basically linear, so you can get lots of little computers.
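The scatter pattern described here looks schematically like the following toy sketch, with one local worker standing in for each per-chromosome cloud node; `call_variants` is a made-up placeholder for a real pipeline step.

    from concurrent.futures import ProcessPoolExecutor

    # One worker per shard, the way a cloud cluster would spin up one node
    # per chromosome (or per sample).
    CHROMOSOMES = [str(n) for n in range(1, 23)] + ['X']  # 23 shards

    def call_variants(chrom):
        return f'variants for chr{chrom}'  # placeholder for real work

    if __name__ == '__main__':
        with ProcessPoolExecutor(max_workers=len(CHROMOSOMES)) as pool:
            results = list(pool.map(call_variants, CHROMOSOMES))
        print(len(results), 'chromosome shards processed')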
B
I talked a little bit about provenance; this is kind of a graphical view of what a provenance graph looks like. The rectangles here are data collections and the ovals are compute processes, so you can see how you can trace from any output back through the computation that produced it and back to the original inputs, including not only your sample data but any reference data that was used, as well as the actual software that was used.
B
Genomic data is information which is privacy sensitive, so there are often trans-border data laws that prevent you from moving things across national boundaries, or just organizational issues where you can't move things across organizational boundaries. So one of the things Arvados allows you to do is push workflows out to remote clusters and, if you have the appropriate privileges, run the computation there.
B
Some of the recent features I just wanted to highlight, for people who are familiar with earlier incarnations of Arvados: support for storage tiers, so you can roll things off to cool and cold storage tiers to save money; support for spot instances, which have much cheaper pricing; and distributed workflows, so in addition to being able to have a single workflow that you push to a remote cluster, you can have a distributed workflow that does some of its work remotely and some of its work locally and stitches all of that together. All of that is supported by federated identity across all the clusters, which allows you to easily manage your sharing controls and what access you want to grant people. Having a federated identity doesn't give you any additional rights per se, but it does give you a common identity across all of the clusters, so you can use it for setting up your various roles and sharing.
B
There's support for versioning of collections, so if you change the metadata associated with them or change the contents, you can go back and look at previous versions and see what was there. There's the new version of the Arvados web interface, which we call Workbench 2, and also Python 3 support.
B
So this is actually what the Workbench 2 beta looks like. If you're familiar with Google Drive at all, you can see it's kind of similar to that. It's a much more modern implementation: the original Workbench is a Ruby on Rails app, while this is a single-page React app, designed from the ground up with a cohesive user experience, and it's much more performant and responsive.
B
So, just in terms of the experience that Veritas has had (and the engineering team here has been working on this since 2006): we have clusters across a number of different continents, with more coming online, and petabytes of data under management. As I mentioned, Veritas uses it for all of its production work, and there are a number of large companies that have multiple clusters spread across multiple continents and use it to support their day-to-day operations. This is kind of a recapitulation of the stuff that I talked about before.
A
Excellent, thanks, Tom. I've got a question, maybe to lead things off. Well, first of all, thank you very much; that was a great overview. I'm interested in this notion of federation, and you know, it certainly addresses some of the issues that I would say a number of domains experience, not just genomics. In the case of the federation implementation you have...
B
Absolutely. So in the distributed workflow case, if the data is remote and you have access to it, it'll actually be fetched from the remote system. The current implementation is that it's fetched during processing and then cached locally, but you could certainly imagine scenarios where you do more sophisticated things. We've kind of resisted going crazy with pre-staging data and doing fancy optimization until we see how customers are using this in earnest; some of these features were added to the system relatively recently.
B
So, generally, you would want to run your workflow where the data is positioned, but in some cases, like reference data, you probably want multiple copies of it, scattered everywhere. And one of the things the system knows, because the workflows are completely self-contained in terms of what scripts are being run and what reference data is being used, is how to copy all of the reference data along with a workflow when you move it from place to place.
B
So everything's done over SSL, so it's encrypted in flight, and the permission system requires you to have read access to the source and write access to the destination. That's one of the things about having federated identity: the remote system could grant you write access to a particular project hierarchy and you could put your stuff there, but without that you're not going to be able to write anything there.
A
Gotcha. And have you found, are there many people using this federated component of the system, just dealing with that federated identity side? I'm just curious, because this is something at universities we often run into, and NSF has done a lot to try to help us, you know, by promoting things like InCommon and so forth. But do you run into issues across the identity management sphere?
B
So most of the federation setups that we have now are owned by a single organization, so the federation is used to deal with geographic diversity as opposed to organizational diversity, so we don't have a ton of experience with that. The other thing I guess I should say on this front is that there's a bunch of standards work going on in the GA4GH, the Global Alliance for Genomics and Health.
B
They're looking closely at federated identity issues as well, so we're tracking that work very closely to be able to fit in with it. They have concepts like research passports that can be used to help support data access committees' decisions on whether or not access should be granted, and things like that. So there's a lot of policy machinery, in addition to the technical machinery, that's needed for these things.
C
A question in a different direction. You said at the beginning that the platform was being used kind of outside of life sciences, but clearly your origins are life sciences, and all the examples that you gave were life sciences. There's always a tension between the desire to serve everyone and the desire to do one thing well. I'm wondering if you could comment on how well the platform has generalized, but also kind of the snags that you've run up against as people have attempted to use it outside of its originally intended domain.
B
No, it doesn't use Hadoop or MapReduce-style computation currently. Most of the genomics tools are designed to run standalone; often they're multi-threaded. There is some movement toward Spark-based versions of some of the tooling, and we're tracking that, but the field is a little bit old-school in that respect, and the MapReduce style of computation doesn't apply as readily to those tools.
A
Other questions? Jump in; otherwise I'm going to ask one about the analysis interface. Some of the layout actually looked a bit Jupyter-like. Have you had any interest from people in actually using this environment to support, say, Jupyter-notebook-like interfaces for developing notebooks and other shareable products?
B
There is, there has been a bunch of interest in that, and some people have done some work on it; I haven't looked at it recently. It's an area where we're interested in doing tighter integration. In a similar vein, Workbench 2 has much better support for pluggable viewers for different data types, which also helps: you can wire up different viewers for different data types in the system and be able to use those easily.
B
So in that case they would need a BAA with their cloud vendor, which the vendors offer, and we will also work with them. If they're doing drug development and, you know, need higher levels of certification and so on, that doesn't apply directly to us, but we would support them in that, because those certifications typically involve a bunch of training, procedural, and other requirements. So we can...