Description
Date: 6/7/2019
Presenter: Jim Wilgenbusch
Institution: Minnesota Supercomputing Institute
Midwest Big Data Hub
A
A
B
B
I've got a lot of stuff to talk about, some of which I'm an expert in and a lot of which I'm not, but I think that's the case with all of us when we start to get into these big cyberinfrastructure projects. So what I hope to do in the next 15 minutes is at least give you some idea of what we're doing, in case there's interest in going further.
B
Of course, we've got some time for questions at the end, and I'd love to see emails from you. If there's something that you'd like to go deeper on, we'll certainly be able to do that. So that's my way of apologizing a bit for going through a lot of information in very little time. I want to start off with the Water Resources Center, just to introduce a few of the players here.
B
The Water Resources Center is at the University of Minnesota, and its mission, as it says here, is to advance the science of clean water for Minnesotans through innovation, workforce development and knowledge exchange. Beyond the state of Minnesota, we're also part of NIWR, which is a national organization that, as you can see, involves all of the US as well as its territories. So from a water perspective we have this overarching mission and purview, and we fit into a larger national effort.
B
B
B
I'm going to take you through two projects, just to give you a taste of the scope of what we're dealing with, specifically around water. The first one is a socio-economic survey project. Specifically, we're looking at surveys of about 4,000 Minnesota farmers across five designated areas, and I've listed the areas there. The PIs are listed here, and we are just starting with this project.
B
It's a great area for us to go into, because this socio-economic survey data has a whole set of challenges and opportunities that we've actually touched upon in some other spaces, so you'll see the relationship there. But we've literally just launched this project within the last couple of weeks. Another one that we're a little further into is the integrated watershed economic modeling. Basically, this has the goal of looking at ways that different scales of crop management can impact nitrogen reduction goals.
B
The URL is pasted there at the bottom of the slide, and you can play around with looking at various aspects of the lakes in Minnesota. Moving forward, we're really interested in developing better means for integrating these data with some of our other spatially distributed data, as well as with temporal trends around lake water quality.
B
So the types of data are as you might expect. In project one, the socio-economic survey data are in the form of numeric responses, typically stored in Excel spreadsheets. The second project, the integrated watershed management project, gets a little more diverse: we've got, you know, spatially explicit data.
B
So that's probably not unfamiliar to many of you here who are developing CI. In a nutshell, the overarching requirements that we had were to accommodate licensed software, potentially installed by the user; to build some new data analysis pipelines, in other words to convert some of these existing, maybe commercial-software-based workflows into Python or R; to streamline the data cleaning part, so that there could be more of an automated component to getting data into a platform; and to make it easy to find and analyze.
B
These are relatively diverse data types, and we need to analyze data over broad geographic areas and also longitudinal time spans, and to move data between being completely private and open. I know, I'm a big fan of open data, but the reality is that in many cases there are very valid privacy considerations around the data that we're collecting, especially when it comes to survey data.
B
This is where I switch gears a little bit and flip into platform mode. We've been developing a platform that focuses on agricultural informatics, but it has a lot of attributes that fit well as we began to work with the Water Resources Center, specifically in that it enables these public-private research collaborations quite nicely.
B
It accommodates a broad set of different data types. GEMS is the name of the platform, and that name represents the different types of data that GEMS is able to ingest: genomics, environmental, management and socio-economic data types. It's built explicitly to accommodate temporal data, data dispersed over time, as well as data dispersed over space, and that was also something that fit really well with some of the requirements that we had in working with WRC, the Water Resources Center.
B
How do we do it? Probably in ways very familiar to most people on this call. We leverage a lot of the growing ecosystem of tools that makes things like this much easier. In particular, we are, from the ground up, a containerized system. I'll talk specifically about how that helps, but there are a number of tools, some of which I'll call out and some of which I just won't have time for, that have greatly helped us advance past some of the obstacles around regional data sharing efforts.
B
One of the first things I want to focus on, again, is this containerization foundation and how, specifically, it's helped us around platform portability. We built the GEMS platform specifically so that it can be run on clusters, workstations of modest size and laptops, and everybody here who's involved in any form of virtualization knows that.
B
High-end workstations give you the flexibility to work with groups that are very sensitive about where their data are going: you could potentially run the platform behind their firewalls, with less emphasis on data sharing but more emphasis on the capabilities of the platform to manage data and on the tools that are available. And I have to emphasize that, from a developer standpoint, we started this way because we wanted to be able to develop this platform whether or not we were connected to one another.
B
So just having that portability gives you an opportunity to develop tools and other features without necessarily being connected to a single system. Switching a little bit now to the specifics of the platform, one thing that I wanted to emphasize here is the importance of some of these other efforts that are happening, that we're probably aware of, and if not, I wanted to make sure folks were. Right away, we knew one of the issues that we would have is, how do we...
B
So our tools specifically are geared around being able to select geographic areas of interest and then discovering what datasets might be available for you to play with or to potentially integrate into your analyses. I'll talk a little bit more about that piece, because there are a couple of layers in terms of the way we make data discoverable that I think are interesting and somewhat unique. And actually, this sort of gets to it: what happens once you've selected data.
B
There are fairly easy and intuitive ways that many of the platforms today allow you to browse and to filter and so forth. We have a way that you could copy the data, once they're discovered, to the container where your analyses are running, or you could preview the data before you maybe move them into your analysis space. But importantly, there's also a way to ask whether you can have access. So we disentangle the metadata from the actual data.
B
In many cases, there needs to be some understanding about how those data will be used before those data can be inspected or analyzed by a third party. We call that selective data sharing, so that you can share your data more selectively, and that could also involve a timed embargo. So imagine the case where you just want to wait, or you have to wait for some reason, before you can share that more broadly with potential collaborators. We've done a little bit of work getting further into this.
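As an illustration only, the idea of keeping metadata separate from the data, searching by area of interest, and gating access with an optional timed embargo might be sketched like this. Everything here is hypothetical (the `DatasetRecord` fields, the status strings, the rule that a lapsed embargo still requires a request); it is not the GEMS data model or API.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class DatasetRecord:
    """Metadata only; the data themselves live elsewhere (hypothetical model)."""
    name: str
    bbox: tuple  # (min_lon, min_lat, max_lon, max_lat)
    owner: str
    shared_with: set = field(default_factory=set)
    embargo_until: Optional[date] = None  # None means no timed embargo


def intersects(a, b):
    """True when two (min_lon, min_lat, max_lon, max_lat) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def discover(catalog, area):
    """Search the metadata catalog by area of interest; no data are touched."""
    return [rec for rec in catalog if intersects(rec.bbox, area)]


def access_status(rec, user, today):
    """Decide what a user may do with the data behind a metadata record."""
    if user == rec.owner or user in rec.shared_with:
        return "granted"
    if rec.embargo_until is not None and today < rec.embargo_until:
        return "embargoed"  # assumed rule: no requests until the embargo lapses
    return "request-access"


# Made-up catalog entries for illustration.
catalog = [
    DatasetRecord("farm_survey", (-96.5, 43.5, -93.0, 46.0), owner="alice",
                  shared_with={"bob"}, embargo_until=date(2020, 1, 1)),
    DatasetRecord("lake_quality", (-94.0, 44.5, -92.5, 47.0), owner="carol"),
]
area_of_interest = (-95.0, 44.0, -94.5, 45.0)
for rec in discover(catalog, area_of_interest):
    print(rec.name, access_status(rec, "dave", date(2019, 6, 7)))
```

The point of the sketch is that discovery only ever reads metadata, while the access decision is a separate step that can say "not yet" or "ask first".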
B
Now, as we develop specific collaborations, we're developing custom tools. These tools are again built within this containerized framework, and they allow people to do some common filtering of the available data that is geographically explicit or has a temporal component. This is just an example from one project we're working on, where people need to be able to overlay different types of satellite imagery and better understand what sort of crops might be of interest based on these different bands.
B
These are the spectral bands that are available in the data sets that we have in that working group. Very importantly, we build this on a Jupyter notebook, so we're heavily leveraging an interface that's increasingly, fortunately, familiar. You can see my bias already: I'm a Jupyter fan, and there are a lot of people who are now using this, so it gives us a little bit of a head start in terms of onboarding.
B
We've also been contributing to Jupyter by adding some support for a VNC remote desktop environment, so through Jupyter you can get a full desktop. This is great for a lot of our projects, like the ones that I just mentioned, because they've got a lot of software, or a lot of workflows, that don't necessarily fit easily into a Jupyter notebook, and this is a good way for us to at least accommodate some of their analyses without having to completely retool everything.
B
But our approach to it is that we're creating this wizard-like data integration tool that takes people through uploading what we call products, because we refer to those broadly as both data and analyses or workflows. I'm kind of skipping over some steps here.
B
Just to give you a sense of what's happening here: this has so far been very focused on ag and ecological data. As part of this process, and part of this attempt to better integrate datasets, we pay attention to whether these data might already be part of a standard ontology or taxonomy, or some known vocabulary, so that we can begin to put some structure around what it is that you might be talking about.
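One way to picture the vocabulary check described above is matching uploaded column names against a list of known terms. This is a minimal sketch under stated assumptions: the vocabulary entries, the `match_term` helper and the 0.8 similarity cutoff are all made up for illustration, and real ontologies carry far more structure than a synonym list.

```python
import difflib

# Hypothetical controlled vocabulary: preferred term -> known synonyms.
VOCABULARY = {
    "crop_yield": ["yield", "grain yield"],
    "nitrogen_rate": ["n rate", "nitrogen applied"],
    "planting_date": ["plant date", "date planted"],
}


def normalize(label):
    """Lower-case and collapse separators so 'Grain_Yield' becomes 'grain yield'."""
    return " ".join(label.lower().replace("_", " ").replace("-", " ").split())


def match_term(column, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the preferred term for a column name, or None if nothing is close."""
    lookup = {}
    for term, synonyms in vocabulary.items():
        lookup[normalize(term)] = term
        for syn in synonyms:
            lookup[normalize(syn)] = term
    close = difflib.get_close_matches(
        normalize(column), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[close[0]] if close else None


# Columns from an imagined uploaded spreadsheet.
for col in ["Grain_Yield", "Nitrogen-Rate", "farmer_id"]:
    print(col, "->", match_term(col))
```

Columns that come back as `None` are the ones a person would need to describe by hand; the fuzzy match only suggests a term and does not replace review by someone who knows the data.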
B
Another thing that we're adding some value to is what my stats professor, when I was in grad school, used to drill into my head: garbage in, garbage out. We want to try to figure out ways to prevent that by looking very early on, in the upload phase, at where there might be some data error. One way we do that is that we've provided some generic tools to try to alert someone to potential outliers.
B
As you begin to upload your data, you can select some of these different tools for visualizing the data before it's actually been registered to the GEMS platform, to try to keep potential errors out of there. Likewise, and this is definitely in the same vein, for every project that I've been involved in that includes spreadsheets...
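The talk does not say which outlier checks are used at upload. As a generic illustration, one common approach is Tukey's interquartile-range fences, sketched below with the Python standard library. The function name, the `k=1.5` default and the sample values are assumptions for the example, not anything taken from the platform.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return (row_index, value) pairs that fall outside Tukey's fences.

    The fences are Q1 - k*IQR and Q3 + k*IQR; k=1.5 is the conventional choice.
    Missing values (None) are skipped rather than flagged.
    """
    present = [v for v in values if v is not None]
    if len(present) < 4:
        return []  # too few points for quartiles to mean much
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values)
            if v is not None and (v < low or v > high)]


# Made-up survey column with one probable data-entry error.
yields = [52, 48, 50, 51, 49, 53, 47, 500]
print(flag_outliers(yields))
```

A check like this only alerts the person uploading; whether a flagged value is an error or a real extreme is still a human decision.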
B
If there are concerns about how the data are going to be used, this last step exemplifies what that might look like: you could specifically change some of those attributes, like the team members who have access to it, and you can see various metadata components that would be part of what was advertised to a broader community.
B
The last little bit that I wanted to mention is that we've got concerns in the ag space, and I think this is true of other spaces, specifically about who owns the data and how it's protected. Importantly, I think this really gets into areas that are administrative, technical and physical in terms of protecting the data. These are things that we're looking at, and I'd be happy to go into more detail in another venue.
B
The other way, again on the technical side, that we're looking at these, and we're being helped by a number of things that are out there, is data in motion, data in use and data at rest. And then, just importantly, you can do a lot to protect data from those standpoints: physical, technical and administrative.
B
Those same levels of protection. For future work, we're looking at better defining our API. We actually have APIs now, but we haven't advertised those in a way that lets other platforms more easily pull data from or push data into GEMS. Likewise, we need to do a lot more work when it comes to being able to access other folks' data, and so we're doing some work in that space.
B
We also have a great group right now who are looking at integrating IoT platforms, and the data that they collect, with the GEMS platform, and so we've got some really cool proofs of concept going on right now. I was just over on the St. Paul campus, where some of this work is being done, and there was a coterie of students actually scattering field sensors all over the campus there, which I'm hoping didn't get run over by lawn mowers today.
B
But this is an area that we're excited about. The last future thing that's very much on our roadmap is federating. We have an existing collaboration now with Stellenbosch University in South Africa, and we're beginning to develop and, excitingly, make plans to codify this federation so that data could actually be transmitted over very broad geographic areas. So we're excited about that, and there are other collaborations along the federation lines that we're working on right now.
B
B
A
B
A
Really appreciate the user focus on discoverability and meeting people where they are with the workflow. We're always interested in what sort of successes you've had, or opportunities you think are out there, for community feedback on data quality and for actually looking to improve, you know, the fundamental state-issued or nonprofit-issued data sets. Could you maybe speak a little bit to that? You mentioned linking to, like, standard ontology.
B
I hear you on that, Meredith. That's been one of the great things about working with this group: before we started building things, and when I say building things I mean building technical things, we first started building community. One of those communities is the IAA, the International Agroinformatics Alliance, and ag, you know, very clearly transcends borders.
B
We wanted to make sure that when we began to address some of these things, like ontology, we had broad community input, recognizing that we weren't going to be able to force everyone into one particular ontology, and that we needed to hear what was out there and adapt to it. So we've been working through that group, specifically around data standards and ontologies, and that's been extremely useful. It's opened us up to a lot of groups here in the US and abroad who are really the leaders in that space.
A
...presumably spent, you know, minutes or hours or a huge chunk of their career working on the data. Have you seen any successful mechanisms for their review of the data getting back to the data stewards? Or is that potentially an opportunity where the hubs could add some value?
B
I think, in fact, I'm glad Melissa actually brought up Jeff and his role, because at the University of Minnesota our focus going forward will be on water resources, and one of the things that we specifically put into the proposal was the idea of convening first. The convening will be specifically around water resources, the data standards and needs of that community, so that whatever development we do would be focused on those needs.
B
And so that's top of the list. Melissa could echo this, I'm sure, but it has just been made official, and there hasn't been an official announcement yet, because there are press people, various communications people at the universities, who are putting that together. So that's going to be one of our first focuses, though.
So
that's
really.