From YouTube: Kubernetes SIG Big Data Meeting (August 9 2017)
A: Hey, everyone.

A: Hey, thanks. Eric is waiting to take notes; it's always been a challenge to rotate that in time, so it's time we try something different.
A: Okay, so getting started. I want to start today with one announcement: we will be working more closely with the Airflow community in the future. Of the different workflow schedulers that exist for orchestrating different Apache workloads (Oozie, Azkaban, etc.), Airflow seems to be, as of now, the one with the fastest-evolving community, and it doesn't have any explicit ties to the traditional Hadoop stack, so it seems like a good fit. It'll be led by Daniel Imberman, who joins us from Bloomberg.
A: He presented to us, I think a couple of meetings ago, about how Airflow on Kubernetes might work. I think there's a lot of synergy in how this effort works with ours, because it's an integration that's going to live upstream in a way that is very similar to what we're doing with Spark. So there's a lot of knowledge to be shared there. Welcome, Daniel.
D: I think what would be really awesome is if we could come to you guys with some of the things we're working on and some of the concerns we're having, and just get the upstream perspective, to make sure we're not doing anything against proper conventions. Just a sanity check, basically, to make sure that we don't do anything dumb that gets caught upstream.
D: Yeah, absolutely. One thing we were discussing yesterday, and this is actually a good question for you: we were wondering about custom resources in OpenShift, whether that's something that will be integrated anytime soon, because that will define how we handle certain design aspects of the project.
C: I could find out. I mean, anything that ends up in kube pretty much gets exposed in OpenShift; it just becomes a question of how long the release cadence is to get it down. When are the custom resources landing again in kube?
A: In general, I think we'll have a little more discussion going on around Airflow as more people who are interested in that effort start to join, so we'll make sure that we allocate time to Spark, HDFS, and Airflow as well. So far, just to mention, the upstream community reception has been great. We had a meeting with the folks there and they loved the proposal, thanks to Daniel, of course, so I'm optimistic that we can make some headway here and get it upstreamed sooner.
C: Yeah, I'm not going to go through the whole list. I mean, the big one I wanted to talk about was the breaking change, because that was basically hosing everybody's integration test runs, especially off of 2.2. So for anybody out there who's still having trouble getting an integration test run, a rebase off the latest 2.2 should fix it. Other than that, yeah, I guess that's it.
A: Yeah, that is awesome. So does anyone have concerns about any other bug fixes that they think should get into the 2.2 release that you're cutting? Because we did cut a 2.1 release yesterday, which had a couple of bug fixes, including the DNS fix that Kimoon wrote.
A: All right, so yeah, I thought that was a pretty good overview overall, and it should probably exist somewhere in the main repo, as the first thing people see when they visit, just to give a high-level idea of what is happening in there. The picture was really good, so if you could send a PR for that, I think it should also be part of our 2.0 branch. Also, any other blockers on the documentation front?
A: Okay, all right. The other thing is the design proposal. I updated it yesterday to include people's comments.
A: Right, yeah, I agree. We would probably be rushing it if we were to try and target this release point; it seems very early, so I'd like to get more discussion around this. If people can spend some time trying it out, or just give feedback on how this pseudo client mode thing could be exposed such that it throws a meaningful error message when trying to run it from outside the cluster, that would be useful.
C: Here's what's going to happen: assuming we get a release cut today or tomorrow, my plan was to announce that, and then probably in the same announcement I thought I'd throw in a teaser for the SPIP, because that's not going to land quite yet, but I just want to say that it will be arriving. Yeah, that makes sense.
A: I think, yeah, the update in itself is going to be pretty big, and it is something we should send out separately, saying 2.2 is what we rebased to. Then, when we send out a subsequent email proposing this as a feature, the release is one of the things we can link back to. So yeah, let's say after the release we announce it to dev and user; makes sense.
B: So yes, on our end, if you check out my most recent push (I forget what the PR number is, maybe 409), it passes the integration test for Kerberos. We're having slight issues in terms of the Spark configuration you're passing down to the driver and executors: when you define, say, a spark.hadoop.* variable, that's causing some problems right now with the example.
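For context on why a stray spark.hadoop.* entry matters: Spark forwards any configuration key with that prefix into the Hadoop Configuration with the prefix stripped. A minimal sketch of that forwarding (illustrative Python, not Spark's actual Scala code; the conf values here are made up):

```python
# Illustrative sketch: keys prefixed "spark.hadoop." are copied into the
# Hadoop Configuration with the prefix stripped, so an entry such as
# "spark.hadoop.hadoop.security.authentication=kerberos" silently
# kerberizes the driver and executors.
def hadoop_conf_from_spark_conf(spark_conf):
    """Return the Hadoop-level settings implied by a Spark conf dict."""
    prefix = "spark.hadoop."
    return {k[len(prefix):]: v
            for k, v in spark_conf.items() if k.startswith(prefix)}

conf = {
    "spark.app.name": "hdfs-test",
    "spark.hadoop.hadoop.security.authentication": "kerberos",
    "spark.hadoop.dfs.replication": "2",
}
print(hadoop_conf_from_spark_conf(conf))
# {'hadoop.security.authentication': 'kerberos', 'dfs.replication': '2'}
```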
B: So in essence, it's asking for pretty much a Kerberos realm configuration file, a krb5.conf file, and it needs to be stored in the driver and executor. Well, that's not actually necessary, because per Kimoon's investigation, all you have to do is define the HADOOP_TOKEN_FILE_LOCATION environment variable.
B: You don't need any more Kerberos information, and so the fact that the resource staging server is causing it to require Kerberos to be true is kind of weird, and I can't seem to figure out why. My most recent push is pretty much just debugging, because locally on my computer everything is passing and there are no config clashes; I'm wondering what is going on in Jenkins that is causing it to be different.
B: As for the versions, technically I don't know what kube version Jenkins is using; I'm using v1.5, I think, or 1.6. I don't think the kube version is the focus here. It's interesting that whenever you launch the resource staging server it's asking for the Kerberos realm; you see the errors coming up there, but they're not coming up locally.
G: I personally encountered the same issue, and it is a krb5 configuration issue, when I manually submitted Spark to our test environment that has a single Kerberized node in it. So I think that's good news, because I think I can reproduce this issue independently of the Jenkins integration test environment and maybe dig more into the root cause.
G: I think, though, this actually boils down a little bit to a design challenge that we have. Basically, it is true that the driver and executor actually do not need to use Kerberos at all; you can just use a delegation token. But if somehow a user actually mounted custom config files, something that enables Kerberos, then it will suddenly be required, along with other things. So our first principle is that we definitely do not want to accidentally enable Kerberos ourselves.
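The decision being described can be sketched as follows. This is an illustrative model, not Hadoop's actual UserGroupInformation code; the environment variable and the hadoop.security.authentication key are the real Hadoop conventions mentioned above, while the function itself is hypothetical:

```python
# Illustrative model of the auth decision: a pod whose environment carries
# HADOOP_TOKEN_FILE_LOCATION (a pre-fetched delegation token file) never
# needs krb5.conf, while a pod whose mounted Hadoop config flips
# hadoop.security.authentication to "kerberos" suddenly does.
def required_credentials(env, hadoop_conf):
    if "HADOOP_TOKEN_FILE_LOCATION" in env:
        return "delegation-token"   # no Kerberos realm config needed
    if hadoop_conf.get("hadoop.security.authentication") == "kerberos":
        return "krb5.conf"          # full Kerberos setup required
    return "simple"                 # plain, unauthenticated Hadoop

# In this model, a pre-fetched token short-circuits the Kerberos path
# even if the mounted config enables security.
print(required_credentials(
    {"HADOOP_TOKEN_FILE_LOCATION": "/mnt/secrets/hadoop.token"},
    {"hadoop.security.authentication": "kerberos"},
))  # delegation-token
```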
B: No, it's just that the staging server itself doesn't need to be used for the HDFS stuff. In my test, if I don't launch a resource staging server, everything passes, and for all other use cases where you don't launch the resource staging server, all the tests pass as well. Then the second the resource staging server is launched, for some reason it's expecting to have a krb5.conf, even though I don't explicitly define anything else in the Spark configuration, and when I log the Spark configuration in every before clause, it doesn't contain it.
G: Well, I have one theory as to why the integration test environment is expecting you to define the conf. Basically, we are trying to mount Hadoop config files inside the driver and the executor, and if those config files accidentally enable Kerberos, that would trigger this "okay, I need a Kerberos configuration" behavior. Correct, but we are not mounting them.
A: I think that's a huge win, because I don't know of other integration tests that would actually be capable of bringing up HDFS in order to test it, so that's pretty cool. All right, so any other concerns? I know you guys know eBay were interested in this particular feature too, the HDFS support, so at some point, when it's deemed stable and they've got some concrete testing done, we should probably cut another point release with this feature in it.
G: I have one question about the integration test environment. In our PR, I noticed that we are actually using Docker images for this single-node HDFS pod and also maybe a couple of other things. Previously, I think we built every Docker image that we use in the integration tests ourselves as part of the build process.
B: Because of the data transfer, it requires all the client versions to be the same or it breaks for us, and I guess the encryption is different across versions, so there is a hard requirement that the Hadoop version you're running in Spark is the same.
B: So Kimoon actually bumped into a couple of issues when running this against his own cluster, so we know the issues he already ran into. What I'm going to do is write up a "running Spark on kube with HDFS" doc, and I'll just cover the issues that might come up and what you might be missing, kind of like what Cloudera did in their security doc.
A: Yeah, it was just a thought. It might not be useful, because it means bringing up a more heavyweight cluster, though I guess you could do one node even for that. So yeah, maybe part of that setup can be shared with what was built for the integration test; a lot of it might be reusable. Right, right, yep, cool.
G: There's an issue about the specific user account that the driver and executor are supposed to use when you send requests to HDFS. Currently it only appears as, you know, the root UNIX user, and there's this environment variable called SPARK_USER, and we need to test that the environment variable support isn't somehow broken for Kubernetes.
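The lookup being tested can be sketched like this (modeled on Spark's behavior of consulting the SPARK_USER environment variable before falling back to the current OS user; the helper function itself is illustrative):

```python
# Illustrative sketch: on the driver/executor pods, the SPARK_USER
# environment variable, when set, overrides the OS-level identity that
# HDFS would otherwise see, which inside a container is typically root.
def effective_hdfs_user(env, os_user):
    return env.get("SPARK_USER", os_user)

print(effective_hdfs_user({}, "root"))                       # root
print(effective_hdfs_user({"SPARK_USER": "alice"}, "root"))  # alice
```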
F: Yes, this is Daniel Whitenack. I just wanted to thank Eric for the invite here and introduce myself. I work on the Pachyderm open source project; I don't know if that has ever been brought up here or if you guys are familiar, but basically we're doing distributed data pipelining and data versioning on Kubernetes, so you can set up data pipelines with stages in any language or framework you want, and we basically handle the pipelining and the versioning on top of Kubernetes.
F: So I just wanted to introduce myself, and I would love to be part of this conversation, especially related to Airflow things and pipelining things. We have users running all sorts of pipelines that are triggered in various interesting ways on Kubernetes, including TensorFlow machine learning pipelines, and we've had people run Spark in pipelines, or just distributed ETL sorts of things. So I'm happy to be part of this and happy to give my perspective from that direction.

A: Awesome.
F: So basically, we're not opinionated at all about what data is in those repositories or what code you run. So you can definitely break your own pipeline if you make a change to your data format or something like that, but we also meant for you to update your pipeline stages. If you added a new column in a dataset, say, what you just need to do is update the respective processing step to be able to handle that new column.
F: Yeah, so basically there's a Pachyderm daemon that runs as a pod in Kubernetes, along with etcd, and the data itself lives in a backing object store. You can use S3 or one of the cloud offerings, or you can use something open source like Minio running in Kubernetes. Anyway, what happens is that when you put your data in, Pachyderm content-addresses that data, creating a hash, and then stores the data in the object store; this is done by the daemon that's running as a pod in kube.
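The content-addressing step just described can be sketched as follows (a minimal illustration of the idea; Pachyderm's real chunking and hashing scheme is more involved):

```python
import hashlib

# Minimal content-addressed store: the key is the hash of the bytes, so
# identical content is stored exactly once and any key deterministically
# identifies its data.
object_store = {}

def put(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    object_store[key] = data   # re-putting identical bytes is a no-op
    return key

k1 = put(b"row1,row2")
k2 = put(b"row1,row2")
print(k1 == k2, len(object_store))  # True 1 (duplicates are deduplicated)
```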
F: Yeah, so PFS is our Pachyderm file system, which is basically a distributed file system that's backed by the object store. One of the pieces in PFS is metadata associated with jobs and other things, but the rest is really the user's data, which again is stored via hash. We then organize that data into collections called repositories, which are really just metadata; all of the actual data is stored in the object store.
F: It's either the whole repository or a part of the repository. Basically, because we're aware of what data has changed, if you only need to run your processing on new data, like when you're doing a map operation or something like that, then Pachyderm will only ship in the new data from the input repository. Or, if you're doing model training, where you maybe need all of the data from a repository, you can also specify that, and it will be given to each respective worker to process. Okay.
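The incremental behavior described above, shipping only changed data to a map-style stage, can be sketched with a hash diff between two commits (the function name and commit representation here are illustrative, not Pachyderm's API):

```python
# A commit is modeled as {path: content-hash}. Only files whose hash is
# new or changed relative to the previously processed commit are handed
# to the worker.
def changed_files(old_commit: dict, new_commit: dict) -> list:
    return [path for path, h in new_commit.items()
            if old_commit.get(path) != h]

old = {"a.csv": "h1", "b.csv": "h2"}
new = {"a.csv": "h1", "b.csv": "h3", "c.csv": "h4"}
print(changed_files(old, new))  # ['b.csv', 'c.csv']
```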
F: Yeah, so basically we take the view that storage via object stores is relatively cheap, but there are definitely cases where you don't necessarily want to store everything for long periods of time. So for files that are deleted over time, we do have garbage collection functionality where you can clean those up and get rid of them. One of the things that we really stress is reproducibility and provenance.
F: Yeah, so we're working on adding YAML right now; you provide a JSON specification that declaratively says, basically, create a pipeline stage named this, that uses this Docker image, runs this command in the Docker image, and processes this input. That's the basic format, and you can also specify things in there like "hey, I want to run this on a GPU node" or "this requires these resources," and behind the scenes Pachyderm again uses the Kubernetes API to make sure that things are scheduled on the proper resources.
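A hypothetical spec in the shape described (the field names here are illustrative and may not match Pachyderm's exact schema):

```python
import json

# Declarative pipeline stage: a name, a Docker image and command to run,
# the input to process, and resource hints for scheduling.
spec = {
    "pipeline": {"name": "word-count"},
    "transform": {
        "image": "ubuntu:16.04",
        "cmd": ["bash", "-c", "wc -w /pfs/input/* > /pfs/out/counts"],
    },
    "input": {"repo": "input", "glob": "/*"},
    "resource_requests": {"memory": "512M", "gpu": 0},
}
print(json.dumps(spec, indent=2))
```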
F: Definitely. That's why I think it's good, and I am happy to be joining in at this point, because a lot of people are using Airflow and there are probably some complementary things as well. Also, we've kind of dealt with some of these issues around kube and that sort of thing, so it'd be helpful to be part of that discussion, and hopefully at least express how we've done things and get some feedback on that as well.
D: And also, from our perspective, Daniel, if there's any help you can give us on our end with our PR; like, one thing I'm personally imagining is that there are two types of customers for Airflow versus Pachyderm, so if there's any way that we can help each other out and just interchange ideas, I'd definitely be very interested in that. Yeah.
F: Not as of yet. Most of the Airflow users that have come to Pachyderm have manually translated their things into the Pachyderm world, but there has been talk about potentially doing some sort of transpiling, where you could run an Airflow pipeline as a Pachyderm pipeline. There hasn't been any formalization around that yet. Okay.
F: And yeah, part of that is because these individual stages are spun up as sets of pods on Kubernetes, and this is also how we do data sharding, parallelism, and scheduling. Distributed, containerized pipelines are really what we've focused on, and as far as other schedulers go, Kubernetes is really the only one that provides enough support for that sort of thing right now; that's why we're pretty tied to it. But yeah, I totally agree.
A: Okay, looks like we're going to end early this week. I'm looking forward to finishing up stuff on Spark and sending that out, so maybe early next week. The call to action for people would be PRs, reviews, any of that stuff; if you have cycles, please help us with that. That's all.