From YouTube: Centaurus Monthly TSC Meeting 11/30/2021
A
Okay, so we had two agenda items. One agenda item was that Prashant was going to provide his overview of distributed cloud, you know, from a use-case and business perspective, and where this whole edge computing and distributed cloud direction is going. But it looks like he's not here.
A
So the second agenda item: there's been a lot of work done as part of the Alnair AI project, so we thought maybe it's a good time to checkpoint and we invited Zabo to present where we are and what's been done. I think she presented about two or three months ago too, actually, so it will be useful for the TSC.
C
Oh okay, yeah, I'm still updating the slides, but I can start and we can just discuss. Okay, let me see what's going on. Okay.
C
Yeah, just a little recap before people forget what Alnair is. It is an elastic, intelligent AI platform within Centaurus. We're currently based on Kubernetes and focused on improving the efficiency of running training jobs. For AI jobs there are mainly two types:
C
The
training
and
the
other
industry
serving
workload
is
under
planning
as
well,
and
we
have
three
or
four
main
components:
the
the
fourth
one
I
haven't
typed
in
yet
a
little
update
from
a
previous
update,
I
think,
is
late
drawn
the
its
meeting.
We
add
a
couple
of
modules
in
the
main
framework.
One
is
one:
is
the
unified
training
framework
before
we
when
we
prototype,
we
only
support
the
elastic
horowitz,
which
is
one
type
of
distributed,
training
jobs.
But
now
we
redesign
the
training
framework.
C
It's a unified framework that can support both elastic and non-elastic jobs. So all the jobs are managed with one simple YAML file, and you just use a framework key to indicate the type of training job you want to run. Under the hood we have different branches to handle the different frameworks, and with that we have a global view of all the jobs.
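As a rough illustration of the "one YAML file with a framework key" idea described above: the API group, kind, and field names below are invented for this sketch and are not Alnair's actual schema.

```yaml
# Hypothetical unified training-job spec; the "framework" key selects
# which branch (e.g. elastic Horovod vs. a non-elastic job) handles it.
apiVersion: ai.centaurus.io/v1alpha1   # made-up group/version
kind: TrainingJob
metadata:
  name: mnist-demo
spec:
  framework: horovod-elastic   # or e.g. pytorch, tensorflow
  minReplicas: 2               # lower bound for elastic scaling
  maxReplicas: 8               # upper bound for elastic scaling
  template:
    spec:
      containers:
        - name: trainer
          image: example.com/mnist-train:latest
          resources:
            limits:
              nvidia.com/gpu: 1
```

Because every job type goes through the same spec, a single controller can keep the global view of all jobs that the paragraph above mentions.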
C
That makes it easier for us to do the dynamic size control, like scaling up or scaling down. Another component we added since the previous update is the scheduler, which was in the original architecture but hadn't been designed in the first phase; in the second phase we implemented the scheduler.
One main thing for the AI workload is that it requires a group of pods working together, especially in distributed training. So to avoid resource starvation, like some of the pods launching while the rest stay pending, we need this kind of co-scheduling plugin that launches all the pods or nothing. This was already done on the Kubernetes community side; we just trimmed down the code and integrated it into our Alnair project.
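The all-or-nothing behavior described here matches the coscheduling plugin in the upstream kubernetes-sigs/scheduler-plugins project. A minimal sketch, assuming that project's PodGroup CRD; the exact API version, label key, and scheduler name vary between releases, so treat these as placeholders:

```yaml
# PodGroup declaring that 4 worker pods must be schedulable together.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: dist-train
spec:
  minMember: 4        # launch all 4 pods or none (avoids partial starts)
---
# Each worker pod joins the group via a label.
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  labels:
    pod-group.scheduling.sigs.k8s.io: dist-train   # label key may differ by release
spec:
  schedulerName: scheduler-plugins-scheduler       # name depends on the install
  containers:
    - name: trainer
      image: example.com/train:latest
```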
Another special thing about this scheduler is that it is data-driven. We have a component I haven't quite finished, which I said is the fourth one:
C
A profiler component that can profile the resource usage live, in real time. So when we calculate the score during scheduling, we know for every node how much of each GPU has been used. So if some GPU's utilization is low, we can pack more jobs onto that one GPU card. Yeah, those are the two main things in the scheduler. Another main achievement is the GPU sharing. Beforehand we evaluated MPS, which is provided by NVIDIA to share a GPU by packing many processes together.
C
But
there
are
interference
between
the
process,
can
cause
unknown
crash
and
another
way
is
to
use
the
the
cuda
interceptor
way,
which
means
when
the
application
calls
the
cuda
api
driver.
We
can
intercept
within
the
user
container
and
then
to
monitor
how
much
memory
and
the
computer
resources
used
by
that
container
and
if
it
used
more
than
is
requested,
we
can
reject
that
request.
C
It's relatively easy to intercept the CUDA memory-allocation API calls, but the kernel launch, which is the side related to compute, is a little more difficult. We are still working on that.
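The accounting-and-reject logic the interceptor enforces can be sketched generically. This is a toy Python stand-in, not the real component: where Alnair wraps CUDA memory-allocation calls inside the container, this sketch wraps a plain allocation function, and all names here are invented for illustration.

```python
class QuotaExceeded(Exception):
    """Raised when a container tries to allocate past its declared request."""


class MemoryInterceptor:
    """Toy stand-in for the CUDA memory interceptor described above.

    Tracks bytes "allocated" by a container and rejects any request that
    would push usage past the container's declared GPU-memory limit.
    """

    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.used = 0

    def alloc(self, nbytes):
        # In the real interceptor this check wraps the CUDA allocation call.
        if self.used + nbytes > self.limit:
            raise QuotaExceeded(
                f"request of {nbytes} B exceeds limit ({self.used}/{self.limit} used)"
            )
        self.used += nbytes
        return object()  # placeholder for a device pointer

    def free(self, nbytes):
        self.used = max(0, self.used - nbytes)


# Example: a container that requested 1 GiB of GPU memory.
gib = 1 << 30
interceptor = MemoryInterceptor(limit_bytes=gib)
interceptor.alloc(512 << 20)           # fine: 512 MiB of 1 GiB
try:
    interceptor.alloc(768 << 20)       # would exceed 1 GiB: rejected
    rejected = False
except QuotaExceeded:
    rejected = True
```

Intercepting kernel launches is harder, as the transcript notes, because compute usage is not a single countable quantity the way allocated bytes are.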
So that's the progress for these three components. And the fourth one, the profiler:
C
We haven't added too many additional features to the profiler; we just added another database. We use a NoSQL database to store all the profiling data, so that later on, when we try to study and learn from the history, we can pull the historical data easily from the database. That's the progress in the profiler. So yeah, we'll still try to complete the whole architecture within the scope, but another thing we are looking at is the testing framework.
C
So
by
next
release
we
like
to
see
the
owner
can
run
daily
and
the
benchmark
can
give
us
some
performance
results
so
and
that's
an
overall,
updated
effort.
Anybody
has
any
questions
for
the
detail.
We
can
dive
in.
A
Zabo, you may want to talk about the collaboration work, the open-source thing we're doing with UW via the Symphony project as well, the GPU utilization. Oh.
A
Yeah, so just to give you folks some background: we're actually doing a lot of interesting work with the UW, University of Washington, folks. These folks have developed a framework called Symphony, an AI inferencing framework. It was a project called Nexus before, and now they've added a lot of new functionality to make it even better.
A
So Nexus was the previous project, Symphony is the new version of it, and there's also another project called Clockwork, from folks in Germany, I think, trying to address the same issue. But this is a pretty hard problem, because there's been a lot of emphasis on optimizing
A
the training, the model-training side, but inferencing is a big problem as well. You see, as the models get bigger and bigger, how do you have a subset of models running across a cluster of multiple GPUs, and then dynamically dispatch the incoming traffic based on the QoS, you know, the patterns and the deadlines and all that? That's what this project is all about.
A
Currently it's not really open; it is sort of open source, but we're trying to work with them, and there's some approval that needs to happen. It's going to be Apache 2.0, and the goal is to eventually integrate it as part of our projects, under the overall Centaurus umbrella. Basically, this will be our inferencing capability within the Centaurus project. So that's pretty interesting.
A
We will do another detailed overview as we go along, but that's where we are. So essentially the bottom line is that we will have the project open-sourced as Apache 2.0 and incorporate it as part of the Centaurus project.
A
Yes, yes, everything is... yeah, Nexus is working, but they moved away from Nexus to the Symphony project, which is the evolution of it, much, much better: the GPU utilization project called Symphony. I think you were aware of that.
D
Okay, just curious: in the paper they mentioned that in Nexus they are actually doing some very deep-level optimization. For example, they were trying to combine... first, they break a model down into different layers and then combine the execution of...
C
Yeah, I haven't read the details of why they need to break the model to achieve better results; I'm not quite sure about that part in Nexus. But for Symphony, they claim it's mainly a centralized load balancer, so it's basically about how to dispatch.
D
I mean, say we executed inference on the same model for about 100 requests, all using the same model to do the inference; we'd batch the requests together onto the same node. But they are trying to do even finer-grained batching or parallelism: they break the model down into different layers, different...
D
Yeah, those kinds of subsets, and then combine only the same layer from different models. But because what a user submits is just a single program for a model, like Python code, if you want to break it into different layers, logically it's doable, but from a coding perspective, especially from a cloud provider's perspective, it's a little difficult to do that.
A
...send your incoming request to the appropriate GPU, depending on the QoS and the deadlines and all that; that's the key capability of Symphony. And then the other thing was, there were a couple of things in the Nexus project; I think the emphasis of Symphony was actually what I just described, but there were a couple of very good features in Nexus that they haven't ported yet, so they need to.
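As a toy illustration of the "dispatch based on QoS and deadlines" idea just described: this is my sketch, not Symphony's actual algorithm, and every name here is invented. It shows an earliest-deadline-first queue that pops the most urgent batch of requests for one GPU to run.

```python
import heapq


class EDFDispatcher:
    """Toy earliest-deadline-first dispatcher for inference requests.

    Requests carry an absolute deadline; dispatch() pops the most urgent
    batch of up to `max_batch` requests for a single GPU to execute.
    """

    def __init__(self, max_batch=8):
        self.max_batch = max_batch
        self._heap = []  # entries: (deadline, seq, request_id)
        self._seq = 0    # tie-breaker so the heap never compares request ids

    def submit(self, request_id, deadline):
        heapq.heappush(self._heap, (deadline, self._seq, request_id))
        self._seq += 1

    def dispatch(self):
        batch = []
        while self._heap and len(batch) < self.max_batch:
            _, _, rid = heapq.heappop(self._heap)
            batch.append(rid)
        return batch


d = EDFDispatcher(max_batch=2)
d.submit("a", deadline=30)
d.submit("b", deadline=10)
d.submit("c", deadline=20)
first = d.dispatch()   # the two most urgent requests: ["b", "c"]
second = d.dispatch()  # the remaining request: ["a"]
```

A real system would also fold in batching efficiency and model placement, which is what makes the problem hard, as noted above.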
A
I think there was a concept of what they called the delta-model thing: if you have different variations or customizations of a model, they do the optimization at that level, so as opposed to treating each model differently, they just work at the delta level. So that feature... there's some work there. That's why we proposed this collaboration: once we understand what the Symphony project is, then we can add to it, and Zabo's team is already doing a lot of GPU optimization work.
A
The scheduling they use is time-slice-based scheduling, so maybe we can add more value there. So that's the purpose of our collaboration, basically, and then this will be part of open source. Yeah, their students will work with us, Professor Arvind's students, so that's the goal. This is not going to happen now; maybe around February or March. Right now we're trying to understand the code base.
A
We're getting ourselves familiar with the old Nexus code and the new Symphony code, and then we'll see what value we can add. There's a lot of work that needs to be done, so right now it's not in shape to be released publicly as Apache 2.0 and all that. That's why we're doing this work with them.
C
And the training side is not all done; we are still building the unified training job.
C
An intern is writing the code; I think he will do the final check-in this Wednesday. And for the scheduler, since we added this GPU manager plugin thing, the scheduler has to handle the fractional GPUs, which is a little different from a regular scheduling job. So yeah, there.
C
Lastly, here we just give out the whole card, like a distribution. And for the sharing: before, when we talked about it, we wanted to share the GPU during training, but after some practice and testing we think it may not be that necessary, because these days the models are big, so when people use distributed training they already ask for like 4, 8, or 16 cards. So sometimes, you know, maybe it's in inference that there would be a sharing scenario, a serving scenario.
D
For the inference scenario: in fact, several years ago, the GPU sharing project that we did, which landed in production, was also designed for inference. Only after that, an HQ guy told us there are a lot of small models in training, so it's possible for training too, but today I don't know if that's still that valuable.
A
Okay, good. So I think that's pretty much it, actually. So, Stefan, anybody, do you guys have anything to share? We still have a lot of time, but other than that, I think we're pretty much done.
A
Yeah, okay, so I think we can end this meeting then. In that case, most likely, because next month is going to be a holiday season, I mean this December, our next meeting will most probably be around a January timeframe. Yeah.
D
Exactly, yeah. It looks like we don't have many topics recently, so we will skip next month's meeting then, yeah.
A
We will push it to January. And I think there's a lot of work being done on the edge cloud side; a lot of that work we're going to be collaborating on with Stefan's team as well, so we'll give an overview as we go along, actually: the whole programming model on the edge cloud and all that. So yeah. Okay.
D
We are working on a paper, and we are using the IEEE LaTeX template, with some code snippets. Maybe I can use a couple of minutes here to see what template you guys are using. The embedded code there doesn't look that nice, so we're seeing what the right template is that we should use.
D
Are you guys able to see the screen? Yeah? So you can see, this is the IEEE LaTeX template that we got, and for the embedded code, the listing part, the font looks very weird. So just a quick check: what's the usual setup you guys are using for code listings?
E
Okay, that's a good question. I usually use listings... no, not listings; I usually use an adjusted setup: I changed to a different font, basically, and then formatted it. I can send you some examples if you like. I can't remember the exact configuration off the top of my head, but I can send you some examples from papers and how I configured those. Okay.
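For reference, a common baseline for code listings in IEEE-style LaTeX papers is the listings package with a small fixed-width font. This is a generic sketch, not necessarily the configuration discussed here:

```latex
% Preamble: a minimal listings setup for code in a two-column IEEE paper.
\usepackage{listings}
\usepackage{xcolor}
\lstset{
  basicstyle=\ttfamily\footnotesize, % fixed-width, small enough for columns
  keywordstyle=\bfseries,
  commentstyle=\color{gray},
  numbers=left, numberstyle=\tiny,
  breaklines=true,                   % wrap long lines in narrow columns
  frame=single
}
% In the body:
% \begin{lstlisting}[language=Python, caption={Example snippet}]
% def f(x):
%     return x + 1
% \end{lstlisting}
```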
A
Okay, great. Thanks, everybody, and have great, happy holidays and a happy new year to everybody. Yeah.