From YouTube: Intro to GPU: 04 NVidia Software Stack, Part 2
All right, let's continue with the session. Where we left off, I had described two of the three approaches for doing accelerated computing: we had the libraries, and we had the directives. Libraries are the most straightforward and easy to use. They offer high-quality implementations of common operations, but they are not that flexible; they have a set number of relatively well-defined things they can do, mostly mathematical operations. Directives allow you to target particular loops in your application and give the compiler a way to expose that parallelism to the hardware.
So you don't have to know anything about the specific low-level programming languages or models that are used for the GPU — although, of course, having some familiarity with them will benefit you, I think. So even if you decided to use OpenACC as your programming model, it would still, I think, behoove you to learn CUDA — CUDA C or CUDA Fortran, for example — because in doing so you would learn a little bit more about the architecture and how work is mapped to it. So it's certainly a benefit to you to do so.
But you're not required to do so for OpenACC, and you can still get a reasonably high-performance implementation. Programming languages are for maximum flexibility: you have some workload that either is not easily expressed as the sort of loop that can be parallelized well by a compiler, or you want to control how that parallelism is mapped to the hardware, because you think you can do a particularly good job for that work. Now, this is not a trivial thing to do. It is very possible to get this wrong.
I would say that's because GPUs are complicated. In fact, any modern processor is complicated, but GPUs are complicated, and performance on GPUs is complicated. So do not assume that on day one you will be able to write code that achieves the seven teraflops per second the GPU potentially exposes. But it's also not hard, I would say: CUDA is not, for example, a very hard language to learn; it just requires some time and practice.
So this is an overview of the different programming languages that are available for working on GPUs. Of course, we have bindings in the standard HPC languages — C, C++, and Fortran — and those are actually exposed in a couple of different ways. CUDA is NVIDIA's umbrella framework, or architecture, for doing GPU computing, and CUDA directly has bindings included in C, C++, and Fortran.
If those are the sorts of things that you prefer to use. So here's an example in C of the CUDA implementation — of how you'd parallelize some work on the GPU. On the left, we have our serial implementation; again we're coming back to this saxpy operation. What we have is our interface, where we're given vectors x and y, and we have the length of the arrays and also the scaling factor for the a·x plus y, in serial C code.
What's different in a language like this is that you have to identify how to map the parallelism on the GPU to the work. Whereas in an approach like OpenACC you can put a directive on top of this loop and the compiler will figure out how to do that, in this code you have to explicitly identify which thread on the device handles which index in the loop.
We are now in control of that, rather than the compiler doing it for us. The other bit of different syntax is that we now have to use this syntax here, with triple chevrons, to launch work on the GPU. I won't go into too many details about what this means, but essentially, if you multiply these two numbers together, that says how many threads are spawned on the GPU to do parallel work. So you can see that
this is actually quite a large number of threads, and that's very typical for GPU programming: you're spawning a very large number of threads — ideally hundreds of thousands, or something in that ballpark — to do parallel work. The other piece of the change that we make is that we add this keyword, the attribute global, which is the CUDA syntax for saying that this is a function that can be launched to do work on the GPU, and then inside of it we can expose the work to parallel threads.
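As a rough, minimal sketch of the pattern being described here — not the exact code on the slide — a CUDA C saxpy kernel and its triple-chevron launch might look like this (d_x and d_y are assumed to be device pointers allocated and filled beforehand):

```
// __global__ marks a function that can be launched to do work on the GPU.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    // Each thread computes its own global index from its block and thread IDs
    // and handles exactly one element of the loop.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

// Launching the kernel: the product of the two numbers in the triple
// chevrons is the total number of threads spawned on the GPU.
// int threadsPerBlock = 256;
// int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
// saxpy<<<numBlocks, threadsPerBlock>>>(n, 2.0f, d_x, d_y);
```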
Probably the most popular example today is that you can define lambda functions in your C++ code that can then be captured and run on GPUs, and this is the fundamental piece of technology underlying, for example, Kokkos and RAJA, which are two of the big performance-portability layers being developed by DOE. So the idea is that you identify a chunk of work — a loop iteration — with an index.
You capture that in a lambda, and then the underlying library — the performance-portability model — does all the work of distributing that work across parallel threads (a rough sketch of this pattern is below).
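As a hedged illustration of what such a layer builds on — not Kokkos or RAJA themselves, just the underlying CUDA mechanism — a generic kernel can accept a lambda describing one iteration's worth of work. The kernel name for_each_index is made up for this sketch, and compiling device lambdas requires nvcc's --extended-lambda flag:

```
#include <cuda_runtime.h>
#include <cstdio>

// A generic kernel that applies any callable (e.g. a lambda) to each index.
template <typename F>
__global__ void for_each_index(int n, F f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f(i);
}

int main() {
    int n = 1 << 20;
    float a = 2.0f;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // The work for one loop iteration is captured in a lambda; the launch
    // machinery decides how it is spread across parallel threads.
    for_each_index<<<(n + 255) / 256, 256>>>(n, [=] __device__ (int i) {
        y[i] = a * x[i] + y[i];
    });
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```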
So that's kind of the same effect you would get from using something like OpenACC or OpenMP, where somebody else figures out how to map the work to the hardware. Your job is to tell the interface how much work you have to do and then, for each element of the work, what it is that you want to do. And so this is an example of templated C++ code — a functor, or a class — where you could do similar types of operations. This is very expressive, although it's not fully featured: there are certainly some C++ things you cannot do, and because of the way CUDA is architected there are limitations on this, but a lot of modern C++ can be done in CUDA C++. One question?
Yes?
I can give a couple of different answers to that question. The practical answer — the political or practical answer — to that question might be that OpenACC and Kokkos are developed by different entities that have different goals. OpenACC is primarily vendor-supported functionality: NVIDIA/PGI implements OpenACC, Cray has implemented OpenACC in the past, and GCC now implements OpenACC. So that's really a community- and vendor-supported thing that is quite generally expressive and targets specific loops.
But it's not necessarily the right approach for complicated pieces of work, or ones that use, for example, advanced C++ features, although you can do C++ in these approaches. Kokkos, as an example, is developed within DOE — really developed in particular by Sandia, primarily, although they have collaborators at plenty of other labs, including Oak Ridge — and they are targeting problems of interest to DOE. So, for example, their fundamental workload, the thing that they care most about, is unstructured mesh problems.
That's really where it got started — for example, in the Trilinos library at Sandia — and then they built up from that. And so they have pieces of work — for example, multi-dimensional loops — that are really targeted to work well for typical DOE problems of interest, and that means they do some things very well, and some things are simply not part of what they try to achieve. The other thing I would say is that this gives DOE some control, right?
It gives the community control over how this works, and Kokkos is implemented on many major backends; they work closely with the vendors to do it. So I would say both have their strengths, and I'm not going to say one is obviously better than the other, but you might consider using OpenACC for very simple for loops — simple C or simple Fortran loops where you want maximum ease of implementation. Kokkos is much more high-powered.
It gives you really advanced controls over things like memory layouts and how the work is implemented, but it requires more work: it is complicated C++ code and takes some training to figure out, so there is a trade-off there. The one other thing that I would say is different is that Kokkos has built an ecosystem around it. For example, they have a product called Kokkos Kernels, which is basically an implementation of various traditional high-performance-computing math operations — matrix multiplies, for example — that can be run portably on many architectures.
You could have that one interface that can then run in multiple places. Whereas if you're using, say, BLAS on some system, often you can just link against a different library and get it working, with cuBLAS, for example, you have to actually make some changes to your implementation.
You have to write some code differently. The promise of something like Kokkos Kernels is one interface where they take control over all of the work, dispatching it to different backends. So they're both solving similar problems, but in different ways, and targeting slightly different audiences, and I'd be happy to have a conversation with any of you offline if you're curious which approach makes the most sense for you and your code.
CUDA also has an implementation — an API exposure — in Fortran, and it works in a kind of similar way: the idea is that you mark up a subroutine with a global attribute, which says this is work that can be launched on the GPU. You can pick out which thread you are and then do a piece of work on that thread. You launch the work using the CUDA syntax with the triple chevrons, which tells it how many threads to launch, and the interface is also very Fortran-like.
So you can add an attribute to an array — for example, device in this case — which basically says: allocate this on the device. So it is very Fortran-like syntax. This was developed originally by PGI, and there is one other implementation of it, by the IBM XL compiler, so if you're using Summit, for example, this is available to you there too.
So these are all available. Oh, I skipped a slide — sorry, that was while I was answering the Kokkos question. What I wanted to say is that I mentioned there were limitations in the C++ approach, and a big limitation is that a lot of STL-type objects are not available on GPUs. For example, there is no implementation of std::vector on GPUs; that does not exist.
It's a very complicated piece of code and it does not map well to GPU parallelism, so you cannot write CUDA C device code and then use std::vector inside it. That is a limitation that we have to come up with clever ways to deal with. One of the approaches for this is Thrust. Thrust is a library developed by NVIDIA which allows you to write STL-like algorithms on GPUs, and it is typically a host-centric approach.
So you do something like this in your CPU code: you create a vector — there are APIs for host_vector, which means on the CPU, and device_vector, which means on the GPU — so this creates and allocates memory in both places. We can then do standard, STL-like operations by getting iterators to the beginning and end of a list or a vector, and then we can do operations like sorting the list or copying arrays, that sort of thing (a rough sketch is below).
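As a hedged sketch of this host-centric Thrust style — the saxpy-style transform is an assumed example, and the device lambda again needs nvcc's --extended-lambda flag:

```
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>

int main() {
    const int n = 1 << 20;
    const float a = 2.0f;

    // Host-side (CPU) storage.
    thrust::host_vector<float> h_x(n, 1.0f);
    thrust::host_vector<float> h_y(n, 2.0f);

    // Assigning to a device_vector allocates GPU memory and copies the data.
    thrust::device_vector<float> d_x = h_x;
    thrust::device_vector<float> d_y = h_y;

    // STL-style algorithm over iterator ranges: y = a*x + y, run on the GPU.
    thrust::transform(d_x.begin(), d_x.end(), d_y.begin(), d_y.begin(),
                      [=] __device__ (float x, float y) { return a * x + y; });

    // Copy the result back to the host.
    h_y = d_y;
    return 0;
}
```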
So often this is the way to go if you have a very STL-heavy code that you just want to get started on GPUs. In some cases this will not solve every problem, and I have seen plenty of cases in scientific computing codes where this ended up not being a great approach, and they really had to rework it to look more C-like. So I can't make promises there.
So here are six different programming models that help you do this. You've seen most of these already, but I think this helps crystallize what the different approaches available to you are. Again, I've already described saxpy — single-precision a·x plus y — and this is again just: how do we do this on GPUs in different ways? So, with OpenACC,
this is notional code — it's not our fully working example — but it shows you what you need to do: you take your serial C code, your saxpy fragment, and put #pragma acc kernels on top of the loop, or in Fortran you use !$acc kernels and !$acc end kernels, and then the compiler figures out how to do the work for you. That's version one.
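A minimal sketch of that directive version in C, assuming an OpenACC compiler such as nvc with -acc (the data clauses are added here because the arrays come in as pointers):

```
// The directive asks the compiler to generate GPU code for the loop;
// how the iterations map to threads is the compiler's decision.
void saxpy(int n, float a, const float *x, float *y) {
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```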
The second option was cuBLAS. This is the library approach: you basically create vectors through cuBLAS, allocate the memory, copy the data from the CPU to the GPU, do the work on the GPU, and then copy the data back. So again, the same result — we started on the CPU, allocated memory, copied to the GPU, did the work, and copied it back — but using a library.
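A rough sketch of that library sequence with cuBLAS — error checking omitted; this illustrates the allocate/copy/compute/copy-back pattern, not the exact slide code:

```
#include <cuda_runtime.h>
#include <cublas_v2.h>

void saxpy_cublas(int n, float a, const float* x, float* y) {
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));   // allocate on the GPU
    cudaMalloc(&d_y, n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Copy input vectors host -> device.
    cublasSetVector(n, sizeof(float), x, 1, d_x, 1);
    cublasSetVector(n, sizeof(float), y, 1, d_y, 1);

    // y = a*x + y, computed on the GPU by the library.
    cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);

    // Copy the result device -> host.
    cublasGetVector(n, sizeof(float), d_y, 1, y, 1);

    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
}
```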
Now, in CUDA C, instead of the directive-based programming model, we are explicitly targeting the parallelism to the work: we take responsibility, and we say this thread does this piece of work. And that's probably not something that's particularly familiar to you
if you're used to OpenMP for CPU threading. With OpenMP, as with OpenACC, the idea is that the compiler does that work for you; you don't have to explicitly map the work. So CUDA requires a little bit more buy-in from you as the programmer, as a developer, but it also gives you much more flexibility:
you can say exactly what piece of work you want to do for each thread. And although I think there's probably, you know, community lore out there that CUDA is hard, or that you shouldn't use CUDA, I don't mean to imply that. I think that CUDA is very straightforward to learn — especially modern CUDA; be wary of Stack Overflow posts from 2010, because the language has evolved quite a bit since then — and so I think this is actually pretty straightforward. But also keep in mind that performance portability may matter to you.
That may force you to make some choices. I wouldn't say that CUDA is not portable; I think you can see this in the case of Frontier, where AMD's HIP implementation looks a lot like CUDA, and so one of the things that I recommend is that if you want to prepare for Frontier, do well on Summit, because the architectures look very similar. So CUDA is an example where there is some convergence, I think, in how these programming models are being exposed by different vendors.
So this is the Thrust approach that I described: you create a host vector and a device vector, and you can do some operation — in this case we're using thrust::transform — to loop from the beginning to the end of the arrays and then do some operation, like two times the first one plus the second one.
These are the types of operations that you can do with this STL-like approach. CUDA Fortran, as I said, is very similar to CUDA C but Fortran-like: you again identify which threads you want to do which work, you launch the kernel using this syntax, and then you can do very nice things that are Fortran-like, like basically setting an array to a value; and if this is an array that's on the GPU, all of the work is done
under the hood of figuring out how to actually do that sort of copy, so the compiler knows how to interpret that operation. So you can do array assignments like you would in standard Fortran. And then, in Python: this is the first Python example I've given, where I use Numba. Numba is a library for doing parallelization of loops or, more commonly, universal-function-type approaches in Python.
A very Pythonic way to write a saxpy operation — or one way to write it, maybe not the most Pythonic way — is that you define your function where x and y are NumPy arrays, and then you just do an implied loop over the elements: you return a times x plus y, and you call that in Python. Numba's bread and butter is universal functions.
So this decorator, @vectorize, which comes from the Numba package, says: I want a vectorized implementation of this code, and then all the work of figuring out how to parallelize or vectorize it is done by the Python runtime — by the Numba runtime. The syntax is that you have to give it the output type that it returns and then the inputs, so for these objects you have to specify the data types. That's important, because GPUs don't handle arbitrary data types well; as I mentioned, they're primarily good at numerical data.
We can all find easy performance bottlenecks that help the CPU code as well as the GPU code. Of course, I always get a little bit sad when I do that, because then the GPU doesn't look as good — you know, it's hard for me to get paid — but, more seriously, that's a big benefit: it gives you a fresh eye on your code.
It was not at all where they expected, and so these types of operations — this workflow of profiling a code, finding the bottleneck, and then making it faster — is very generic, and I think one nice thing about the change to accelerated computing is that it forces you to do that: it forces you to take a fresh eye to your code, thinking very carefully about where the parallelism is in your code and then how to target your work to that parallelism. It's not all roses.
For something simple, you know, it's hard to get that wrong, but there are more complicated algorithms where the most efficient way to write it on GPUs may not be the same as the most efficient way to write it on CPUs, and, you know, compiler directives can't solve that problem for you, because it may be implicit in the algorithm that you wrote, or in the way that you wrote the thing you're trying to do. So I don't want to pretend that there's nothing to do here, but at the same time, this is also why it's fun to be
a scientific computing developer: you get an opportunity to think about how to apply your work and take advantage of modern architectures. So the story that I want to tell you is that it's easy to get onto GPUs and, with a little bit of care, do very well on them. It takes more effort to get that peak performance, and sometimes it's not even achievable for your algorithm, but good performance is.
So IBM has an implementation, and Clang has an implementation, of OpenMP offload that is not written by NVIDIA but nevertheless takes advantage of the fact that our tools and our APIs can be used to implement those things on the GPUs. So that's basically what I wanted to describe: this is a platform that is continuing to expand and get better. And it occurs to me that I've never actually introduced myself.
So, sorry about that — better late than never. I'm Max Katz, and the reason I'm talking to you is, of course, that my job is as an NVIDIA Solutions Architect, so I work with developers to help them understand the NVIDIA platform and become more successful. In particular, I'm the Solutions Architect for the Department of Energy, and so my job is really to work closely with you, at labs like yours, to make your work successful, and so you should absolutely feel free to reach out to me.
But what he's asking about is that there is a hierarchy of parallelism on GPUs, and I think this is true on every GPU platform — for example, this is very publicly documented for AMD's GPUs — and the reason that there's a hierarchy of parallelism is that modern chips are very hard to make. It is hard to make a monolithic chip with 5,000 threads; that is a very hard thing to do.
You get very low yield as a processor manufacturer, and so the way that we GPU implementers do it is that we make smaller units — which for NVIDIA are called streaming multiprocessors — that we can then tile across a die. Essentially, we take one unit, one fundamental compute unit — the SM, or streaming multiprocessor — and put a bunch of them on the GPU, and then they can coordinate to do parallel work. And the number of SMs, or streaming multiprocessors,
on the device essentially determines the compute power of the device. So our lower-end GPUs, like the gaming GPUs, typically have fewer of them — they have less total compute power, less raw compute power — and the bigger GPUs have more of these multiprocessors; it is essentially the same architecture, but with fewer or more of them, and that determines how much compute capability is available to you.
So having this hierarchy of parallelism makes it easier to build a chip that has a massive amount of parallelism, and it plays into the memory structure too: each multiprocessor has an L1 cache that is independent of the L1 cache of the other streaming multiprocessors, and so threads can communicate with each other on that multiprocessor but cannot directly communicate with threads on the other parts of the chip. And so, in this syntax
syntax.
A
That
we
saw
before
in
CUDA
here
to
see
we
have
to
do
is
specify.
How
is
our
presence
tributed
across
those
two
levels
of
parallelism?
So
the
first
number
in
the
in
the
triple
chevron?
Syntax
is
the
number
of
teams,
or
groups
or
included,
speak
thread
blocks.
So
this
is
this:
has
teams
of
threads
or
groups
of
threads
that
do
work
in
concert
and
those
threads
are
targeted?
They
live
on
a
particular
multiprocessor,
a
particular
SM,
and
then
the
second
number
is
how
many
threads
heard
that
for
that
team
or
a
group
right.
This concept is also exposed in OpenACC, in the concept of gangs, and in OpenMP, in the concept of teams: there are groups of threads, and then that second number is the number of threads in each team or group — in CUDA speak, a thread block — and the total number of threads that you can run in a group is 1024 on an NVIDIA GPU. That's the limit at that lower level of parallelism.
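As a small, assumed sketch of that two-level launch configuration, reusing the hypothetical saxpy kernel from the earlier sketch:

```
// Map n elements of work onto the two-level hierarchy: a grid of thread
// blocks (teams), each with up to 1024 threads on an NVIDIA GPU.
int n = 1 << 20;
int threadsPerBlock = 256;                                    // threads per team/block
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // number of teams/blocks

// First number: how many blocks (distributed across the SMs);
// second number: how many threads within each block.
saxpy<<<numBlocks, threadsPerBlock>>>(n, 2.0f, d_x, d_y);
```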
And then, if you want massive parallelism, you have to combine many groups of threads together; that's what the other number — the number of groups — gives you. So it requires a little bit more work for you to understand how to map your work to that two-level hierarchy, and that's one of the nice things about OpenACC: the compiler makes an informed choice, a heuristic guess, about how to map that work effectively.
So you don't have to do that, but you will probably, as a performance optimization, want to do it at some point, because you will get much better performance if your parallelism is mapped well to the architecture, and that is again a true statement across GPU implementations. It is not really possible, for the most part, to build a good GPU at acceptable yield any other way — well, Cerebras is an example of a vendor that has built one monolithic chip — but for most modern GPU implementations...