From YouTube: Coachella: 2014 Spring NuPIC Hackathon Demo
Description
Paulin Andurand, Antoine Chkaiban
A: Whatever you like; so the goal is to mix different types of data about the same concept. In our case it was words, and to feed them into the CLA to see if it would somehow improve the quality of the prediction.
A: Most of you are probably already familiar with it, but just a quick summary of Fluent. Basically, and maybe you can correct me, Subutai, if I'm wrong, Fluent takes a written word, feeds it into CEPT's retina, and that is then fed into the temporal pooler in order to predict.
A: So what Coachella, which is our project, does is add the spoken word: we feed the spoken word to something called CMU Sphinx. It's open-source voice recognition software; it recognizes words and then takes the sequence.
A
So
before
you
recognize
the
word,
you
actually
break
it
up
into
phonemes
and
feed
those
phonemes
into
the
temporal
pooler.
So
actually
we
kind
of
cheated,
and
we
went
over
this
cmu
sphinx
thing
because
we
didn't
have
the
time.
So
we
just
took
the
word
and
converted
it
to
phonemes
and
fed
to
the
temporal
pooler.
So
the
important
part
is
that
we're
not
feeding
the
data
at
the
same
at
the
same
time.
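The shortcut described above, mapping a written word straight to its phonemes instead of running real voice recognition, can be sketched roughly like this. This is an illustrative sketch, not the hackathon code: the pronunciation table is a tiny hardcoded stand-in for a CMUdict-style lookup, and `MAX_PHONEMES` reflects the talk's statement that only four phonemes per word were fed in.

```python
# Sketch of the "cheat": look a written word up in a pronunciation
# table instead of recognizing spoken audio with CMU Sphinx.
# These entries are hardcoded stand-ins for a CMUdict-style dictionary.
PRONUNCIATIONS = {
    "keyboard": ["K", "IY", "B", "AO", "R", "D"],
    "type":     ["T", "AY", "P"],
    "word":     ["W", "ER", "D"],
}

MAX_PHONEMES = 4  # the demo only fed four phonemes per word

def word_to_phonemes(word):
    """Return up to MAX_PHONEMES phonemes for a written word."""
    phones = PRONUNCIATIONS.get(word.lower(), [])
    return phones[:MAX_PHONEMES]

print(word_to_phonemes("keyboard"))  # ['K', 'IY', 'B', 'AO']
```

A real version would replace the table with the full CMU Pronouncing Dictionary, or with actual phoneme output from a recognizer.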
A: So, for example (we'll show a live demo after), what we're doing is taking a word at time t, and instead of only predicting the word at time t plus one based on this word, we're also feeding the phonemes from time t plus one. So you could look at it with this approach: okay, I'm speaking right now, and I'm going, for example, to type on my... so you could predict that I was going to say "keyboard", and the phonemes could help do that. We were just trying to see how that would actually work.
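The timing idea just described, presenting the word at time t together with the phonemes of the word at time t plus one, can be sketched as a simple record stream. The names and record layout here are illustrative assumptions, not the actual demo code:

```python
def make_records(sentence_words, word_to_phonemes):
    """For each time step t, pair the current word with the phonemes
    of the *next* word, so the model gets a phonetic hint about the
    word it is being asked to predict."""
    records = []
    for t, word in enumerate(sentence_words):
        nxt = sentence_words[t + 1] if t + 1 < len(sentence_words) else None
        hint = word_to_phonemes(nxt) if nxt else []
        records.append({"word": word, "next_phonemes": hint})
    return records

# Tiny stand-in pronunciation lookup for the "type on my keyboard" example.
phones = {"on": ["AA", "N"], "my": ["M", "AY"], "keyboard": ["K", "IY", "B", "AO"]}
recs = make_records(["on", "my", "keyboard"], lambda w: phones.get(w, []))
# recs[1] == {"word": "my", "next_phonemes": ["K", "IY", "B", "AO"]}
```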
A: So here, what you see is... let's maybe start with the text. We have basically a text that we're feeding into our model. Hold on, let me find you the text.
A: So it's just a dummy text that we're feeding to the model, and we're going to try to predict the next word. Every period separates a sequence from the next one, and every space separates a word from the next one.
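The parsing rule just described, periods separating sequences and spaces separating words, amounts to something like the following. This is a minimal sketch of that rule, not the demo's actual parser:

```python
def text_to_sequences(text):
    """Split a text into sequences at periods, and each sequence
    into words at whitespace, dropping empty chunks."""
    sequences = []
    for chunk in text.split("."):
        words = chunk.split()
        if words:
            sequences.append(words)
    return sequences

text_to_sequences("the cat sat. the dog ran.")
# [['the', 'cat', 'sat'], ['the', 'dog', 'ran']]
```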
A: So what we get is the current term, and the next term that we break up into phonemes, and then we see if the model predicts it. At the beginning the model has never seen any phoneme, so it doesn't know how to map them to words.
A: But after a while, you see that the model actually starts to make predictions; for example, here it got "big". It takes some time for the temporal pooler to correlate the phonemes with the actual word they're linked to. We're probably not going to have time to feed it now, and that's why we didn't do it over the weekend either: we didn't have time to feed a large amount of data into the model. But after a while, after like half an hour (we did that earlier), we started to get predictions that were quite accurate. Admittedly it works better for shorter words, I guess, because we're only feeding four phonemes into the model right now. So maybe I could just take your questions now, because I'm not going to get into the technical details of how many columns we added, except if you want to.
B: So you're feeding the word as well as all the phonemes contained within the word?
A: So it's like...
A: Maybe look at it this way, and you can correct me if I'm wrong. If you look at those two approaches, which are kind of complementary, breaking down an idea into sentences and then into words, and words into phonemes: what I think CEPT is doing here is the top part, down to the word. The sparse distributed representation done by CEPT, I think, encodes in its bits that when they overlap, two words could be interconnected because they belong to the same concept. On the other hand, we're completely missing the bottom-up approach, which is to actually take what's coming in at our senses, in the sensory cochlea, and convert it into higher sparse distributed representations up the hierarchy, up to the word. So I like to look at it that way; I don't know if I'm wrong about that.
A: Categories. So I'm taking four phonemes, but let's just say we're taking one phoneme for now. For this phoneme, I'm just taking 1,024 bits that I'm breaking up into segments, each of which represents one phoneme in my alphabet of phonemes, so they don't overlap; every phoneme is different.
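The encoding just described, 1,024 bits cut into equal, non-overlapping segments with one segment per phoneme in the alphabet, can be sketched as a simple category encoder. This is a sketch under those stated assumptions, not the hackathon source; the tiny alphabet below is only for illustration:

```python
def encode_phoneme(phoneme, alphabet, n_bits=1024):
    """Category-style encoding: the n_bits-wide vector is cut into
    len(alphabet) equal segments, and only the segment belonging to
    this phoneme is set to 1.  Because segments do not overlap,
    every phoneme's encoding is distinct from every other's."""
    seg = n_bits // len(alphabet)
    idx = alphabet.index(phoneme)
    bits = [0] * n_bits
    for i in range(idx * seg, (idx + 1) * seg):
        bits[i] = 1
    return bits

alphabet = ["AA", "AE", "AH", "AO"]  # tiny illustrative phoneme alphabet
a = encode_phoneme("AA", alphabet)
b = encode_phoneme("AE", alphabet)
# Non-overlapping: no bit is on in both encodings.
assert not any(x and y for x, y in zip(a, b))
```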
A: So Fluent takes 16,000 columns that are only connected to the input coming from the retina. Yes, and what we did is increase this number from 16,000 to 20,000: we're taking the first array and appending to it four thousand bits, well, actually four segments of one thousand twenty-four bits, each encoding one phoneme.
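The appending step just described, a 16,000-bit retina input plus four 1,024-bit phoneme segments giving roughly 20,000 input bits, can be sketched as follows. The sizes come from the talk; the function name and the zero-padding for short words are my own illustrative assumptions:

```python
RETINA_BITS = 16000   # word SDR coming from the CEPT retina
PHONEME_BITS = 1024   # one category-encoded phoneme
N_PHONEMES = 4        # the demo fed four phonemes per word

def build_input(word_sdr, phoneme_sdrs):
    """Append four 1,024-bit phoneme encodings to the 16,000-bit
    word SDR, padding with empty segments when a word has fewer
    than four phonemes."""
    out = list(word_sdr)
    for i in range(N_PHONEMES):
        seg = phoneme_sdrs[i] if i < len(phoneme_sdrs) else [0] * PHONEME_BITS
        out.extend(seg)
    return out

word = [0] * RETINA_BITS
phones = [[1] * PHONEME_BITS for _ in range(3)]  # a 3-phoneme word
full = build_input(word, phones)
# 16,000 + 4 * 1,024 = 20,096 input bits, the "roughly 20,000" above
assert len(full) == 20096
```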
A: Four phonemes, plus... oh, plus the four phonemes representing the word. So it's as if you heard the first words in my sentence.
D: With the context, given the phonemes... you're given all the phonemes for the next word along with the current word; you can see it here.
A: So that's where we kind of cheat, because we shouldn't do that by directly mapping the word to its phonemes; we should use voice recognition. If we wanted to improve this hack, which we did in a couple of hours, into actual voice recognition software, we would have to take the spoken word and convert it to phonemes using maybe standard machine learning techniques, or just the spatial pooler, which is kind of similar, I guess, and try to map it to the context that CEPT is providing us.
E: Maybe another way to look at this, or maybe that's how you're thinking of it: you know, speech recognition, particularly in a noisy environment, is quite hard. The way speech recognition works is, given an audio signal, it converts it to phonemes, and then you try to map those to a word. Here what you're trying to do is also feed the past temporal context into disambiguating what that phoneme actually means. That's...
E: It's like that famous one, you know: can you "recognize speech", or can you "wreck a nice beach"? Once you know what the actual words were that were spoken, you can disambiguate what's going on, but if you just look at the phonemes, you can't tell exactly.