From YouTube: Coachella: 2014 Spring NuPIC Hackathon Demo
Description
Paulin Andurand, Antoine Chkaiban
A: Whatever you like; so the goal is to mix different types of data about the same concept. In our case it was words, and to feed them into the CLA to see if it would somehow improve the quality of the prediction.
A: Most of you are probably already familiar with it, but just a quick summary of Fluent. Basically, and maybe you can correct me, Subutai, if I'm wrong, Fluent takes a written word, feeds it into CEPT's retina, and that is then fed into the temporal pooler in order to predict.
A: So what Coachella, which is our project, does is add the spoken word: we feed the spoken word to something called CMU Sphinx. It's open-source voice recognition software; it recognizes words and then takes the sequence.
A
So
before
you
recognize
the
word,
you
actually
break
it
up
into
phonemes
and
feed
those
phonemes
into
the
temporal
pooler.
So
actually
we
kind
of
cheated,
and
we
went
over
this
cmu
sphinx
thing
because
we
didn't
have
the
time.
So
we
just
took
the
word
and
converted
it
to
phonemes
and
fed
to
the
temporal
pooler.
So
the
important
part
is
that
we're
not
feeding
the
data
at
the
same
at
the
same
time.
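The shortcut described above, mapping a written word straight to its phonemes instead of running real voice recognition, can be sketched roughly like this. This is an illustrative sketch, not the hackathon code: the pronunciation table is a tiny hardcoded stand-in for a CMUdict-style lookup, and `MAX_PHONEMES` reflects the talk's statement that only four phonemes per word were fed in.

```python
# Sketch of the "cheat": look a written word up in a pronunciation
# table instead of recognizing spoken audio with CMU Sphinx.
# These entries are hardcoded stand-ins for a CMUdict-style dictionary.
PRONUNCIATIONS = {
    "keyboard": ["K", "IY", "B", "AO", "R", "D"],
    "type":     ["T", "AY", "P"],
    "word":     ["W", "ER", "D"],
}

MAX_PHONEMES = 4  # the demo only fed four phonemes per word

def word_to_phonemes(word):
    """Return up to MAX_PHONEMES phonemes for a written word."""
    phones = PRONUNCIATIONS.get(word.lower(), [])
    return phones[:MAX_PHONEMES]

print(word_to_phonemes("keyboard"))  # ['K', 'IY', 'B', 'AO']
```

A real version would replace the table with the full CMU Pronouncing Dictionary, or with actual phoneme output from a recognizer.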
A: So, for example (we'll show a live demo after), what we're doing is taking a word at time t, and instead of only predicting the word at time t plus one based on this word, we're also feeding the phonemes from time t plus one. So you could look at it with this approach: okay, I'm speaking right now, and I'm going, for example, to type on my... so you could predict that I was going to say "keyboard", and the phonemes could help do that. We were just trying to see how that would actually work.
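The timing idea just described, presenting the word at time t together with the phonemes of the word at time t plus one, can be sketched as a simple record stream. The names and record layout here are illustrative assumptions, not the actual demo code:

```python
def make_records(sentence_words, word_to_phonemes):
    """For each time step t, pair the current word with the phonemes
    of the *next* word, so the model gets a phonetic hint about the
    word it is being asked to predict."""
    records = []
    for t, word in enumerate(sentence_words):
        nxt = sentence_words[t + 1] if t + 1 < len(sentence_words) else None
        hint = word_to_phonemes(nxt) if nxt else []
        records.append({"word": word, "next_phonemes": hint})
    return records

# Tiny stand-in pronunciation lookup for the "type on my keyboard" example.
phones = {"on": ["AA", "N"], "my": ["M", "AY"], "keyboard": ["K", "IY", "B", "AO"]}
recs = make_records(["on", "my", "keyboard"], lambda w: phones.get(w, []))
# recs[1] == {"word": "my", "next_phonemes": ["K", "IY", "B", "AO"]}
```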
A: So here, what you see is... let's maybe start with the text. We have basically a text that we're feeding into our model. Hold on, let me find you the text.
A: So it's just a dummy text that we're feeding to the model, and we're going to try to predict the next word. Every period separates a sequence from the next one, and every space separates a word from the next one.
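The parsing rule just described, periods separating sequences and spaces separating words, amounts to something like the following. This is a minimal sketch of that rule, not the demo's actual parser:

```python
def text_to_sequences(text):
    """Split a text into sequences at periods, and each sequence
    into words at whitespace, dropping empty chunks."""
    sequences = []
    for chunk in text.split("."):
        words = chunk.split()
        if words:
            sequences.append(words)
    return sequences

text_to_sequences("the cat sat. the dog ran.")
# [['the', 'cat', 'sat'], ['the', 'dog', 'ran']]
```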
A: So what we get is the current term, and the next term that we break up into phonemes, and then we see if the model predicts it. At the beginning the model has never seen any phoneme, so it doesn't know how to map them to words.
A: But after a while, you see that the model actually starts to make predictions; for example, here it got "big". It takes some time for the temporal pooler to correlate the phonemes with the actual word they're linked to. We're probably not going to have time to feed it now, and that's why we didn't do it over the weekend either: we didn't have time to feed a large amount of data into the model. But after a while, after like half an hour (we did that earlier), we started to get predictions that were quite accurate. Admittedly it works better for shorter words, I guess, because we're only feeding four phonemes into the model right now. So maybe I could just take your questions now, because I'm not going to get into the technical details of how many columns we added, except if you want to.
B: So you're feeding the word as well as all the phonemes contained within the word?
A: So it's like...
A: Maybe look at it this way, and you can correct me if I'm wrong. If you look at those two approaches, which are kind of complementary, breaking down an idea into sentences and then into words, and words into phonemes: what I think CEPT is doing here is the top part, down to the word. The sparse distributed representation done by CEPT, I think, encodes in its bits that when they overlap, two words could be interconnected because they belong to the same concept. On the other hand, we're completely missing the bottom-up approach, which is to actually take what's coming in at our senses, in the sensory cochlea, and convert it into higher sparse distributed representations up the hierarchy, up to the word. So I like to look at it that way; I don't know if I'm wrong about that.
A: Categories. So I'm taking four phonemes, but let's just say we're taking one phoneme for now. For this phoneme, I'm just taking 1,024 bits that I'm breaking up into segments, each of which represents one phoneme in my alphabet of phonemes, so they don't overlap; every phoneme is different.
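The encoding just described, 1,024 bits cut into equal, non-overlapping segments with one segment per phoneme in the alphabet, can be sketched as a simple category encoder. This is a sketch under those stated assumptions, not the hackathon source; the tiny alphabet below is only for illustration:

```python
def encode_phoneme(phoneme, alphabet, n_bits=1024):
    """Category-style encoding: the n_bits-wide vector is cut into
    len(alphabet) equal segments, and only the segment belonging to
    this phoneme is set to 1.  Because segments do not overlap,
    every phoneme's encoding is distinct from every other's."""
    seg = n_bits // len(alphabet)
    idx = alphabet.index(phoneme)
    bits = [0] * n_bits
    for i in range(idx * seg, (idx + 1) * seg):
        bits[i] = 1
    return bits

alphabet = ["AA", "AE", "AH", "AO"]  # tiny illustrative phoneme alphabet
a = encode_phoneme("AA", alphabet)
b = encode_phoneme("AE", alphabet)
# Non-overlapping: no bit is on in both encodings.
assert not any(x and y for x, y in zip(a, b))
```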
A: So Fluent takes 16,000 columns that are only connected to the input coming from the retina. Yes, and what we did is increase this number from 16,000 to 20,000: we're taking the first array and appending to it four thousand bits, well, actually four segments of one thousand twenty-four bits, each encoding one phoneme.
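The appending step just described, a 16,000-bit retina input plus four 1,024-bit phoneme segments giving roughly 20,000 input bits, can be sketched as follows. The sizes come from the talk; the function name and the zero-padding for short words are my own illustrative assumptions:

```python
RETINA_BITS = 16000   # word SDR coming from the CEPT retina
PHONEME_BITS = 1024   # one category-encoded phoneme
N_PHONEMES = 4        # the demo fed four phonemes per word

def build_input(word_sdr, phoneme_sdrs):
    """Append four 1,024-bit phoneme encodings to the 16,000-bit
    word SDR, padding with empty segments when a word has fewer
    than four phonemes."""
    out = list(word_sdr)
    for i in range(N_PHONEMES):
        seg = phoneme_sdrs[i] if i < len(phoneme_sdrs) else [0] * PHONEME_BITS
        out.extend(seg)
    return out

word = [0] * RETINA_BITS
phones = [[1] * PHONEME_BITS for _ in range(3)]  # a 3-phoneme word
full = build_input(word, phones)
# 16,000 + 4 * 1,024 = 20,096 input bits, the "roughly 20,000" above
assert len(full) == 20096
```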
A: Four phonemes, plus... oh, plus the four phonemes representing the word. So it's as if you heard the first words in my sentence.
D: With the context, given the phonemes... you're given all the phonemes for the next word along with the current word; you can see it here.
A: So that's where we kind of cheat, because we shouldn't do that by directly mapping the word to its phonemes; we should use voice recognition. If we wanted to improve this hack, which we did in a couple of hours, into actual voice recognition software, we would have to take the spoken word and convert it to phonemes using maybe standard machine learning techniques, or just the spatial pooler, which is kind of similar, I guess, and try to map it to the context that CEPT is providing us.
E: Maybe another way to look at this, or maybe that's how you're thinking of it: you know, speech recognition, particularly in a noisy environment, is quite hard. The way speech recognition works is, given an audio signal, it converts it to phonemes, and then you try to map those to a word. Here what you're trying to do is also feed the past temporal context into disambiguating what that phoneme actually means. That's...
E: It's like that famous one, you know: can you "recognize speech", or can you "wreck a nice beach"? Once you know what the actual words were that were spoken, you can disambiguate what's going on, but if you just look at the phonemes, you can't tell exactly.