From YouTube: Say What?: 2014 Spring NuPIC Hackathon Demo
Description
Matt Roos, Will Gray Roncal, Dean Kleissas
A (Will): Okay, hey guys, so my name's Will, this is Matt and Dean, and we're going to walk you through our demo real quick. I know we're demo 17 or something and it's 6:30, so we'll just give you the highlights here. What we wanted to do is look at speech data. So we took some of the TIMIT corpus and tried a bunch of different experiments. We looked at a single-channel version using pitch to predict male versus female (turns out we could do that), and then we looked at some things that were harder, with multi-channel versions, looking at things like speaker ID and sentence recognition tasks. One of the things we were interested in was exploring different representations of the audio signal: trying to go from a raw waveform to looking at pitch, or the cepstral coefficients, or some audio-spectrogram kind of information. For those of you who don't know TIMIT, I wanted to just play a quick example. Generally these are fairly low-noise audio examples, and there's a group of speakers; they each say two phrases that are common to everyone, and then they say a bunch of different phrases. So this is one of the guys with one of the phrases.

[audio example plays]
A: No, that's nice, just in time. Thank you. Okay, so I just want to show you quickly the results of our pitch prediction. In the top subplot (and we've changed our colors, so I apologize for that) you'll see the results of a pitch-prediction exercise. What we did was frame this as an anomaly detection task, because that was something where we understood right away how to set up the problem. The setup for this experiment is that we trained with multiple men, including the two we showed you before, across a wide range of utterances in the corpus. So all the training was men, and then the test was both men and women. What you can see is that pitch prediction worked pretty well for both men and women. We used fairly narrow time windows, so the model typically predicted something near to the current value. Then we looked at our anomaly scores, and we thought we saw something in the signal: the women were our anomaly class, and we thought we saw higher anomaly values for the women. So we built a really simple binning algorithm that we called the "female finder," and we found that indeed we saw a response for most of the cases where we had a female speaker.
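The talk doesn't show the "female finder" itself; a minimal sketch of the kind of binning described (average the per-sample anomaly scores over fixed-size bins and flag bins whose mean is high) might look like the following. The bin size and threshold here are illustrative guesses, not the team's values.

```python
import numpy as np

def female_finder(anomaly_scores, bin_size=50, threshold=0.5):
    """Flag bins of anomaly scores whose mean is high.

    anomaly_scores: one score per 10 ms pitch sample, as in the talk.
    bin_size and threshold are illustrative guesses.
    """
    scores = np.asarray(anomaly_scores, dtype=float)
    n_bins = len(scores) // bin_size
    # Average within each bin of consecutive scores.
    binned = scores[:n_bins * bin_size].reshape(n_bins, bin_size).mean(axis=1)
    # High-anomaly bins are flagged as the "female" (anomaly) class,
    # since training used only male speakers.
    return binned > threshold
```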
A: So these were sentences... oh, I'm sorry, right. Yes, so the input was pitch values derived from one of these sentences, sampled... how often? We wound up using 10-millisecond windows. Just pitch values, that's it. And so I'm going to let Dean do the next slide.
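For reference, the single-field setup described here (one scalar pitch value per 10 ms window, scored as a temporal anomaly) follows NuPIC's standard OPF pattern. This is a minimal sketch, not the team's actual code: MODEL_PARAMS and pitch_values are assumed inputs, and the module path varies across NuPIC versions.

```python
from nupic.frameworks.opf.modelfactory import ModelFactory

# MODEL_PARAMS: an OPF parameter dict with a scalar encoder for the
# "pitch" field and inferenceType "TemporalAnomaly" (assumed; e.g.
# adapted from NuPIC's hotgym anomaly example).
model = ModelFactory.create(MODEL_PARAMS)
model.enableInference({"predictedField": "pitch"})

anomaly_scores = []
for pitch in pitch_values:  # one pitch estimate per 10 ms window
    result = model.run({"pitch": pitch})
    anomaly_scores.append(result.inferences["anomalyScore"])
```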
D (Dean): Yeah. And so we did that because of how we framed it: we had never used NuPIC before we came today, so the first thing we figured was, let's keep it to a single input, kind of like the example. So we started with pitch, and then we tried to get more complicated and add multiple channels to the data. This was an example where we did speaker ID.
D: So in this case, in the first block we trained the CLA with just one speaker: a bunch of different utterances, random things the person was saying, one male. Then we tested with a chunk of that speaker again saying something, plus a bunch of other random males saying different things, again using this framework of treating it as an anomaly detection problem, not really doing prediction. And you can see that the CLA still kind of had a hard time predicting, but it's still separable in terms of telling a speaker it had at least heard before from a speaker it had never heard before. And again, we had some trouble (we kind of understand it better now), but we didn't swarm for this, because we didn't really understand how you could take...
B: [inaudible]

D: We just threw in 13 [channels] and used some guesses on parameters, and, you know, tuned. Basically, for all of our multi-channel stuff, the only thing I set for the encoders was the range, properly, based on the data, and then I kind of guessed and tried a couple of different bin sizes and bit sizes and whatnot.
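Setting the encoder ranges from the data, as described, is easy to do programmatically. Here is a sketch under the assumption that the 13 channels are MFCC-like scalar fields in an array named mfcc; the n and w "bit sizes" are illustrative guesses, matching the trial-and-error just described, and the dict layout follows NuPIC's OPF encoder spec.

```python
import numpy as np

# mfcc: array of shape (n_frames, 13), one row per time step (assumed).
encoders = {}
for i in range(mfcc.shape[1]):
    name = "c%d" % i
    encoders[name] = {
        "type": "ScalarEncoder",
        "fieldname": name,
        "name": name,
        # Range set from the data, with a small margin so that
        # test values stay in range.
        "minval": float(mfcc[:, i].min()) - 1.0,
        "maxval": float(mfcc[:, i].max()) + 1.0,
        "n": 134,  # total bits: a guess, as in the talk
        "w": 21,   # active bits: a guess
        "clipInput": True,
    }
# "encoders" then goes into the OPF model params under
# modelParams.sensorParams.encoders.
```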
A: [inaudible]

D: Right, a filtered median: that's what this bottom plot is. We just set a threshold there at 0.9 and said, you know, any anomaly score above 0.9 we're going to say is not that speaker. That's what that bottom output is.
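A minimal version of that post-processing step (median-filter the raw anomaly scores, then call anything above 0.9 "not the trained speaker") could look like this; the kernel width is an assumption, while the 0.9 threshold is the value quoted in the talk.

```python
import numpy as np
from scipy.signal import medfilt

def not_trained_speaker(anomaly_scores, kernel=9, threshold=0.9):
    """Median-filter anomaly scores, then threshold at 0.9."""
    smoothed = medfilt(np.asarray(anomaly_scores, dtype=float), kernel)
    return smoothed > threshold  # True means "not that speaker"
```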
D: So this is... yes, so this is an error, and this is an error, but this whole stretch, from here to the end of the test, was different speakers. Oh...
G: It got confused at the start and wasn't sure, and then (gotcha) it was sure by the time you got to the end. There are two ways of getting that kind of a graph, and one of them is that at the beginning it's unsure and its guesses get better. But if you put the correct speaker at the end, then that would eliminate that as a possibility.
H: Yeah, let's look at that. So I have two related questions. That green period was already with the learning turned off, correct? (Correct.) And did you use the same utterances from that person that you trained it on, or different ones?
D: [inaudible] ...the post-processing is the key.

F (Matt): Okay, so we were also interested in yet another sort of representation. You saw just simple pitch information, and that last one was taken from mel-frequency cepstral coefficients, for those who are familiar with speech processing.
F: What I've shown you here is the same sentence in three different spectral representations. The top one has a linear frequency axis along the y-axis. The middle one uses a logarithmic scaling, which is a little closer to how human perception works. And the bottom one is what some people might call a cochlear spectrogram: a more biomimetic version, kind of like the transform your inner ear, your cochlea, actually applies.
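The first two representations are straightforward to reproduce; below is a sketch using SciPy, with the log-frequency panel obtained by resampling the linear-frequency spectrogram onto a log-spaced grid. The cochlear version would substitute a gammatone-style filterbank, which is omitted here; x and fs are assumed inputs.

```python
import numpy as np
from scipy.signal import spectrogram

# x: mono audio samples, fs: sample rate in Hz (assumed given).
f, t, S = spectrogram(x, fs=fs, nperseg=512, noverlap=384)
linear_db = 10 * np.log10(S + 1e-12)  # linear frequency axis (top panel)

# Log-frequency version (middle panel): interpolate each time slice
# onto a logarithmically spaced frequency grid.
f_log = np.logspace(np.log10(f[1]), np.log10(f[-1]), num=len(f))
log_db = np.vstack([np.interp(f_log, f, linear_db[:, j])
                    for j in range(linear_db.shape[1])]).T
```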
F: So we took that cochlear spectrogram representation, and we wanted to try what I'm going to call sentence spotting. As was already stated, two of the sentences are repeated by every speaker, so we have a good chunk of data to work with. The idea was to train the CLA on just that sentence and then turn off the learning: does it consider pretty much any other sentence to be an anomaly? That's what I'm showing in these two figures.
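That train-then-freeze pattern maps directly onto the OPF API: run the trained sentence through the model with learning on, call disableLearning(), then score everything else. A sketch, under the same assumptions as the earlier pitch example (the model's encoders are presumed to cover the spectrogram channels, and the record dicts are hypothetical):

```python
# model: an OPF model whose encoders cover the spectrogram channels
# (assumed, as in the earlier sketches).
for record in training_sentence_frames:  # the repeated trained sentence
    model.run(record)

model.disableLearning()  # freeze the model, as described in the talk

# Any other sentence should now score as anomalous.
test_scores = [model.run(record).inferences["anomalyScore"]
               for record in test_frames]
```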
F: Let me show you a movie here. The one on the left is the regular cochlear spectrogram, but there's really too much data there, so I did a high-pass filter across the frequency axis (sorry, the resolution here is really poor) and then a decimation, and likewise across the time domain. Now I'm going to hit play, and it's going to show you what these spectrograms look like for 100 male speakers. You'll see that this varies, of course, but not drastically.
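That reduction step (a high-pass filter and decimation along the frequency axis, and likewise along time) is simple with NumPy/SciPy. A sketch; the difference filter and the decimation factors are assumptions, not the values used in the demo.

```python
import numpy as np
from scipy.signal import decimate

# S: cochlear spectrogram, shape (n_freq, n_time) (assumed).
hp = np.diff(S, axis=0)            # crude high-pass across frequency
hp = decimate(hp, 4, axis=0)       # decimate along the frequency axis
reduced = decimate(hp, 4, axis=1)  # and likewise along the time axis
```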
F: So, of course, you see jitters in time and things like that, and presumably any failure might be attributed to that. But compare that to this: this is our test corpus, which contains some of the training sentences spoken by other speakers, as well as many other different sentences altogether.
B: [inaudible]
F: On the right side, but all spoken by speakers that it was not trained on. So, to be anticlimactic, we didn't really get this to work very well, but that may be somewhat due to our all being novices at this. Dean, do you want to say more about what you tried here?
D: Yeah, I mean, the big thing is we didn't swarm properly, so we probably had lots of problems there. And also, I don't know if we were really capturing... you could see in that last video that people talk at different speeds with different pauses, and it really gives you this timing variability. I don't know if we were really able to capture that time invariance.
F: So the hope was that, just as the speaker ID somehow magically sort of worked, the CLA sequence learning would capture some of the regularity of the spoken sentence. But clearly not all of it, of course.
F: So I don't really need to read these off. You know, we're new to NuPIC, but it was very educational. Clearly we saw that swarming, really learning your parameter space, is really important. That video may not have gotten it across, but if you look at it on my screen, running at regular speed, you'll see that those are very consistent sentences. So given all those channels, you should be able to build something that would probably classify these things, and I think, given time, we could make it happen. All right, thank you.