From YouTube: 11 - Sequential Models - Luke de Oliveira
Description
Deep Learning for Science School 2019 - Lawrence Berkeley National Lab
Agenda and talk slides are available at: https://dl4sci-school.lbl.gov/agenda
And hopefully you'll stay through the rest of the week. It's my great pleasure to introduce Luke, who is a familiar face here at Berkeley Lab; he's worked with several people here in the physics division. He currently leads AI and engineering — well, he was at Vai Technologies, which was then acquired by Twilio. He's a mathematician from Yale, with a master's from Stanford, and he's also published a ton of work in generative modeling. Today he's going to talk about sequential models. Thank you.
So, what we're going to talk about today — my goal in today's lecture is pretty high-level. I want to set up sequence learning in the general sense. There are lots of tutorials flying around on Medium, random Jupyter notebooks you'll find; let's try to set up what sequence learning is from a more systematic perspective. Then I'm going to give a taxonomy of sequential models.
So: how can we set up the inputs and outputs for the different types of sequential models that you'll encounter, either in papers or in the different problem environments you're going to want to work in? Then, finally, we'll get into the deep learning building blocks that constitute these individual components. This is going to be kind of a whirlwind of information, so my goal at the end of the day is to give you the keywords to Google later, or to recognize when you read a paper.
B
You
know
kind
of
you
can
kind
of
tie
the
thread
together
to
what
we
learned
about
in
lecture
today.
So
I
don't
expect
people
to
kind
of
leave
knowing
how
to
go:
train
the
fanciest
sequence,
a
sequence
model
of
attention,
but
just
kind
of
you've
heard
all
of
these
terms:
kind
of
stitch
them
together
and
start
to
build
a
framework
for
understanding
sequence
models
so
particularly
were
gonna
go
through
our
ends,
I'm
going
to
go
through
cnn's
and
we're
gonna
go
through
transformers
at
a
super
high
level.
B
There's
a
lot
of
kind
of
details
and
nitty
gritty
that
goes
into
a
lot
of
these
models.
We're
gonna
cover
them
from
a
little
bit
of
a
higher
level
perspective,
guessing
the
tuition
all
towards
the
goal
of
having
a
good
way
to
grok
sequence,
learning
and
taking
models.
So,
let's
start
with
the
basics
of
sequence
learning,
so
the
typical
I'm
gonna
put
this
in
parentheses,
supervised
learning
setup
that
we
oftentimes
have
is
a
fixed
vector
input
and
a
fixed
vector
output.
B
So
this
is
I
have
readings
from
a
sensor
or
multiple
sensors
or
some
sort
of
sensor
array.
I'm
measuring
attributes
about
my
system,
if
I'm
working
in
sociology
and
maybe
I'm,
measuring
like
age,
race,
sex
income
level,
things
like
that
I
have
a
fixed
set
of
observables
about
my
system
and
then
I
want
to
predict
out
a
fixed
set
of
outcomes
so
signal
or
background
whether
or
not
machine
is
gonna
fail
in
the
next
t.
Time
periods
very
fixed.
B
Okay,
we
have
some
fixed
domain,
some
fixed
Co
domain,
and
we
want
to
learn
in
the
typical
setup.
As
discussed
in
the
earlier
lectures
within
the
summer
school.
We
want
to
learn
a
good
function
that
map's
our
inputs
to
our
outputs
matching
the
state
of
the
world.
So
we
have
some
F,
that's
actually
governing
this
process
going
from
domain
to
Co
domain
and
we
want
to
learn
a
good
mapping
for
that
turns
out.
Deep
neural
networks
tend
to
be
quite
good
at
this
problem.
B
We
learn
this
through
data,
but
this
doesn't
really
extend
to
this
eventual
case
so
where
this
is
fall
apart,
this
falls
apart
when
we
have
what
I
generally
like
to
call
very
attic
size,
so
we're
dealing
with
features
or
an
input
space
that
isn't
constrained
to
a
fixed
dimension.
How
do
we?
Where
do
we
encounter
these
things?
You
encounter
these
a
lot
when
you're
dealing
with
time
series
when
you're
dealing
with
kind
of
general
collections
of
objects,
we
can
measure
any
number
K
observables
about
our
system
on
each
of
these
we
deem
to
be
useful.
B
This
is
a
collection
of
things
it
doesn't
fit
into
a
fixed.
This
doesn't
fit
into
a
fixed,
dimensional,
vector
another
place
where
you
end
up
finding
this
a
lot
of
times
is
an
unbounded
spatial
domains.
So
normally,
when
you're
dealing
with
images,
you've
got
your
32
by
32
128
by
128,
it's
fixed
or
1024.
If
you're
worried,
you're
fixed
there,
you
aren't
dealing
with
kind
of
very
attic
sizes
but
oftentimes
in
a
lot
of
applications.
Your
images
are
going
to
be
coming
in
in
various
sizes.
B
Your
volumetric
measurements
are
coming
various
sizes
from
different
sensor
series
in
when
you're
measuring
some
fisher
process
and
like
a
groundswell
or
something
so
when
you
have
very
etic
size,
a
lot
of
the
standard
techniques
that
you
use
tend
to
fall
apart.
So
what
often
times
ends
up
happening?
Is
you
end
up
building
models
with
what
I'm
gonna
call
summary
features
or
reductions?
B
What
do
we
call
sequence
learning
so
I
defined
sequence,
learning
as
a
problem
domain,
where
at
least
one
of
the
input
or
the
output
is
of
a
sequential
or
very
etic
size
nature,
and
the
other
key
thing
that
we're
going
to
throw
into
this,
especially
for
today,
is
what
I
call
a
naturally
ordered
sequence.
So
we
want
to
have
an
order
to
what
we
have
in
our
sequential
data,
I.
Think
about
order
in
two
different
ways:
there
is
intrinsic
ordering,
which
is
more
commonly
called
strong
ordering.
B
This
is
where
there
is
like
a
natural
order
admitted
by
Nature
events
coming
in,
in
a
sequence
words
in
a
sentence,
different
samples
from
a
spatial
domain
over
time
these
are
have
a
natural
admitted
order
from
nature.
There
are
a
whole
other
set
of
orderings
that
come
might
get
a
lot
of
work
in
high
energy
physics,
and
this
comes
up
a
lot
in
sequence:
learning
in
high
energy
physics,
where
you
have
extrinsic
ordering.
So
this
is
an
ordering
that
is
not
coming
from
nature.
B
We
are
reading
if
you're
familiar
with
high
energy
physics
we're
reading
tracks
in
the
jet.
We
don't
actually
have
an
ordering
over
those
tracks,
so
we
kind
of
come
up
with
one
as
physicists
say.
The
transverse
momentum
should
be
the
the
the
ordering
should
be
an
order
of
transverse
momentum.
We
think
that
that
is
the
way
that
we
should
order
the
sequence
to
give
it
some
sort
of
weak
ordering.
So
you
see
this
a
lot,
especially
in
natural
science
applications.
So when we represent sequences, we have some sequence of vectors, and we index these on a general order — this can be time, or any of the extrinsic orders we talked about. Let's limit ourselves for a quick second to the case where we're doing typical supervised learning: how can we use traditional models to work on sequences? I'm not saying that you should — but how could we? Because this will give us a good inductive bias for how to construct deep learning models.
B
So
the
answer
as
I
was
mentioning
before
is
kind
of
summaries
or
reductions.
So
what
do
these
look
like
you'll
often
times
call
se
reductions
in
the
NLP
literature
refer
to
this
bags.
You
just
kind
of
take
a
bag
of
things
and
reduce
it
into
a
summary
feature,
but
assume
I
have
some
input
sequence.
How
do
I
get
that
into
a
single
number,
so
I
have
kind
of
maybe
a
vector
of
sensor
readouts
over
time.
Maybe
I
want
to
take
the
sensor
readout
from
sensor
I
and
take
the
mean
over
my
entire
time.
Sequence.
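As a minimal sketch of that kind of reduction (my own toy example, not from the talk): each variable-length sequence of readout vectors collapses to one fixed-size vector of per-sensor means.

```python
# Toy sketch: reducing variable-length sequences of sensor readouts to a
# fixed-size "summary feature" vector via a per-sensor mean over time.
def reduce_mean(sequence):
    """sequence: list of equal-length readout vectors, one per time step.
    Returns one fixed-size vector: the mean of each sensor over time."""
    n_steps = len(sequence)
    n_sensors = len(sequence[0])
    return [sum(step[i] for step in sequence) / n_steps
            for i in range(n_sensors)]

# Two events with different sequence lengths both map to 3-dim vectors,
# so a fixed-input model can consume them.
short = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
long_ = [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0], [4.0, 4.0, 4.0], [6.0, 6.0, 6.0]]
print(reduce_mean(short))  # [2.0, 2.0, 2.0]
print(reduce_mean(long_))  # [3.0, 3.0, 3.0]
```

Any fixed-input model can now consume the output — at the cost of throwing away the ordering entirely, which is exactly the trade-off discussed next.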
B
You
oftentimes
see
this
done
in
kind
of
older,
older
school
literature,
but
this
has
some
fairly
kind
of
prominent
disadvantages
when
you
think
that
the
ordering
or
the
temporal
component
of
your
problem
actually
has
meaning
for
determining
and
of
the
outcome
of
Y
that
you're
trying
to
predict.
So
you
lose
ordering
that's
a
huge,
huge
issue,
but
in
certain
domains
that
actually
may
be
okay.
B
So
one
of
the
things
that
I'm
not
going
to
get
into
today,
but
is
super
relevant
for
having
these
sorts
of
reductions
or
unordered
sequences,
so
actually
taking
like
some
number
K
of
vectors
associated
with
an
event
with
an
example
and
predicting
something
about
them.
They
don't
have
an
order.
We
don't
even
have
an
extrinsic
ordering
that
we
can
impose
over
it.
B
How
do
we
reason
about
this
big
collection
of
things
there,
at
least
in
the
deep
learning
literature
today,
people
use
still,
because
we
don't
really
have
a
very
good
mental
model
for
how
to
reason
about
unordered
sets
basically
yet,
but
this
is
a
pretty
strong
inductive
bias
as
to
you'll
hear
me
say:
inductive
bias
a
lot
today.
This
is
a
very
strong
inductive
bias
for
what
is
important
to
your
problem.
You're,
basically
saying
I
think
most
of
the
predictive
power
I
don't
need
it
from
the
actual
temporal
component.
B
I
can
just
deal
with
the
features
summarized
in
a
higher
level
form.
So
what
are
the
pros
of
an
approach
like
this?
There
interpretable,
but
sometimes
I-
think
interpretability
is
something
that
it's
hard
to
pin
down.
It's
usually
very
use
case
dependent
what
we
mean
by
interpretable.
People
will
say:
linear
models
are
not
interpret
or
are
interpretable
depending
on
kind
of
their
view
of
the
world,
so
I'm
not
gonna,
get
into
into
arguments
about
interpretability,
but
people
claim
that
reductions
then
using
traditional
machine
learning.
B
Algorithms
that
are
kind
of
working
on
fixed
sized
inputs
and
outputs
are
interpretable,
but
I'll
leave
that
at
that
I
personally,
can't
think
of
many
other
pros
from
a
pure
modeling
standpoint,
sometimes
they're
a
lot
faster
to
train.
That
can
actually
be
a
very
useful
thing
if
you're
dealing
with
absolutely
massive
massive
sequences,
maybe
you'd
rather
run
your
reduction
job
on
Hadoop
spark,
whatever
you're
using
and
then
take
that
and
build
a
traditional
model
on
that
that
actually
might
be
totally
valid.
given the scale of the problem you're working on — but that's a pretty use-case-specific advantage. What are the cons of using these sorts of reductions? These cons will creep up as we start to build deep learning models, which essentially build continuous approximations of a lot of these sum/max-style operations that act as a reduction. First, they are brittle when you create features — as I'm sure everyone who's listened to a deep learning talk has heard, deep learning removes the brittleness of feature design. In general, if you've ever looked at pre-deep-learning, basic sequential learning, you see a lot of crazy features: the value of this variable dips in periods two and five, doesn't dip in period seven, and then is twice as much in period eleven. You see a lot of these pretty brittle features that maybe work — but boy, do they take a lot of time to test and validate.
I think the bigger question here is: are you inherently limiting your performance? And the answer is yes. If your problem has a temporal component, you're expressly limiting the amount of information you can get out of your data by doing these reductions. Very similar to the first point about brittleness, you actually have a very hard time representing what you want to represent when you have a sequence modeling problem. I might have an idea that something 30 time periods back will indicate what will happen 30 time periods from now. It could be right, it could be wrong — but actually building the features that get that into a model can be very time-intensive, and, I would argue, not the best use of time. And I think the biggest problem that arises out of thinking like this is that you have a hard time predicting sequences this way. This works fine when you have a sequence as your input and a fixed thing as your output.
But we also need to be able to predict sequences as an output. Most machine learning, when you think about it, is classifying, doing some regression, doing some sort of survival analysis — time-to-failure prediction. We need to be able to extend to the case where we're predicting a sequence of things into the future.
With a fixed history, I literally just take my sequence of vectors x_i and feed them into whatever model I want, so there's nothing really to do there. Now throw in a temporal component and work with time series: it is definitely not trivial. As I said, if we're dealing with a fixed sampling rate — and we know this because that's our setup: either we're dealing with something like audio, or we actually do have a fixed, metronomic cadence with which we're sampling from our sensors — we're fine.
We can actually ignore time for the most part. But most time series models — excuse me, most time series domains — don't have this fixed-sampling nature, so we have heterogeneous time differentials between samples, and that can be a real problem for deep learning models. I'll caveat this right now: I don't think we've actually solved this properly as a field yet; I think there's still a lot of research in this direction.
If this is something of interest to you, I think there are actually a lot of very high-impact papers to be written in this domain. Some people have thought about this by incorporating a time gate inside an RNN layer, but this is a really, really greenfield problem. People have hacks — I'll show you three hacks, basically, that get these heterogeneous-time-differential sequences to work — but at the end of the day we're actually not there yet as a field, and machine learning has had a hard time dealing with this for a long time. There are traditional spline-based methods that can work when you're dealing with very simple models, but for very expressive models we're just not quite there yet. So let's draw a picture — yes?
Yes — the classic one is time series from the stock market. You maybe have very regular samples during the day; on the weekend you don't have samples; on national holidays you don't have samples. So, constructing some sort of model, you need to interpolate between those. Another pretty classic thing that will happen is there's just a stochastic delay in the system you're sampling from: you're reading out from some sensor, and readings will come in at different times.
These machines can all be out of sync, so even if the sampling rate you're getting from each of them is constant, you get this staggered, weird pattern, and you need to be able to handle that kind of non-uniform nature. There are a lot of other cases that tend to come up.
Actually, audio is a really easy example. In a very clean case you'll be able to just have a constant sampling rate, but in a lot of real-world audio you get cutouts, and you have to stitch together a sequence from potentially different sampling rates — this audio was collected here at six hertz, that one with an elastic scheme — and that can be really, really problematic when you're trying to actually learn.
So let's draw a picture for this quickly. I have some sequence coming in — my x's — and these are all happening at different times: staggered, not uniform. How can we deal with that? There are basically three options if you want to use deep learning — there are a few others, but for intents and purposes there are three. The first one is you just ignore the temporal component. I tend to do this a lot; most people tend to do this a lot.
The second option is you can resample: you can resample your time series to make it uniform. And the third is you can use the time delta as a feature. So these are three very simple ways you can handle this.
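Here is a toy sketch of the three options on a made-up series of `(timestamp, value)` pairs (my own illustration, with hypothetical numbers, not from the slides):

```python
# Toy sketch: three simple ways to handle a non-uniformly sampled
# series of (timestamp, value) pairs.
series = [(0.0, 1.0), (1.0, 2.0), (2.5, 4.0), (6.0, 8.0)]

# Option 1: ignore time -- keep only the values, in order.
ignored = [v for _, v in series]

# Option 2: resample onto a uniform grid (here: carry forward the most
# recent observation at each grid point).
def resample(series, step, n_steps):
    out, j = [], 0
    for k in range(n_steps):
        t = k * step
        while j + 1 < len(series) and series[j + 1][0] <= t:
            j += 1
        out.append(series[j][1])
    return out

# Option 3: keep the values, but append the time delta since the
# previous sample as an extra feature.
deltas = [(v, t - series[i - 1][0] if i else 0.0)
          for i, (t, v) in enumerate(series)]

print(ignored)                   # [1.0, 2.0, 4.0, 8.0]
print(resample(series, 2.0, 4))  # [1.0, 2.0, 4.0, 8.0]
print(deltas)                    # [(1.0, 0.0), (2.0, 1.0), (4.0, 1.5), (8.0, 3.5)]
```

Each option produces something a fixed-cadence model can ingest; the trade-offs of each are discussed below.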
So what do these look like pictorially? If we ignore time, it's exactly what I said: just strip out the timestamps and stick all the vectors together, as if they were coming from a uniform sampling.
B
So
what
can
go
wrong
if
we
do
this
well,
what
could
go
right
first.
Is
that
you
ignore
a
lot
of
the
complexity
behind
the
temporal
component
and
you
can
feed
these
directly
into
whatever
models
you're
using.
So
that's
great,
but
the
big
downside
is.
You
can
actually
lose
a
pretty
critical
feature
of
your
data,
and
that
is
the
time
difference
between
current
time
step
and
the
previous
time
step.
That
can
actually
be
really
really
indicative
of,
for
example,
a
machine
failure,
for
example,
some
future
time
to
event.
B
In
your
sequence,
the
time
differential
can
be
super.
Super
super
important
I
will
say
that
it's
generally
not
a
bad
idea
to
try
as
a
baseline.
It
is
imposing
the
inductive
bias
that
time
differential
is
not
important
to
your
problem,
but
it's
a
reasonable
baseline
to
try
and
if
you
perform
well,
you've
saved
yourself
a
lot
of
data
processing
work.
So
that's
never
a
bad
thing.
B
So
another
option
is
we
can
resample,
so
I
can
kind
of
say:
hey
I
want
a
sample
here.
I'm
sampling
at
24
second
intervals
and
I
want
to
construct
a
sequence
that
sampled
at
individual
24
second
intervals.
This
will
be
perfect.
I
now
have
kind
of
a
uniform
sampling
rate
sequence
to
be
able
to
feed
into
my
model.
All
is
golden
right,
not
quite
what
do
I
do
with
X
3
and
X
4.
B
If
I'm
I
need
to
have
some
sort
of
reduction
to
bring
these
into
the
time
window,
so
people
type
people
oftentimes
do
some
sort
of
mean
averaging
they'll.
Take
the
latest
one,
so
they'll
just
take
X
4
and
call
that
X
3
people
have
a
lot
of
different
packs
in
order
to
get
the
sampling
rate
to
work
out
correctly.
B
What
happens
in
the
gap
that
is
going
to
lead
to
X,
for
how
do
we
get
a
sample
in
for
X,
for
when
we
have
these
and
uniform
time
windows
that
are
going
into
the
sequence
once
again
many
ways
of
doing
this
oftentimes
when
you
are
dealing
with
continuous
inputs,
when
my
exes
are
continuous,
we'll
do
some
sort
of
interpolation
to
get
kind
of
a
value
to
impute
into
X
4.
So,
let's
sort
of
cubic
interpolation
between
all
the
features
and
they'll
impute
that
into
export.
B
You
will
oftentimes
use
kind
of
a
backwards,
filling
approach
where
you
say:
hey,
I,
don't
have
anything
in
my
sampling
window
that
would
produce
my
export.
Let
me
just
take
the
previous
X
3
and
just
copy
that
over
you
can
argue
whether
or
not
that's
a
good
idea
or
not.
You'll
oftentimes
see
this
particularly
done
for
discrete
inputs.
So
if
I
don't
have
the
ability
to
do
some
sort
of
interpolation,
because
I
have
a
discrete
level
set
or
something
they'll,
just
kind
of
copy,
the
previous
level
set
value
over
to
the
next
time
set.
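As a minimal sketch of those two imputation styles (my own toy numbers — linear interpolation stands in for the fancier cubic variant mentioned above): interpolation for a continuous sensor, and carrying a value forward for a discrete one.

```python
# Toy sketch: imputing a missing grid point by linear interpolation
# between the nearest observed samples (continuous inputs), vs. copying
# the previous value forward (discrete inputs).
def lerp_at(t, t0, v0, t1, v1):
    """Linear interpolation of the value at time t between (t0, v0) and (t1, v1)."""
    w = (t - t0) / (t1 - t0)
    return v0 + w * (v1 - v0)

# Continuous sensor: observed at t=2.0 and t=6.0, grid point needed at t=4.0.
print(lerp_at(4.0, 2.0, 10.0, 6.0, 30.0))  # 20.0

# Discrete level set: just carry the previous level forward.
prev_level = "HIGH"
imputed = prev_level
print(imputed)  # HIGH
```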
So what does this give us? It gives us the ability to ingest data in the format models are expecting — boom, boom, boom, boom: discrete, uniform time steps. But, oh man, we are trading a lot for data pipeline and pre-processing complexity. You now also have an additional set of hyperparameters — or, let's say, data processing parameters — to tune. How do I want to aggregate multiple samples within a window?
B
Do
I,
take
the
mean,
do
I,
take
do
I,
take
the
mean
and
then
sample
from
some
sort
of
normal
distribution
around
that
to
absence
of
gas.
This
ax
t
do
I,
take
them
max.
How
do
I
do
that?
You
also
then
have
to
choose
the
sampling
frequency
and
my
little
example
here.
I
just
kind
of
drew
some
triangles
that
looked
approximately
equal,
but
for
a
real
application,
you're
gonna,
say:
okay,
I
have
all
these
things
coming
in
at
random
intervals?
how do I do that kind of forward-filling procedure I was describing before? When I don't have any samples in a time window that I need an x value for, what do I choose to put in there? By filling from the previous time step, you can get really, really bad biases. Maybe a really rare event occurred in my previous time step; I probably don't want that rare event to recur in my next time step, even though that's exactly what filling forward will do.
B
So
these
are
all
kind
of
things
to
consider,
and
yet
another
dimension
that
you're
gonna
need
to
kind
of
tweak
you're
a
data
pre-processing
pipeline,
which
I'm
sure
none
of
us
want
to
do
more
of
so
something
to
consider.
But
it
can
add
a
significant
amount
of
complexity
to
how
you
handle
your
data
before
passing
a
model.
The third option: the time delta for x1 is 0, and from then on we look at the time difference between the current time step and the previous time step — the timestamp differentials — and include that as a numeric feature. You can argue whether including a numeric feature for the time differential is a good or a bad thing; let's assume it's a good thing for now. This actually gives us a sense of how long it has been since our last event — our last sample from the sequence.
What does this give us? Once again, the ability to feed directly into any model that is expecting sample, sample, sample, sample. The one thing it definitely does do is over-index hard on the time differential as a feature. Maybe this is not relevant to your problem, and you just did a bunch of extra work — that's kind of why I was suggesting starting with the baseline that ignores time.
I think the other thing that can happen here is that this assumes I'm only really affected by the time differential from my last step. Should I care about the time difference between my current sample and two time periods ago? Three time periods ago? K time periods before that might be very important — I don't know. This will be yet another thing you'll have to tune: your look-back window for the time differentials.
We'll talk about this a little bit later with RNNs, but there are some really, really pathological compounding effects that can happen when you deal with sequence models, so having one number that's totally outside of the usual range can really throw a wrench in the machine — pardon the pun. Normalize features to comparable ranges. You don't need to normalize everything to zero mean and unit variance — not necessary — but just make sure everything is comparable. If you have a bunch of things between zero and one — 0.1, 0.03, 0.05, something like that — fine; don't also throw in something that can take a value of, like, 3000. That's really going to mess up the training of most of these sequence models.
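A minimal sketch of one way to do that (min-max scaling into [0, 1]; my own illustration — the talk doesn't prescribe a specific scheme):

```python
# Toy sketch: min-max scaling a feature into [0, 1] so one wide-range
# feature (e.g. values up to 3000) doesn't dominate features that live
# between zero and one.
def min_max_scale(column):
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

raw = [0.0, 750.0, 1500.0, 3000.0]
print(min_max_scale(raw))  # [0.0, 0.25, 0.5, 1.0]
```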
A second point, which is relevant for sequence models but also just for general fixed-input, fixed-output models — general supervised learning: use embeddings for discrete data. Using embeddings gives you a lot of pretty key advantages.
Essentially, what using an embedding does is say: all the levels of my discrete feature now map to a vector in a lookup table, and every vector in this lookup table is trained as I'm training my model, such that the vector represents where in semantic space this particular discrete level of my feature sits. The classic example of this comes from NLP — natural language processing — with word vectors, where, by training a model with words indexed to individual vectors,
B
You
get
some
really
really
nice
semantic
properties
in
the
vector
space,
particularly
able
to
kind
of
do
basic
arithmetic
on
word
vectors.
This
is
a
table
that
kind
of
floated
around
on
Twitter
and
medium
and
it's
kind
of
risen
and
fallen
in
popularity,
as
people
have
discovered
and
rediscovered
word
vectors.
But
you
can
do
interesting.
Things
like
take
Paris,
subtract,
France
and
add
Italy
and
you'll
get
Rome,
so
these
are
kind
of
semantic
vector
spaces,
and
the
idea
is,
if
you
take
your
discrete
feature
and
you
embed
it
in
a
vector
space.
B
While
you
train
the
vector
space,
should
emit
kind
of
an
interpretable
kind
of
set
of
arithmetic.
That
will
happen
even
if
it's
not
interpretive.
All
the
amount
of
information
you
can
encode
in
a
vector
of
some
dimension
is
like
definitionally
higher
than
that
of
just
using
a
one,
hot
encoding,
so
just
kind
of
using
a
dummy
variable
on/off,
you
can
get
some
pretty
big
gains
and
models
that
are
using
dummy,
yes/no
flags
for
a
discrete
feature,
but
I
just
using
embeddings.
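The Paris − France + Italy ≈ Rome arithmetic can be sketched with a tiny lookup table (hand-made 2-d vectors chosen so the analogy works — real embeddings are trained, much higher-dimensional parameters):

```python
# Toy sketch: a discrete feature's levels map to dense vectors in a
# lookup table instead of one-hot flags. Vectors here are hand-set;
# in a real model they are trainable parameters.
embedding_table = {
    "paris": [2.0, 9.0], "france": [1.0, 8.0],
    "rome":  [4.0, 3.0], "italy":  [3.0, 2.0],
}

def analogy(a, minus, plus):
    """Compute a - minus + plus in vector space, return the nearest word."""
    va, vm, vp = embedding_table[a], embedding_table[minus], embedding_table[plus]
    q = [va[i] - vm[i] + vp[i] for i in range(2)]
    def dist2(w):
        vw = embedding_table[w]
        return sum((q[i] - vw[i]) ** 2 for i in range(2))
    return min(embedding_table, key=dist2)

print(analogy("paris", "france", "italy"))  # rome
```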
Another cute image of what embeddings can do, just as quick motivation: they learn these kinds of superlative relations, which is quite awesome. You may not be able to see the text from the back, but you get interestingly similar relationships in vector space between "slowest", "slower", and "slow", and between "shortest", "shorter", and "short" — they occupy a similar arc in vector space. So it's a cute anecdote for why there's a lot of information to be stored in vectors, and why you should use them when dealing with discrete features.
So how do we organize the world of sequence learning? So far we've only talked about how we represent the inputs: what are some unique characteristics of the inputs of sequence models, how should we reason about them, how can we deal with time — all these nitty-gritty, data-level questions.
B
So
there
are
at
least
in
kind
of
my
super
reductivist
view
of
the
world.
For
the
purposes
of
this
lecture,
there
are
four
archetypes
of
sequence
models
I
stole
and
modified
this
image
from
Andre
Karpov
II,
as
did
not
have
the
one
too
many
I
added
the
one
too
many,
because
I
was
missing
so
for
kind
of
rough
prototypes
for
how
to
build
a
model.
The
first
I
like
to
call
predictive
so
we're
taking
sequence
of
inputs
and
predicting
a
fixed
output.
B
The
second
I
call
abstracted,
because
essentially,
what
we're
doing
is
we're
taking
a
sequence
as
input,
we're
learning
something
about
that
sequence
and
then
we're
generating
another
sequence:
we're
transducing
another
sequence
from
the
original
sequence.
So
this
became
super
popular
and
machine
translation
where
you're
taking
a
sentence
in
a
source,
language,
English
and
outputting.
The
translation
in
French,
so
you're
actually
transducing.
These
sentence
into
a
new
domain,
the
second,
so
that's
kind
of
many
to
many.
The
second
many
to
many
archetype
with
in
sequence,
learning
is
what
I
call
labeling.
B
So
this
is
saying
for
each
time
step
for
each
element.
In
my
sequence,
let
me
predict
something
this
can
either
be
predict
what
the
next
element
in
the
sequence
should
be,
or
this
could
be
predict
the
pressure.
This
could
be
predict
whether
or
not
this
is
a
signal
or
background
sample
from
my
sequence,
but
just
in
general
for
every
element
in
the
sequence,
we're
predicting
a
thing
and
the
final
a
little
bit
more
obscure,
maybe
archetype
of
sequence.
B
Learning
is
for
a
lack
of
my
creativity,
captioning
and
the
reason
it's
casually
is
because
it's
usually
used
for
image.
Captioning
kind
of
the
idea
here
is
I.
Take
a
fixed
dimensional
input
and
I
am
able
to
decode
a
sequence
off
of
that
fixed
dimensional
input
in
images.
This
is
taking
a
images,
input
and
producing
a
description
of
what's
going
on
in
the
image
or
answering
a
question
about.
What's
going
on
in
the
image,
I
haven't
seen
examples
of
this
in
scientific
domains.
B
Yet
I
think
it
would
be
really
cool
if
one
of
you
could
come
up
with
I
tried
to
Rack
my
brain
for
the
past
few
days
to
think
of
an
interesting
one,
but
I
pretty
narrow,
I've
only
kind
of
worked
in
very
specific
area
at
that.
So
I
think
it
would
be
really
cool
if
we
came
up
with
something
that
was
fixed
input
variable
length
decoding
today,
most
of
kind
of
the
insights
I
think
were
gonna.
Get
are
gonna
come
from
the
first
week.
So — just to repeat the question for those who didn't hear: when dealing with gaps in sequence data, should we have some sort of model that can give a machine learning model the ability to fill in the gaps as we're training? I think there are two schools of thought here.
If you do have some model that can fill in the gaps, fill them in, and then validate — yes or no — whether it actually affects your downstream task performance. Without knowing more details, that's probably the most useful thing I can say; I think it's one of those things to just evaluate from your downstream metric, however you quantify system performance.
B
Actually,
don't
see
why
that
wouldn't
work
I
haven't
seen
any
applications
of
that.
Since
most
of
these
things
are
developed
in
NLP,
you
usually
don't
see
some
sequences,
but
that
would
be
a
really
cool
architecture
to
try
I,
don't
think,
wouldn't
work
yeah,
so
I
would
say,
promote
if
you're
not
dealing
with
language.
B
You
would
learn
the
embeddings
as
part
of
your
training
procedure
I'm.
The
idea
is
you're,
embedding
tool
aligned
such
that
they're
maximizing
the
dimensions
of
information
that
are
most
important
to
your
problem.
If
you're
dealing
with
text,
not
sure
if
you
will
be,
you
can
use
pre-training
betting's,
but
kind
of
the
idea
of
embeddings
extends
well
beyond
kind
of
dealing
with
pre-trained.
Where
to
the
fact
that
comes
from
google,
you
can
use
embeddings
and
they
will
align
with
in
training
to
be
maximally.
So, with many-to-one: I would say this is the archetype you will see the most. There are tons of applications of it, and taking a sequence in and predicting something about the sequence is usually pretty low-hanging fruit: taking a sequence and saying signal or background; taking an input video, for example, and stating whether the video is, I don't know, A or B. But the really key thing to get out of the many-to-one archetype is that a lot of what you learn about fixed-input, fixed-output models
B
You
can
use
to
reason
about
these
as
well.
You're,
probably
going
to
be
using
the
same
loss
functions.
I
doubt
you'll
be
changing
things
there
and
when
you
start
dealing
with
sequential
outputs
your
losses
starting
getting
a
little
bit
different
little
bit
wonky,
but
for
variable
in
fixed
out
a
lot
of
the
mental
models
that
you
develop
for
kind
of
standard,
fixed
and
fixed
out.
Learning
tend
to
work
pretty
well,
so
you
can
kind
of
swap
on
a
classification
layer,
regression
layer.
B
You
can
do
some
sort
of
survival
analysis
thing
tack
it
on
top
and
those
should
all
kind
of
work
out
of
the
box
for
many-to-one.
One
key
thing
about
all
these
archetypes,
and
this
one
in
particular,
is
the
flexibility,
so
I
kind
of
hinted
at
it
just
kind
of
rambling
there
a
second
ago,
but
the
inputs,
the
kind
of
the
red
boxes
that
go
into
this
archetype.
They
can
be
really
really
complex
data.
Each
one
of
these
can
be
an
image.
There's nothing stating that this has to be a boring vector of inputs. Let's say you have some readout from a telescope — some big image — and it's evolving over time: that can be used here. Every input to the sequence can be an image. So there's nothing about this that restricts the modality of the red boxes, which makes all of these frameworks quite flexible.
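A minimal many-to-one sketch (my own toy example with hand-set weights; a real model would learn both the pooling and the head): pool the variable-length sequence into one fixed vector, then apply a standard fixed-input head.

```python
# Toy sketch of a many-to-one model: variable-length sequence -> pooled
# fixed vector -> standard fixed-input head (here a hand-set linear scorer).
def mean_pool(sequence):
    n = len(sequence)
    return [sum(x[i] for x in sequence) / n for i in range(len(sequence[0]))]

def linear_head(vec, weights, bias):
    """Fixed-input scorer; swap in classification/regression as needed."""
    return sum(w * v for w, v in zip(weights, vec)) + bias

seq = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]  # 3 time steps, 2 features each
pooled = mean_pool(seq)                      # [2.0, 2.0] -- fixed size
score = linear_head(pooled, [0.5, -0.5], 0.1)
print(score)  # 0.1
```

The same structure carries over when each element of `seq` is itself an image feature vector; only the pooling and head change.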
So let's talk quickly about the second archetype, which is many-to-many. You'll oftentimes see this referred to in — big scare quotes — deep learning modernity as sequence-to-sequence learning, even when sometimes it's actually not sequence-to-sequence. The general idea, as was alluded to before, is that you're taking some length-K sequence as input and producing a length-L sequence as output, where K and L can vary and can be completely independent.
B
That
can
be
quite
a
hard
problem.
So
in
the
deep
learning
archetype
for
this
you
end
up
breaking
down
your
world
into
an
encoder
and
a
decoder,
the
job
of
the
encoder
and
a
many-to-many
model
that
follows
the
sequence,
a
sequence
paradigm,
but
the
job
of
the
encoder
is
to
take
the
input
sequence
and
summarize
it
I
said:
summarization
was
bad
at
the
beginning.
B
Whether
this
might
be
new
weather
samples
for
the
next
seven
time
steps.
This
could
be
taking
a
source
sentence
and
returning
a
summary
of
the
sentence.
But
I
need
to
have
this
kind
of
summarize
that
this,
like
this
summary
in
the
middle,
this
vector
this
fixed
length,
vector
B,
maximally
informative
for
my
decoder
to
work
well.
B
So a key, key question here is: if you put on your traditional-ML hat, we're normally just predicting a single fixed thing. How do I predict a thing that is variable-length? That can be quite tricky. How do I know that I should stop predicting at time step 7? Is that right? Is that wrong? Who knows? How do we reason about that?
B
So this gives us a differentiable approximation to this stop, and this is commonly referred to as a stop token in NLP, but I think the application is much broader than just NLP. Being able to predict when to stop decoding a sequence is super critical if you have this different length between your input and your output. Most of the problems this has been used on haven't been that long, as you would imagine, but I don't actually imagine this being any different.
B
If
the
signal
is
there
for
the
stop
to
be
predicted,
it
should
be
able
to
learn
that
this.
There
are
other
kind
of
engineering
concerns
that
come
up
with
very
very
long
sequences
they're
just
really
hard
to
fit
on
a
GPU.
But
if
the
signal
is
there
for
you
to
stop,
if
there
is
like
signaling
the
noise
to
learn
when
to
stop
I,
don't
see
why
I
wouldn't
kind
of
be
able
to
perform
that
once
again,
I
would
test
it,
but
there's
nothing
kind
of
theoretically
or
from
a
prior.
B
That
would
tell
me
that
that
wouldn't
work
yeah
so
in
the
in
the
I'll
talk
about
LLP
and
then
non
NLP
and
non
in
NLP.
You
just
kind
of
have
an
additional
word
in
your
vocabulary,
which
is
stop
in
non
NLP.
You
end
up
having
two
losses:
you'll
have
your
loss,
that's
for
your
problem
that
you
actually
care
about,
which
is
predicting
the
next
element
of
the
sequence.
So
this
could
be
predicting
out
the
sensor
read
us:
did
we
predicting
temperature
pressure
or
whatever,
so
that
could
be
at
some
regression
loss?
B
Then
you
have
an
additional
loss,
which
is
like
a
classification
loss
which
will
predict
for
the
next
took
and
whether
or
not
X
or
not,
for
the
next
element
of
the
sequence,
whether
or
not
exile.
So
you
end
up
kind
of
having
this
two
losses.
Of
course,
then,
there's
a
whole
other
set
of
problems
which
arise
and
how
to
balance
these
two
losses,
but
in
principle
there's
kind
of
nothing
that
stops
you
from
combining
a
classification
loss
that
tells
you
when
to
stop
from
a
domain
loss.
That
tells
you
what
your
next
sequence
should
be.
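As a rough sketch of that two-loss setup (all names here are hypothetical, and the fixed loss weighting is just illustrative; in practice balancing the two terms is itself a tuning problem):

```python
import numpy as np

def decoder_step_loss(pred_value, true_value, stop_logit, is_last, stop_weight=1.0):
    """Combined loss for one decoding step: a regression loss on the
    predicted sequence element plus a binary cross-entropy loss on a
    'stop' classifier (target 1 = this is the last element)."""
    # Domain loss for the prediction we actually care about
    # (e.g. temperature, pressure): plain mean squared error.
    reg_loss = float(np.mean((pred_value - true_value) ** 2))
    # Binary cross-entropy for the stop decision, from a raw logit.
    p_stop = 1.0 / (1.0 + np.exp(-stop_logit))
    target = 1.0 if is_last else 0.0
    stop_loss = -(target * np.log(p_stop + 1e-9)
                  + (1.0 - target) * np.log(1.0 - p_stop + 1e-9))
    return reg_loss + stop_weight * stop_loss
```

Summing the per-step losses over the decoded sequence gives the total training objective.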
B
So
the
second
archetype
within
many
to
many
is
what
I
like
to
call
sequence,
labeling,
so
sequence,
labeling
is
super
super
flexible.
Actually,
there
are
kind
of
two
things
that
people
generally
do
with
sequence:
labeling
the
either
are
predicting
the
next
time
step.
What
is
my
value
of
my
sequence
going
to
be
and
time
T,
plus
one
or
they're,
predicting
some
observables
that
they
care
about
from
the
system
at
time
T?
B
So
these
are
kind
of
the
two
approaches
that
people
tend
to
take
you'll
oftentimes
see
for
problems
where
we
care
about
forecasting
just
for
the
next
time
step.
People
will
set
up
their
problem
like
this.
So
if
I'm
doing
a
stock
prediction
problem,
let's
say
I,
you
know,
stock
prediction
is
pretty
trite,
but
it's
kind
of
a
classic.
You
would
feed
in
the
stock
prices
at
every
time
step
and
just
be
predicting
out
the
stock
price
at
the
next
time
step.
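A minimal sketch of how that next-step setup becomes supervised (input, target) pairs; the sliding-window scheme here is one common choice, not the only one:

```python
import numpy as np

def next_step_pairs(series, window):
    """Turn a 1-D series into (input window, next value) training pairs
    for next-step forecasting, as in the stock-price example."""
    X, y = [], []
    for t in range(len(series) - window):
        X.append(series[t:t + window])   # values at steps t .. t+window-1
        y.append(series[t + window])     # value at the next step
    return np.array(X), np.array(y)
```

Any regressor (an RNN, a 1-D ConvNet, or even a linear model) can then be trained on these pairs.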
B
This
will
oftentimes
come
up
when
you
are
dealing
with
and
a
sequential
forecasting
where
you
just
kind
of
care
about
the
next,
the
case
where
you're
predicting
something
about
the
current
time
step
T.
That
also
tends
to
arise
pretty
frequently
in
NLP.
This
is
an
a
predicting
part
of
speech
or
word,
but
more
broadly,
this
can
kind
of
be
whether
or
not
an
element
of
my
sequence
of
machine
readouts
is
a
failure.
Mode
or
not.
B
I
can
be
predicting
whether
or
not
the
machine
is
in
an
arid
state,
at
kind
of
any
point
along
my
time
series
which
is
reading
out
kind
of
something
from
a
sensor
greater.
So
this
is
a
pretty
flexible
framework
for
being
able
to
reason
about
every
state,
so.
B
The
one-to-many
archetype-
this
is
the
one
that
I
said:
I,
don't
have
good
examples
from
kind
of
something:
that's
not
NLP,
so
this
in
in
the
world
of
NLP.
This
is
the
case
where
we
will
take
either
an
image
or
a
fixed
piece
of
data
and
decode
a
description
of
the
data
in
kind
of
natural
language.
So
the
idea
here
is
you,
take
fix
input
and
at
every
time
stuff,
as
you're
decoding
you're,
looking
back
at
the
image
or
you're.
B
Looking
back
at
the
fixed
amount
of
data
and
you're
saying
what
should
I
predict
next
until
I
then
predict
you
to
stop
decoding
my
sequence.
I'm
example
that
maybe
doesn't
fall
into
NLP
is
given
a
given.
The
frame.
1
can
I
predict
the
next
4
24
frames
of
video.
You
could
do
something
like
this
to
kind
of
generate
little
videos
from
a
static
image.
I
could
imagine
that
potentially
being
useful,
but,
like
I
said,
the
canonical
problem
really
is
image
captioning.
B
There are some idiosyncrasies of sequence learning models that we need to have reflected in our deep learning building blocks: temporal invariance, and we need to be able to control the explosion and reduction of gradients. There are a lot of edge cases that come up in sequence learning, so let's take a look through some of the basic building blocks here.
B
So
this
is
a
very,
very
active
area
of
research,
I'm
kind
of
almost
every
day,
some
new
big
company
or
big
research
group
has
a
new
variant
on
an
existing
kind
of
approached
for
modeling
sequences.
We're
gonna
cover
the
main
three,
the
first
two,
a
little
bit
more
in
depth
in
the
third
we're
going
to
cover
rnns
recurrent
neural
networks,
convolutional
neural
nets
and
transformers.
There
are
a
ton
of
other
variants
which
you
will
see
come
up.
There
are
quasi
RI
towns.
B
There
are
like
a
attentive
convolutions
they're,
like
gated
convolutions,
there's
a
whole
kind
of
zoo
of
modifications
of
these
three
building
blocks
that
have
come
up,
but
my
hope
is
that
by
just
kind
of
reasoning
through
the
these
three
basic
ones,
you'll
be
able
to
get
a
sense
of
what
the
field
looks
like
from
high
level
and,
like
I
said
at
the
very
beginning,
be
able
to
Google
and
reason
about
things
that
you
read
in
Ivor.
So,
let's
start
with
Ardennes,
so
the
core
of
an
RNN
is.
B
We
want
to
have
a
neural
network
unit
that
can
learn
kind
of
a
sequence
based
dependency
over
time.
And
what
does
this
look
like?
So,
if
I
have
some
input,
that's
coming
in
X
kind
of
over
time,
sequentially
I
want
to
be
able
to
kind
of
loop
back
through
modify
a
state
and
produce
an
output.
So
this
is
a
hard
diagram
to
reason
about.
Why?
B
Don't
we
unfold
this,
where
it's
a
little
bit
easier
to
reason
about
every
time
step,
I
have
a
vector
X
coming
in
I'm,
basically
taking
a
look
and
saying:
do
I
want
to
modify
my
state
and
the
idea
is
this
internal
state,
as
we
kind
of
chunk
through
the
sequence,
we'll
learn
useful
things
about
my
task
at
hand
predicting
why,
for
example,
so
I
have
this
internal
state?
I
was
like
chunk
through
my
sequence:
I
am
learning
what
to
store.
B
What
not
to
start
a
key
element
here
is
at
the
transformation
that
is
kind
of
going
and
taking
X
and
producing
this.
The
hidden
state
is
invariant
per
time
step.
So
I
basically
have
one
neural
network
unit
that
is
just
getting
stamped
out
four
times
up,
so
that's
pretty
powerful.
That
gives
us
kind
of.
We
don't
have
to
engineer
features
that
kind
take
into
account
the
temporal
component
anymore.
We
have
something
that
just
gets
applied
time
step
by
taxa.
B
So what are some issues that can arise with a model such as this? We're going to take a quick detour into the internals of how a vanilla RNN works to try to shed some light on this. This is the most basic RNN you can draw. Let's walk through what the math is telling us here, quickly, at an intuitive level. We're saying: this is the state that I'm going to next, given my input. So how do I get to that state?
B
So what happens when we try to take a gradient? You don't need to memorize the proof; this is just to get some intuition for why training RNNs is hard. We're going to take the derivative of some loss function with respect to the parameters W_rec. This is the thing that takes the previous state and moves it to the next state. So there's an
B
Element
in
here,
which
is
when
we're
taking
the
derivative
of
current
state
with
respect
to
every
previous
state
that
has
happened,
kind
of
in
my
sequence,
but
every
previously
thought
I
predicted
in
my
sequence,
so
is
actually
a
product
of
jacobians.
So
we
don't
need
to
know
how
we
get
to
this
proof,
but
let's
kind
of
step
through
kind
of
what
the
math
will
tell
us.
B
So
we
have
this
and
an
infinite
product,
this
very
large
product
and
I'm
over
my
entire
length
of
my
sequence
of
multiplying
all
of
these
psi
psi
minus
1
0
o'clock.
So
we
can
decompose
it
and
then
we
can
use
some
tricks
to
try
to
bound
the
norm
of
every
single
element
in
this
very,
very
long
product
about
and
because
we
know
how
to
decompose
this
as
a
product
which
ability
is.
B
We
can
then
split
apart
w
rec
itself
a
diagonalize
component,
and
we
can
bound
this
above
by
the
two
eigenvalues
one
of
w
one
of
diagonal
component.
Here.
We
don't
need
to
worry
how
we
get
there.
But
the
interesting
thing
here
is
since
every
single
element-
WSI
sorry
D-
si
over
D
si
minus
one
is
kind
of
a
product
of
two
numbers
that
are
eigenvalues
when
we
take
this
really
really
long
product
you're,
taking
a
lot
of
things
and
multiplying
them
together.
B
When
you
take
a
lot
of
things
and
multiplying
them
together,
you
kind
of
have
a
point
of
criticality
around
one.
So
if
you
have
small
eigen
values,
you
can
rapidly
go
to
zero
and
the
gradient
can
rapidly
go
to
zero
or
if
you
have
very
big
eigenvalues,
the
gradient
can
explode.
So
you
have
something
that's
getting
multiplied
over
and
over
and
over
again,
and
your
point
of
criticality
ends
up
around
one.
B
You
can
explode
or
you
can
kind
of
shrink
if
it
doesn't
ways
fall
in
your
gradient
calculation,
so
that
can
be
super
super
problematic
when
you're
dealing
with
very
long
sequences.
So
this
is
kind
of
an
intuitive
argument
for
if
we
take
the
norm
of
kind
of
an
element
of
my
of
the
gradient
I
can
reason
about
what
will
happen
when
I
have
very,
very,
very
long
Z.
Yes,.
B
So the question arises: how do we fix this? This is commonly referred to as the vanishing or exploding gradient problem. Intuitively, the solution is to decide what to forget and what to remember. Essentially what ends up happening, if you want to personify an RNN, is that they have really, really vivid memories of things that are totally irrelevant, which is kind of a funny mental model to have about an RNN. So we need to control
B
How
updates
affect
that
little
very,
very
large
product
or
very,
very
small
product
that
happens
in
the
middle
of
the
gradients
of
so
the
solution
is
list
yet
from
hope,
writer
and
Shui
Huber,
so
I'm
just
going
to
talk
about
Alice
Ian's
in
particular,
because
they
have
they
were
kind
of
the
first
to
control
this
mechanism,
but
there
have
been
a
lot
of
other
developments
around
here,
use
other
variants
of
the
state
management
of
our
intent.
That
absolve
this
problem
intuitively.
B
What
an
LS
TM
does
is
it's
allowing
for
a
trainable
decision
function
to
be
embedded
inside
the
Dora
and
what
this
will
do.
Is
you
basically
now
have
a
trainable
differential
ability
to
forget
and
remember
state,
so
at
every
time
step
your
model
is
saying:
do
I
want
to
remember
what
happening
now?
Do
I
want
to
forget
it?
B
Do I want to write it to my memory, or do I read from memory? So you now have four operations that help you systematically process your sequence in a way that won't let you dramatically forget, or over-remember, irrelevant time steps in your sequence. Pictorially (I stole this image from Michaela):
B
Specifically,
they
have
kind
of
three
key
mechanisms
that
are
added,
no
need
to
know
the
math
that
goes
into
this
directly,
just
kind
of
useful
to
take
a
look
at
the
picture
and
understand
the
individual
components
that
are
affecting
this
behavior.
So
there's
a
forget,
eight,
the
forget
gate
is
basically
saying
how
much
of
the
previous
state
should
be
retained
in
the
new
state.
How
much
should
I
forget
the
input
gate
is
deciding
how
much
of
the
current
times
that
actually
matters
for
the
problem.
B
I
might
process
a
time
step
and
look
at
it
and
say
hey.
This
is
totally
irrelevant
for
my
task.
Just
forget
it
and
the
output
gate
is
kind
of
controlling
the
the
mixture
of
all
of
these
gates
in
composing.
The
final
state,
so
you've
got
three
trainable
components.
Basically
that
are
deciding
what
to
keep
and
that's
a
really
really
powerful
mechanism
that
helps
us
control
the
gradient,
but
also
gives
us
a
pretty
interpretable
way
of
reasoning
about
the
like
differentiable
state
of
Merida.
B
Can
we
have
a
state
to
process
a
sequence
in
Reverse
and
this
ends
up
being
actually
quite
important
in
a
lot
of
NLP
applications
and
actually,
even
in
high
energy
physics,
it
was
figured
out
that
the
bi-directional
are
n
ends
which
we'll
talk
about
in
a
second
here
are
very,
very
useful
for
performing
law.
Yes,
not
necessarily,
there
is
kind
of
this
notion
of
memory,
so
you
can
like
take
a
state
and
write
it
into
like
you
have
kind
of
a
matrix
that
gets
passed
in
through
an
LSD
M.
B
That
is
the
memory,
so
you
can
take
a
state
and
like
shove,
it
into
memory
and
then
recall
it
again
in
a
future
subset,
so
you
can,
for
that
would
be
the
H
cool.
So
if
we
kind
of
run,
we
can
run
an
RN
n
forwards
and
we
gonna
run
an
RN
and
backwards,
and
this
gives
us
an
interesting
ability
to
reason
about
what's
happening
in
the
sequence
and
a
forward
or
backward
direction,
which
can
actually
be
super.
B
Super
powerful
you'll
end
up
finding
a
lot
of
different
information
is
stored
in
a
sequence
when
processing
it
in
Reverse
counterintuitive,
but
it
ends
up
happening
a
lot.
There
was
a
really
interesting
paper
where
they
were
actually
able
to
get
very.
This
is
a
while
back.
They
were
able
to
get
reasonable
performance
in
machine
translation,
processing
sequences
entirely
in
Reverse
and
the
places
where
that
model
produced
errors
were
actually
very
different
than
the
model
that
was
run
forwards
so
just
kind
of
an
interesting
exercise
in
the
orthogonality
of
information.
B
That's
learned,
depending
on
how
your
sequence
is
presented
to
the
model.
So
how
does
an
RNN
like
this
fit
with
our
archetypes?
So
in
the
many-to-one
case,
we
can
kind
of
take
this
R
and
n
this
process
and
things
boom-boom-boom,
and
we
can
take
the
last
hidden
state
and
that
last
hidden
state.
Ideally,
if
our
model
is
well
trained,
will
kind
of
be
a
trained
summary
of
everything.
That's
happened
in
the
model
as
far
so.
The
idea
here
is
we
take
this
last
hidden
state.
B
This
is
a
fixed
dimension
and
this
can
then
be
fed
into
whatever
loss
I'm.
Using
for
my
problem,
I
can
then
kind
of
have
whatever
outputs
I
need
to
reach
the
dimensionality
for
my
target,
but
I
can
kind
of
take
the
last
state
of
an
RNN
and
use
that
directly
and
how
many
to
one
problem
oops
in
the
mini
from
any
case
I,
can
kind
of
take
the
last
hidden
state
of
an
RNN.
B
The
very
very
last
time
step
and
I
can
use
that
as
the
initial
state
for
my
decoder,
so
I
can
take
okay
I
process,
my
input
sequence.
What
do
I
think
about
the
full
sequence
now
feed
that
as
the
initial
state
to
my
decoder.
That's
then
making
reason
reasoning
about
how
to
produce
my
output
C's
in
the.
B
Kind
of
already
have
our
work
done
for
us,
we're
predicting
a
hidden
state
at
every
single
time
step
as
we
proceed,
so
why
don't
we
just
tack
a
layer
on
top
of
the
hidden
state
and
be
able
to
predict
something
about
the
current
time,
step
T.
So
every
single
time
step
we
in
our
state
predict
something
about
it.
Get
our
state
predict
something
about,
so
this
kind
of
naturally
falls
out
of
using
our
Don's.
B
The
interesting
thing
about
the
one
to
many
cases,
we're
essentially
taking
let's
say
we're
dealing
with
image
captioning
you
can
kind
of
take
the
full
image,
have
some
feature
Iser
for
the
image.
Take
a
pre
train
ComNet,
for
example,
and
you
can
stick
that
as
the
initial
hidden
state
in
the
decoder.
So
then
you
can
generate
the
sequence
from
kind
of
an
initial
state
which
is
governed
and
conditioned
by
the
fictive,
so
our
n
ends
for
the
summary
level.
B
We
are
processing
time
steps
one
at
a
time
and
we're
modifying
the
internal
state
to
keep
track
of.
What's
useful,
for
the
task
at
hand
are
n
ends
are
governed
by
this
time,
invariant
transformation
that
is
applied
to
every
element
in
the
sequence
kind
of
regardless
of
if
it
comes
first
or
last,
and
you
can
also
traverse
sequences
backwards,
which
can
be
very
useful.
Oftentimes
you'll
learn
relatively
orthogonal
pieces
of
information
to
what
you
would
learn:
processing
the
sequence
forward.
B
So
talk
about
the
second
major
building
block
of
deep
learning,
career
sequence,
earning
we'll
talk
about
convolutional
neural
nets,
so
usually
calm,
Nets,
I
thought
about
for
image
problems
where
we're
dealing
with
kind
of
2d
patches
of
an
input,
input,
image
that
are
having
filters
applied
to
them.
This
can
be
extended,
pretty
trivially
to
deal
with
sequences.
So
this
is
a
standard
convolution
operator.
B
Essentially
what
we
can
do
is
we
can
form
image
if
you
will,
where
we
take
out
every
vector
and
we
kind
of
line
them
all
up
in
a
row,
that's
kind
of
a
vector
image,
and
then
we
can
take
a
convolution
in
one
dimension
across
that
sequence.
So
pictorially
and
of
what
happens
as
we
take
on
the
diagram
on
the
left.
B
Once
again,
that
word
summarizes
summarizes
the
features
that
are,
in
my
sequence,
so
just
to
kind
of
illustrate
how
we
parameterize
continents,
they're,
usually
parametrized
by
the
stride
and
the
filter
size.
The
start
in
the
filter
size
are
controllable
and
end
up
kind
of
imbuing,
a
lot
of
structural
priors
that
you
might
have
about
your
data
on
the
Left
I've
shown
tried
one
filter
size
three,
but
as
I
increased
my
stride,
every
single
feat,
every
single
kind
of
element
that
comes
out
of
my
convolution
operator
is
has
a
wider
like
kind
of
look
around
window.
B
If
you
will
every
element
that
comes
up
knows
more
about
its
context
than
if
I
have
a
smaller
positive
and
Genesis
stride
parameter
tells
us
how
much
we
want
these
filter
windows
to
overlap
so
I
can
control.
Both
of
these
will
affect
my
dimensionality
and
it'll,
also
kind
of
affect
the
degree
to
which
context
is
carried
and
to
my
window.
So
this
is
kind
of
a
pretty
powerful
thing
to
be
able
to
control
both
of
these,
which
can
affect
kind
of
the
learnability
of
the
problem.
B
So
the
simplified
math
behind
and
of
what
will
happen
in
CNN
filter
is
each
of
those
arrows
that
I
had
in
the
previous
slide
or
a
vector,
and
essentially
my
output
is
just
each
of
these
vectors
dotted
with
the
input
vector,
X,
I,
so
fairly
simple,
just
at
the
individual
and
a
time
step
level.
But
once
you
get
bigger,
you
want
to
use
kind
of
official
convolution
operators
that
will
do
this
and
they'll
scale
much
better.
B
So
interesting
trends
in
cnn's
for
sequence
learning
go
back
about
four
years.
Everyone
was
was
doing
only
RN
ends.
Then
there's
kind
of
the
like
CN
n
trend
was
mostly
kind
of
started
by
there
and
a
few
others
that
kind
of
what
you
used
to
think
you
had
to
do
with
Arnaz.
You
can
actually
do
with
pets
which
is
kind
of
an
interesting
trend
away
from
the
complexity.
I
could
controlling
state
and
all
of
these
things
that
are
inherent
to
Arnett,
and
this
kind
of
raises
a
pretty
big
question
about
most
sequence.
B
Modeling
problems,
which
is:
do
we
actually
need
to
retain
state?
So
we
have
all
these
mechanisms
that
are
reading
and
writing
to
state
reasoning
about
state.
Are
they
actually
necessary
because
commnets
don't
know
about
state,
there's
no
kind
of
notion
of
reading
writing
forgetting
as
we
process
a
sequence,
there's
just
kind
of
a
standard
convolution
operator,
yet
they
perform
well
so
I
think
it
kind
of
raises
an
interesting
question
of
if
RN
ends
our
reasoning
about
state.
Is
that
state
really
necessary?
B
So
an
interesting
thing
about
convolutions
is
that
they
don't
know
about
order
in
particular,
they
don't
know
about.
They
don't
know
about
position,
so
they
are
translation
invariant,
but
that
also
means
that
they
don't
know
kind
of
what
came
before
if
I
apply
a
convolution
operator
to
time,
step
17,
it
doesn't
know
about
time,
step
16,
so
kind
of
a
way
that
people
have
gotten
around
this
and
the
deep
learning
world
is
by
using
what's
called
a
position
encoding
and
a
position
encoding.
B
There's
a
very
simple
idea:
I,
basically
train
and
embedding
as
I'm,
discussing
earlier
kind
of
an
additional
feature.
That
is
an
integer
kind
of
1
through
n.
Where
n
is
my
maximum
sequence,
length
and
I
learn
a
vector
for
every
position.
So,
instead
of
just
having
my
features,
X
I
have
my
features:
X,
plus
a
position
vector
P
and
that
P
vector
is
either
trained
or
fixed,
but
that
that
position,
vector
kind
of
tells
me
where
I
am
in
the
sequence.
I
can
form
a
convolution
operator
where
it
is
of
being
applied
in
this
place.
B
So
the
you're
embedding
that
right.
So
every
single
position
has
an
embedding.
So
there's
no
dynamic
range
I'll
come
up
every
single
vector
every
single
position
has
a
vector
and
that
vector
kind
of
gets
added
to
my
type
set.
So
P
isn't
necessarily
the
number
1
through
50,000
P
is
the
look
the
retrieved
embedding
associated
with
that
position.
B
So
I
have
a
trainable
embedding
per
position
and
the
embeddings
you
can
constrain
to
be
between
negative
1
1,
whatever
you
want,
but
the
the
raw
like
position,
ID
isn't
going
in
yeah,
so
padding
padding
is
a
whole
other
topic
which
we
actually
won't
get
into
today.
But
I
will
just
kind
of
say:
padding
is
not
trivial
for
comments.
How
to
do
it
efficiently
is
also
non-trivial.
B
Oh
yeah,
that's
kind
of
a
there's,
a
highly
architecture
depended
question.
If
I
have
a
computer
architecture
that
confused
the
operators
that
are
required
for
a
GRU
more
efficiently
than
an
Alice
TM
GRU
is
gonna
be
mark,
but
that
depends
on
the
target
architecture
deploying
on
to
full
suite
of
other
things.
First
app
would
be
GRU,
I,
don't
actually
know
so
very
quickly.
How
does
this
fit
with
our
archetypes?
B
So
if
I
need
to
get
a
fixed
length
vector
to
be
able
to
use
this
many-to-one
archetype,
how
do
I
get
a
fixed
length
vector
out
of
my
convolution
outputs?
Unfortunately,
I
need
to
do
a
reduction.
The
idea
is,
since
my
kind
of
elements
that
are
going
into
the
reduction
are
trainable,
the
reduction
should
be
more
informative.
This
is
kind
of
a
hot
topic
in
NLP,
but
a
lot
of
people
who
come
from
a
linguistics
background
hate.
B
In the many-to-many case, you end up using what are called dilated convolutions to obtain the sequential dependence, the fact that time step T can't know about time step T plus one. There's a great GIF from DeepMind, if it'll play, of what a dilated convolution looks like, where you're only looking at the previous time steps when constructing your convolution for the next time step to be decoded.
B
So
this
is
kind
of
what's
used
in
many
to
many,
with
kind
of
a
sequence:
the
sequence
transduction
style
approach
in
the
many
to
many
case,
where
we're
doing
sequence,
labeling
CNN's,
are
a
very,
very
natural,
fit
you're
outputting
a
number
per
time
step.
You
might
need
to
do
some
padding,
but
you're
outputting
kind
of
a
thing
per
step
along
kind
of
my
input
sequence.
B
So
a
summary
of
CNN's
we're
essentially
using
kind
of
filters
without
a
notion
of
state.
We
can
use
something
as
positional
encoding
and
we
have
this
time,
invariant
transformation
that
gets
applied
for
time,
step
of
this
kind
of
convolution
operator,
taking
my
filter,
applying
it
to
my
backers
for
the
many
to
one
use
case,
which
is
unfortunate.
We
do
need
this
ability
to
reduce
the
the
sequence
into
kind
of
a
fixed
length,
vector
so
we'll
be
mean
min
max
summary.
B
You
need
to
have
a
summary
statistic.
That's
kind
of
getting
you
your
fixed
length,
Becker
at
the
end,
so
our
Nan's
cnns.
These
are
kind
of
two
of
the
central
building
blocks
of
kind
of
how
people
build
the
individual
components
inside
these
archetypes
we've
talked
about
the
last
thing.
I
want
to
talk
about.
Pretty
briefly,
is
paying
attention,
so
you've
all
probably
heard
if
you
followed
deep
learning
at
all
about
attention.
B
They
originally
gained
popular
popularity
and
machine
translation
where
you're
kind
of
doing
these
soft
alignments,
where
you're
looking
back
through
and
saying
in
my
source
sentence,
I'm
saying
the
word
dog
in
French
I
need
to
say
she
Anna
do
I
need
to
focus
there.
So
you
have
these
kind
of
very
explicit
kind
of
attention,
components
that
are
looking
at
the
sequence.
B
Self
attention,
which
is
kind
of
a
very
google
thing.
They've
done
a
lot
of
interesting
work
around
transformers
which
we'll
talk
about
next
is
basically
the
idea
that
a
sequence
can
attend
to
itself.
So
as
time
step,
T
I
could
look
at
time.
Steps
1
through
t,
minus
1
and
t
plus
1
through
K
or
an
or
whatever
and
I
can
use
that
to
form
context
about
what
I
should
be
thinking
at
times
at
key
the
idea
behind
self
attention
is
you
don't
need
any
recurrence?
You
don't
need
any
convolutions.
B
You
can
just
kind
of
look
elsewhere
in
the
sequence
and
make
it
like
a
a
judgement
about
what
your
time
step
I
would
ship
up,
which
should
be
if
you're
interested
in
kind
of
this
approach.
This
is
the
classic
paper
has
like
it's
only
two
years
old
now
and
has
like
three
thousand
citations
so
do
take.
A
look
at
attention
is
all
you
need,
because
that
was
kind
of
the
first
kind
of
real
large-scale
application
of
self
attention
so
quickly.
B
it's looking at all the other time steps and making a judgment about itself, and the key idea is that you let sequences do key-value lookups with themselves. As the vector at time step T, I do a key-value lookup across all the other time steps in the sequence (maybe I take a dot product to get a score, something like that), and I construct a summary, once again, for myself, given my relationship with all of the other elements of the sequence.
B
The read heads are an interesting construct: for a model to reason about itself is kind of a discrete thing. An individual vector can only take dot products with other vectors and do one summation with itself, so you need multiple of these to have multi-head attention. It's a jargon-filled term, but the key idea is that it allows the individual elements of your sequence to focus on different things at the same time.
B
So
this
is
the
diagram
of
the
transformer
from
the
paper.
It's
a
mess.
It's
actually
a
pretty
simple
idea,
but
the
diagram
makes
it
look
pretty
complicated
just
to
break
it
down
super
quickly
on
the
left.
You've
got
this
encoder.
All
the
encoder
is
doing.
Is
it's
taking
its
input
sequence,
looking
at
all
the
other
sequence
elements
in
the
sequence
and
then
outputting
a
new
sequence,
where
every
time
stuff
has
just
looked
at
all
of
its
neighbors
and
the
decoder
is
basically
doing
the
same
thing.
B
Why are transformers useful? They model long-term dependencies explicitly, so a model is explicitly able to look back and say: hey, this time step, like 300 time steps ago, was important for my problem. It is near state-of-the-art for pretty much every transduction problem that exists today. It's probably not very useful for many-to-one-style problems, unless you've just got a boatload of compute that you want to throw at something;
B
It
was
designed
for
a
sequence
transmission
at
the
end
of
the
day,
but
it's
a
super
super
powerful
model
so
how
to
pick
building
blocks?
People
will
usually
ask
about
like
what's
the
best
thing
to
choose
for
my
problem.
I
have
no
idea
and
I,
don't
think
anyone
who
tells
you
they
know
usually
has
a
hidden
agenda.
The
secret
is
to
just
test
them.
B
If
you
really
do
want
to
know
what's
best,
there
is
no
free
lunch
at
the
end
of
the
day,
and
certain
problem
characteristics
resonate
well
with
particular
model
architectures
if
you've
got
really
complicated
state
that
you
need
to
reason
about.
An
RNN
might
be
good
if
you
just
need
to
kind
of
pick
an
element
in
the
past
of
your
sequence
and
like
use
that
to
predict
maybe
a
an
attention
based
transformer
is
kind
of
the
right
way
to
go,
but
there
really
is
no
free
lunch.
B
It's
impossible
to
know
what
is
best
so
concluding
remarks
before
we
open
up
the
questions.
Sequence
learning
is
super
flexible,
I,
hope
I
gave
you
a
pretty
high
level
overview
of
what
all
the
building
blocks
are.
That
can
be
fit
together.
What
the
different
archetypes
are.
There
are
four
main
flavors
many-to-one
the
many-to-many,
where
we're
doing
a
sequence
sequence,
the
many-to-many,
where
we're
labeling
every
element
in
the
sequence
and
kind
of
this
one-to-many
unicorn
that
I've
talked
about
that
I'm
looking
for
an
interesting
application.
B
There
are
a
lot
of
building
blocks
to
choose
from
with
more
coming
out
every
day
on
archive
the
base,
fundamental
ones
that
you
might
want
to
know
about.
Our
are
n
ends,
CN,
NS
and
transformers,
and
the
really
important
thing
to
remember
when
you're
building
a
sequence
model
that
there's
no
kind
of
best
approach,
should
I
encode
time
as
a
delta
I
don't
know
try,
it
should
I
use.
B
Idea,
try
them
both.
There
really
is
no
free
lunch.
I
can't
emphasize
that
enough.
A
lot
of
kind
of
blog
posts
that
accompany
papers
very
grandiose
claims
that,
like
this
new
thing,
is
better
than
all
of
the
prior
art
that
came
before
it's
usually
not
true.
They
usually
pick
pretty
poor
baselines,
so
do
kind
of
try
out
everything
for
yourself
and
yeah
remember.
There
is
no
free
lunch
for
sequence
learning,
so
thank
you.