From YouTube: Time series in overview - Sci-fu community meeting 2.1
Description
A presentation and discussion about Time series and R
A
And so, if we want to do Python moving toward Clojure, then I could go first.
B
Perfect, I'm fine with either one. You'd have to wait a couple of minutes for me, because I'm setting up a new project to demo this stuff. You know how that is. So if you want to go ahead, Tim.
A
I won't stop and talk about the technologies as we go through, but we can talk about them later, and I will try to call out when I use a particular technology. So this is Windows 10. I use conda to manage environments; conda is more environment management than Maven. It does do dependency resolution, but it also actually sets up a shim environment, and so I use it not only for Python but also for Java.
A
We talked about Jupyter, but here's Jupyter. I use JupyterLab, which has a few bells and whistles added on. On AWS, a lot of the built-in AWS stuff that they're selling as a machine learning suite is just JupyterLab with some stuff built in, but it's mostly JupyterLab. JupyterLab is Jupyter notebooks with some file management, and you can also start other kernels, which are the runtime environments for Jupyter; you can even start a bash shell and do some other online editing. That doesn't matter as much on your own machine, but when you're using remote infrastructure it comes in really handy to be able to install packages and that kind of stuff. I'm going to talk about the classical idea of stationarity in time series data.
A
So we were talking about what we mean by time series, and there is a precise statistical meaning for all of these terms, so I'll try to cover some of those. Stationarity is a property of time series data such that the mean and variance remain constant over time.
A
The reason that's important is because a lot of the classic assumptions that you can make when you're trying to predict on time series are based on the property of stationarity. So a lot of transformations are geared toward getting a time series to be stationary, so that you can then make further assumptions, and from a subject matter perspective we see that a lot in these data science areas: a lot of the work of the early data transformations is to see to what extent the data has certain statistical properties.
A
In a lot of non-time-series settings we want to see if the data is Gaussian, right, if it's a normal distribution. You'll see that a lot of statistical tests assume a normal distribution, because that's where everybody has done the work. That's where people got their PhDs: they proved certain things under a set of assumptions, and one of those assumptions is that the data is basically distributed like this.
A
So stationarity is another one of those technical terms that has some precise meanings, but let's see if we can get an intuitive understanding of it first. This is a Jupyter notebook. It has a Python 3 kernel running with all these libraries already installed, and you can just hit Shift+Enter and it will execute that cell and then highlight the next cell, so you can see which cells have run and roughly what order they ran in. This uses data that's in a CSV, and it's really simple: just a month and then the number of air passengers for some location. Pandas is the library built in here, and we'll look at it a little more; that's kind of what we're modeling Tablecloth on, and it is modeled on R's data frames.
A
It's going to parse a string with the year and month and turn that into a date object, a datetime object; this is the function, it's called dateparse. Then we can add some new parameters to read_csv to say: as you read this in, take this Month column, which is based on the header of the CSV, and parse dates on it, even though it's not in the default date format, on the column called Month. Actually, before I run that: I haven't run it yet, so we still have the old data, and I'm going to show what the index is. The index on this data is shown in this left-hand column. It's not part of the data; it's something the data frame added to the CSV representation, and currently it's just a range of integers that starts at 0 and goes up to 144, rather than starting at 1.
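A minimal sketch of that loading step, assuming the classic AirPassengers.csv with columns Month and #Passengers (the file and column names are assumptions, and date_parser is deprecated in newer pandas in favor of date_format):

```python
import pandas as pd
from datetime import datetime

# Parse strings like "1949-01" into datetime objects.
dateparse = lambda d: datetime.strptime(d, '%Y-%m')

# Without an index_col, the frame keeps a plain integer RangeIndex;
# Month is parsed into datetime64 values as it is read.
data = pd.read_csv('AirPassengers.csv',
                   parse_dates=['Month'],
                   date_parser=dateparse)

print(data.index)   # RangeIndex(start=0, stop=144, step=1)
print(data.dtypes)  # Month: datetime64[ns], #Passengers: int64
```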
A
And so now we have a data frame that still has the number of passengers as a column; in fact that's the only column. It knows that this Month is now something called the index, and the index is this other object that starts at a 1949 date. When it read in this Month, where before we saw the text was just 1949-01, it made it January 1st, I'm sorry, it made it the first day of that month, and made it a datetime object. And yeah, here it's a datetime64 object, and there are 144 of them.
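A minimal sketch of that re-indexing step, with the same assumed names:

```python
# Make the parsed Month column the index (an in-place mutation,
# a pandas habit discussed later in the talk).
data.set_index('Month', inplace=True)

print(data.columns)  # ['#Passengers'] is now the only column
print(data.index)    # DatetimeIndex(['1949-01-01', ...], length=144)

# Pull the single column out as an indexed Series for what follows.
ts = data['#Passengers']
```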
A
We see that a lot in R too: the reason we're doing a certain transformation is because that's what's expected by some of the functions we're going to use, and these functions all expect an indexed series rather than a data frame itself. So we'd either have to keep doing those transformations as we pass the data around, or, now, we just have this ts series, and we can also look at ts.index, right. It's a Series object, which really is just pandas underneath.
A
So now we can see what's in there. I imported matplotlib, and a lot of times if you go through tutorials you'll see matplotlib.pyplot imported as plt.
A
This is pylab, and it is just a wrapper around matplotlib.pyplot that lets you show plots inline. So you may see in other tutorials something like %matplotlib inline, or whatever it is; in Python notebook speak this is called a magic. Jupyter has its own little mini language where you can tell it to turn on certain features, and so you can tell Jupyter notebooks to show matplotlib plots inline; or, if you import it with this wrapper around it, then that's what it does.
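The two inline-plotting setups being contrasted, as a notebook-cell sketch (the % lines are Jupyter magic syntax, not plain Python):

```python
# Notebook cell: the magic turns on inline rendering of plots,
# alongside the conventional pyplot import.
%matplotlib inline
import matplotlib.pyplot as plt

# Alternative seen in some tutorials: one magic that imports the
# matplotlib and numpy namespaces and enables inline display.
# %pylab inline
```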
A
If this date column were a column rather than the index, then the access to it would look completely different. So that's something else that came from R, and that I personally always have to look up when I'm trying to slice out portions of pandas data frames, either to get certain columns or to get certain rows. That's something I don't think we need to replicate in Clojure: Clojure has built-in syntax for accessing pieces of larger collections that hopefully we can leverage.
D
Yeah, okay, because I'm looking at this chart, and the x-axis and the y-axis.
D
And
I'm
wondering
what
makes
this
different
from
any
other
chart
which
has
numbers
on
some
numbers
on
the
x-axis
and
some
numbers
on
the
y-axis
and.
A
Yeah, and we gave it some, yes; it recognizes those things built in. So if you look at ts.loc, the actual values are only 204, 188, 235. They have no meaning by themselves, but we've assigned meaning by giving it this index, and it knows, because we told it, that this index is a datetime object. So yeah, there's a lot of assumptions built in, especially in this.
B
Yeah, also, I think right now it's just the time on the x-axis that is the only difference, but I think another big difference with time series is the aggregation rules: you're only allowed to aggregate based off of the rows preceding, which we haven't quite gotten to yet, I think.
A
Sure, and if you're doing some kind of signal analysis, then you probably don't use the term seasonality. But the fact that we call this trend of seeing the same shape over and over "seasonality", which has a time connotation, right. So time is even built into what we call this type of cyclic occurrence.
A
So this is one of the seasons, and it looks like people travel more toward the summer in 1954, and they travel less toward the end of the year, with a little peak around the holidays. We can explain those based on our knowledge as subject matter experts in the culture of what happens over the course of the year. And again, we didn't have to tell plot much at all; it inferred a lot just based on this and gave us correct axes, you know, with some assumptions built in.
B
One random quick question. I know you didn't write the pandas API, but do you have any intuition about why you use .loc to do the time slicing, as opposed to just the indices directly? Like, why do you do ts.loc with brackets rather than just ts with brackets?
A
I think it's because Wes McKinney (I'll talk about him a little bit in a minute) was appealing to R users, and the R access for these things has its own specific idiom, again weird and particular to this type of object in R.
A
If you're going to slice it a certain way, then you use certain idioms, and the Python idioms kind of reflect the R way of doing it. So yes, I always get it wrong: I never know when to use loc versus iloc versus a truth-value expression inside the brackets. I always have to go look it up.
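A sketch of the three access styles being contrasted, assuming the ts series from above:

```python
# Label-based: with a DatetimeIndex, date strings act as labels,
# and .loc slices are inclusive of both endpoints.
ts.loc['1949-01-01']          # a single value (112)
ts.loc['1949-01':'1949-06']   # the first six months

# Position-based: integer positions, end-exclusive like Python lists.
ts.iloc[0:6]                  # also the first six months

# Boolean mask: a truth-value expression inside the brackets.
ts[ts > 300]                  # only the months above 300 passengers
```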
A
Here is just another transformation: we're going to take the log. This will take the log of every value in the pandas series, and the shape looks very similar, but it's actually been smoothed out a little bit at the peaks. Let me...
A
So we can get both of these on the screen at the same time. This is the log transformation of this, and you can see that the severity of the distance between this peak and this peak has been smoothed out. This distance here in 1960 is much bigger than the distance here in 1949, and here, in the log, that's been smoothed out. That's another transformation you'll see a lot in data science: transformations that try to smooth out extremes in a reversible way.
A
A lot of the functions just work better; they're more efficient because they don't have to jump around as much, or there are some other assumptions. Again, the subject matter experts, the mathematicians in this area, have decided that they want to smooth out extremes for whatever reason, and so that's just a common transformation. So now we have this ts_log series, because all we did was apply the log to every one of those numbers. So now, what does stationarity mean?
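The log step as code; np.log works elementwise on a pandas Series:

```python
import numpy as np

# Natural log of every value; the result is a new Series with the
# same DatetimeIndex, just on a compressed scale.
ts_log = np.log(ts)
ts_log.plot()
```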
A
That's what Wes McKinney did: the data he was dealing with in Python at AQR was time series data. It was financial data, so there's a lot of time series functionality built into the concept of the pandas data frame, and this will be one example of that; there's just a built-in function for this.
A
This series knows about rolling, and so I give it a period for how far to look back, and I can ask it for the mean. So I'll take the mean of the previous 12 occurrences, and now we can plot all of those.
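A sketch of that rolling computation (the 12-month window matches the talk; plt assumes the earlier matplotlib import, and the colors are assumptions matching the plot described later):

```python
# Each point becomes a statistic over the previous 12 observations;
# the leading window's worth of points has no full window yet.
rolling_mean = ts_log.rolling(12).mean()
rolling_std = ts_log.rolling(12).std()

plt.plot(ts_log, color='blue', label='log series')
plt.plot(rolling_mean, color='red', label='rolling mean')
plt.plot(rolling_std, color='black', label='rolling std')
plt.legend(loc='best')
```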
A
The standard deviation is so far away that it kind of messes up the plot, so I'm just going to comment that out. So now this is the standard deviation... that, of course, is not the mean. I'm sorry, this is the mean; let me fix that up. In Jupyter notebooks you can have code paragraphs or markdown paragraphs, so this one is obviously markdown.
A
We could do the same thing with the ts.
C
Because some processes tend to grow in a way that is proportional to their size. So if my...
C
...budget is bigger, I may tend to earn more, for example, if I'm that kind of business. Some economic processes have this property: the bigger they are, the faster they tend to grow, and then the curve would not be linear, as we see in this plot. But also the variance of increments, the variance of changes, will grow as the process grows. Typically, many economic processes tend to have this property, and after the log transformation they sometimes become more linear and the variances become more stationary, if that makes sense.
A
And we can see that here. If I wanted to write a function, maybe a fourth or fifth degree polynomial that had this many turns in it, it's hard to see that I could write one polynomial that described both what was going on here and what was going on here. But after the log transformation, it's much easier to see that the same equation could potentially govern here.
A
You know, govern 1955 and govern 1959, right. So I could write an equation that could approximate the seasonality for any given season, and that's much clearer to see when we take the log; we wouldn't see that without the log transformation. As Daniel said, part of what we're seeing here is just the natural growth, the economic growth of the airline industry. We're seeing that it really is the same equation; it's just scaled over time.
D
Okay, I think that kind of answered a question that I had, where I wasn't seeing the purpose of this. Is it just to make the graph, the visualization, fit on a chart more easily, or was there an actual purpose?
D
And I think, Tim, you answered my question there: yes, it is to fit it more easily, but it also allows you to discern some patterns that may not be easily visible when you have a huge difference in the scale. In this example it may not be too bad, right, the normal one.
D
Also,
you
can
discern
a
pattern,
but
if
the
scale
was
much
bigger
right
between
the
mints
and
the
max,
then
this
you
know
figuring
out
a
pattern
is,
is
a
little
bit
more
difficult,
and
this
seems
to
allow
that
you
know
a
way
to
understand.
Okay,
they
seem
similarity,
even
though
the
values
are
different.
The
pattern
is
still
the
same.
A
Oh, they tried this and this and this and all of these transformations, and this is the one that made sense. That's really the art of this: no matter how much we try to automate things, some people, with their set of experiences, are just better at figuring out the transformations that are going to reveal the pattern. And fortunately, like Daniel said, log transformations are based on the natural log, and the natural log's base e governs the growth of many, many natural processes, including in economics.
A
So more people came, but the fact that this many people were traveling in 1954 actually makes the number of people traveling in 1960 bigger, right. And it's the same thing with rabbits: if rabbits are having babies down here, then those same rabbits are having babies up here, and their babies are having babies, so they grow exponentially. It turns out that's governed by e, and we can undo the growth that e does by taking the natural log.
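A tiny numeric illustration of that point (the starting value and growth rate are made up):

```python
import numpy as np

t = np.arange(12)
growth = 100.0 * np.exp(0.05 * t)  # exponential growth, 5% per step

# The natural log undoes e: log(growth) = log(100) + 0.05 * t,
# a straight line, so the increments after the log are constant.
print(np.allclose(np.diff(np.log(growth)), 0.05))  # True
```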
D
So Tim, if I understand correctly, what you just said is: for rabbit airlines, where you have rabbits which follow a natural order, this kind of makes sense, but for a people airline, with access to contraception and other measures, this may not make sense, because it doesn't follow nature-driven rules.
A
Other kinds of growth would not do this. But again, that's part of the art and experience that data scientists bring to this. In web data, a lot of times we'll see a gamma distribution, which is like people waiting at a bus stop: a gamma distribution has bigger mass over on the left and tends to fall off.
A
So data scientists spend a lot of time on all of those things, and the thing for us is that's why plotting is really important. In real life, coming to data that you don't understand, you have to try to determine these patterns, and the reason you're looking for a particular distribution is because the tools that you try to use later to predict are predicated on one of these distributions, because that's what the mathematicians who did that work studied. They said...
A
...if you can get this into a gamma distribution, then I can say pretty certainly that this is going to happen. And it's the same thing with stationarity: they're going to say, if you can get this time series to a stationary situation, then you can make these predictions about it with certain confidence values.
C
So that is a step on the way, but we still do not have a complete model of how these data were generated. We could go further and model them more completely, but these are a few steps on the way, if that makes sense, and past models of typical data of this kind help us reason about what kind of transformation would be useful. We all know that people tend to say the pandemic is exponential in certain phases of it.
A
I was going to say: do you want me to wrap up, or hand over to James, or continue until the hour?
B
I'm fine. I want to see the dramatic conclusion of this. We don't have stationarity; are we screwed, what's going to happen, right?
A
A naive thing we could do: we have this trend, and we have these ups and downs around this trend, so we can just subtract. In pandas, subtracting one series from another series just gives you the result as another series. So I can say, for every point here, subtract the mean, and that's going to squish everything down. Let's see what that does.
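That naive detrending as a sketch, continuing the names used above (the variable name is an assumption):

```python
# pandas aligns the two Series on their shared index and subtracts
# elementwise, so every point loses its local 12-month mean.
ts_log_ma_diff = ts_log - rolling_mean

# The leading window's worth of points has no rolling mean yet,
# so those values come out as NaN.
print(ts_log_ma_diff.head(12))
```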
A
And let's see what it actually is. It's just numbers, except now they're squished around this line, and we're missing some. You can see in the red that we're missing some down here, because for the first 12 months we couldn't do anything, and that shows up as not-a-number. In other languages that could be called missing, or it could be represented by a dot, but in pandas it's represented by this...
A
...this NaN thing, not a number. We'll actually drop those later. So now we can plot our moving average diff, the result of that subtraction, and now we look a lot better, right, if we have this goal of getting the mean and variance to remain constant over time. Note that we've zoomed in on the plot.
A
Here's zero, and we're only going up to 0.3 and down to negative 0.2, and we can see that the mean and variance now look much more like they remain constant over time. But stationarity, that idea of remaining constant over time, is actually a technical term that has technical definitions. We can't just look and say, well, that does look better, it looks more stationary, right, just as a value judgment we might bring.
A
So we'll look at the statsmodels Python library. There's a time series analysis package called tsa, statsmodels.tsa. It has a bunch of essentially fancier ways to do what we did with this subtraction. That's why I called it naive: we just took this rolling mean and subtracted it out.
A
Two people, I'm assuming, two PhDs. And if you really want to know what that means, it has to do with the presence of a unit root, and it follows this equation and all of this stuff. And then you have the advanced... where's my link... sorry, the augmented Dickey-Fuller test, which does even more stuff.
A
But that is wrapped up for us in this library, and we can see what it does. We imported the adfuller method from this library, and now we can apply it to our moving average diff. So we're going to see: is this stationary? And we're passing it a parameter. When you hear data science people talk about tuning hyperparameters, this is what they mean: there are actually a bunch of default values for this adfuller test.
A
There are a bunch of different autolag parameters, and you could test over all of those; we're just going to look at one. And it blows up, because we still have these null values, places where there are no values. So you'll see a cleanup called dropna, in R and other languages too. In any situation, dealing with missing values is a big deal: if somebody does not answer a question on a survey, you have to decide what to do with that. Do you drop it? Do you make it a zero?
A
Do you average all the other answers and give it that? Cleaning up data is a really big deal, but in time series you will often see: just drop it, we'll fill it in later, we don't care. So dropna is another built-in time series thing. That's something else we should consider.
A
What kinds of functions should work on these time series things, what should be automatically available? And, as it is everywhere in Python and pandas, you never know whether you are mutating something in place or getting a new one.
A
So yeah, that's just one of the frustrations of Python after coming to Clojure and then back to Python: you're always kind of guessing, or you have to read the docs or do experiments to see whether you're mutating or not. In this one we're going to mutate in place, so we're not going to give it a new name; we're going to do that in place.
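A sketch of the cleanup plus the test (adfuller is the statsmodels function named above; the autolag value is one of its standard options):

```python
from statsmodels.tsa.stattools import adfuller

# Drop the leading NaNs in place; adfuller rejects missing values.
ts_log_ma_diff.dropna(inplace=True)

# Augmented Dickey-Fuller test; autolag is one of the tunable
# (hyper)parameters that otherwise take library defaults.
adf_stat, p_value, used_lags, n_obs, crit_values, icbest = adfuller(
    ts_log_ma_diff, autolag='AIC')

print('ADF statistic:', adf_stat)
print('p-value:', p_value)  # low p-value: reject the "not stationary" null
```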
A
So I can just do that, and this is just a nicer printing of it. And Daniel, you can definitely pick apart my explanation of p-values as a non-statistician, but here's my explanation of p-values: the null hypothesis, what we're trying to disprove, is that this is not stationary. Is it fair to say, nope, that's wrong, this is stationary? That's what the p-value tells us.
A
If it's low, and a lot of times we'll use an arbitrary threshold of 0.05, we'll say: yes, it is fair for me to say that this is stationary, even though there's actually a chance that just randomly, or through some other thing I don't understand, this is non-stationary. I'm going to say, yeah, this is good enough; this test is good enough for me to reject the null hypothesis.
A
So now let's look at some others. This is just going to put together a few of those things: we're going to pass in any time series as a function argument, calculate the rolling mean and standard deviation, and plot them all together, so that when we look at...
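A sketch of such a helper, combining the pieces shown so far; the function name is an assumption, though this pattern is common in pandas time series tutorials:

```python
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller

def test_stationarity(series, window=12):
    """Plot rolling statistics and print augmented Dickey-Fuller results."""
    rolling_mean = series.rolling(window).mean()
    rolling_std = series.rolling(window).std()

    plt.plot(series, color='blue', label='original')
    plt.plot(rolling_mean, color='red', label='rolling mean')
    plt.plot(rolling_std, color='black', label='rolling std')
    plt.legend(loc='best')
    plt.title('Rolling mean and standard deviation')
    plt.show()

    result = adfuller(series.dropna(), autolag='AIC')
    print('ADF statistic:', result[0])
    print('p-value:', result[1])
    print('critical values:', result[4])
```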
A
Right, and whether that's a real observation or not, whether those are related, would be something to look for, because they sure look opposite here. But then when we look here, here's another place where the mean gets lower again, and it doesn't seem like we have that.
A
So I think maybe you could conclude that's random, but there are other tests you could do to say: okay, I have this thing and this thing, are they correlated? Just looking at this part here, I might say they're negatively correlated, right: when this goes down, this goes up. So you can run a test between these two series, look at the r-squared value, and say, yeah...
A
...these are correlated or not correlated. And same thing, you'd get a p-value, right, and if it's low enough you may say, oh, I'm comfortable with saying these are negatively correlated. But just looking at the plot, I think we might say no; even though they look like it here, overall I'm not going to say these are negatively correlated, because look here: this guy goes down and down and down, and this guy still stays about the same. So I'm going to say this little bit is interesting, but...
A
Right, which is why it couldn't start until here: the average of all 12 of these data points is, you know, 0.15 or whatever, the average of the previous 12, so we can't start it until there. And that's why for Dickey-Fuller we had to drop those, because Dickey-Fuller did not like all those NAs. We ended up having to drop those out of our series in order to get these Dickey-Fuller results.
A
This is not stationary, right. This is just a different way of saying what we observe: this trend is not stationary. But when we look at the same one we did before, the moving average diff, then we're going to say: oh, that's more likely to be stationary. And of course I called those naive ways of subtracting things out; there are more sophisticated ones, because seasonality is such an important part of this kind of time series.
A
And the decomposition is again just time series; each piece is just another series, which of course is mostly not-numbers at the beginning and the end. But let's go look at the plot; this is just going to plot all of these seasonality results together. So here's our original, and here's the trend. They probably did not use a rolling 12 months; they did something fancier, I don't know what they did. Now we're back to...
A
With that log transformation, they were able to come up with a consistent seasonality function. Again, I don't know what it is, they didn't tell me, but here's what they did: they took the seasonality out of that data and came up with a consistent function that governs it, and then they also subtracted all of those things out for us. They subtracted out the trend, they subtracted out the seasonality, and they gave us the differences, the residuals, and now we can test the stationarity of that.
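A sketch of that decomposition step with statsmodels, reusing the test_stationarity helper sketched earlier; passing period=12 explicitly is an assumption, since a monthly DatetimeIndex can also let it be inferred:

```python
from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(ts_log, period=12)

# Three series come back: the trend, the repeating seasonal shape,
# and the residuals, which are NaN at both ends.
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

# Test whether what is left over is stationary.
test_stationarity(residual.dropna())
```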
A
Okay, so yeah, this is way zoomed in, right. This one goes from negative 0.2 to 0.3; this one goes from negative 0.1 to 0.005, so this is zoomed in by a factor of ten at least. So even though this doesn't look like a straight line at this granularity, if we zoomed out it would look like a straight line, and we can see on Dickey-Fuller that this is now two times ten to the negative eight.
A
We did it manually, yeah, by this subtraction up here, or wherever we did it; we just subtracted the values. They did the same thing, except it's built into that library: when we get this decomposition object, inside it there's a series called trend, a series called seasonal, and a series called resid, and so we just plotted those.
D
I like this graph, the one with the blue and the red and the black lines, and... tell me if my interpretation is correct, because I think this is part of the learning, right. The blue lines are the original values, not the log values, right?
A
We can see that there really is a single equation that governs the seasonality over this time period, and they were able to subtract that out, which means that now we can make some predictions. We can see the other, fancier methods; we can introspect in Python by doing a dir(), and we can see some fancier ways. Somebody got their PhD for each of these things, and we can read about these processes and see fancier ways of either de-trending, or of taking a stationary set of data and making predictions on it.
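The introspection mentioned here, for completeness:

```python
# List everything the decomposition object exposes: besides trend,
# seasonal, and resid, you can see the rest of its attributes.
print(dir(decomposition))
```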
A
So that's really why we did all that data transformation: the data scientist is now going to be able to say, okay, with this assumption of stationarity that I can now be confident in, I can make predictions into the future and give you a confidence range. I'm pretty sure this is what's going to happen in March of 1962.
D
So the trend here is... I mean, the red line is the mean of the values, and what the standard deviation is showing, with its fluctuations, is how much variation there is between the min and the max, right, off the...
A
So we're not looking at those for fun; we're looking at them because those people said we need to look at mean and variance, and we're using standard deviation as our variance so we can put it on the plot and it makes sense. That's why we're focusing on those: the subject matter experts gave us those measures as something to be interested in.
B
That was excellent, a very good explanation, too, for those of us who are just starting to pick some of this work up, because I never would have guessed stationarity, for instance. I'd be applying things on data where it didn't make sense or anything.
A
Yeah, for me, I treat it like a business problem with its own subject matter experts, and I treat those people as having knowledge that I don't have, even though I might be somewhat familiar with some things. I may understand terminology from a mathematical perspective, but that doesn't mean I understand the context of their jargon and why they're doing certain things. And that was why I started getting into this in the first place: in big data stuff, in, you know, 2012, 2013, people were asking me to make...
A
Well, the PhD work somebody else did, that we're going to leverage, expects the data in this format, and there are probably really, really good reasons for that, but I don't need to understand them. I just need to understand that my subject matter experts were not crazy, that they were building on the people who came before them and not reinventing things, and that the format this data needs is this way. That's why they were asking me to make new tables with literally thousands of columns.
D
So Tim, I think you gave a very good example of where an application-specific information model is needed, based on the core data, going back to the ontology and information model and all that I was mentioning earlier, right.
D
You are taking the same data that was there in a fully normalized form, with the star schema and all that, and then you are creating a denormalized view of it. Now, whether you put those extra columns in the same place, or you put them into a different data store, like a derived data store, and provide that to your statisticians, that is an information model you're serving for the purpose.
A
Right, and my interest in it now is... so back then we had certain massively parallel database systems, and if you didn't want to do Hadoop and write MapReduce yourself, then you used one of those other systems, which essentially meant that in order to take these models that had been built on very small subsets and apply them to a bigger set in a parallel fashion...
A
...I had to rewrite them in SQL, which some models can do and some models can't. You kind of have to learn that subject area to know which of those is possible, which of them is feasible, and to estimate projects: how long it's going to take and whether that's worth the investment in order to be able to run this model on live data in 10 seconds. And sometimes it's worth a really, really big investment. Like at an automotive search company I worked at: if I pop this ad randomly in front of just random site users, only 0.01 percent of people are going to click on it.
A
But if I pre-qualify these people based on their search history, you know, the history I've seen on my site, they're searching for red sedans or whatever, based on five elements of their search history, I can pop this ad and now there's a 0.001 percent chance they're going to click on it. You know, I made it like a million times more likely, out of all these unlikely things, that they're going to click on this ad.
A
Then it's worth the investment of taking weeks or months, at least, to rewrite a model that this data scientist, the subject matter expert, came up with in R or SAS, rewriting it in this massively parallel system just to get that result. Well, now we don't have to rewrite that stuff. We can kind of converge on tools, and that's what we're talking about: how can we get these subject matter experts to use a tool set that we can then apply distributed computing to, that we can apply to a bunch of new situations?
A
We can take the work they've done and much more easily leverage the technology than we could seven or eight years ago.
A
So, you know, I think that's one thing we're doing, and there's the work that Chris and Chris and James have been doing with, you know, libpython, and now there's libjulia. So data scientists... the new cool kids, I don't know if you all know, the cool kids now use Julia; they moved beyond Python, they use Julia. Why is it better?
C
Yeah, so maybe let us think for a moment; we are past the end of this session. Maybe a few of you can stay a bit more. Tim, was there anything else to present?
C
In this case we do have all the functions, and so if anybody wants to meet later this week and hack a Tim-like notebook in Clojure, then let us do it, and we can also try to do it offline and talk about it. But yeah, here, in this case, it was an example of something that is all accessible and really enlightening.
D
Today this session was at 7:00 a.m. my time, so I'm just looking at my calendar for tomorrow, and I am good; I can start an hour earlier, if that works for Daniel.
C
Yeah, so let us call it Sci-fu 2.3 and write it in the stream, and maybe more people will join, and we will be focused on just making it happen, I hope. Okay, so what about today? Today we will have a session in three hours, right, a little less than three hours, and then we will have a demo by James about sequential things happening in the tech.ml.dataset platform.
C
And after that, maybe we can make some plan about what we're building. And if you have time before the next session, then maybe some of you could look for data sets, for some problems that we would like to be able to solve. These could be notebooks on Kaggle that you find clear and enlightening, something that would be helpful to have in our implementation too, or maybe some data problem that you find interesting, and then we will have this collection of things that we need to solve.
C
Does it make sense? Maybe someone has time to search for some data sets, and I will try to prepare some short demo, just something short so that we can see the ergonomics. And I think one thing we didn't discuss about Tim's demo is what an index is.