From YouTube: W11 TEC Lab! cadCAD and Stablebaselines for Energy Web
Description
🌱Join the Community🌱
on Discord https://discord.gg/DDr5kYU
or say hello on Telegram http://t.me/CommonsStack
Join the conversation https://forum.tecommons.org/
Follow us on Twitter: http://twitter.com/CommonsStack
Learn more http://tecommons.org/
A
The web3 sustainability loop is an idea put out by Ocean Protocol. Trent McConaghy writes about this. It's an idea about how to manage an economy, inspired by the corporation and the government, the kinds of models we've seen over the past 100 years. It's basically monetary policy: money should be minted and allocated effectively. So it hinges on the effective allocation of money. If money is printed and not efficiently allocated, then that is inflation.
A
That's what causes inflation, and if money is printed and allocated efficiently, then that is growth. So this is the key thing to nail down.
A
Right, it's not necessarily about how many tokens we're making; it's really about how they're distributed, how they're issued, and what they go towards. If they go towards productive means, then that production should be able to produce enough inflows. This is where it's been really nice tracking the 1Hive Honey model, because it's really about inflows, outflows, and production. That's it, that's the recipe.
A
If you can take the tokens that are being minted and make stuff with them, fund the community, fund projects: exactly what we're doing here at the TEC, with the Hatch and the bonding curve and the proposals and this whole process. I was lecturing some people last night about all this stuff, and I realized how many holes there are in my own knowledge. I'm not even sure what 1Hive has with the common pool. Did they launch an augmented bonding curve?
B
No, not at all. They just have issuance. Well, they want to change it to dynamic issuance, where it has awareness of how much is in the funding pool versus the total supply, but right now they just have issuance pumping into the pool. They took out the bonding curve, they only have conviction voting, and then they just print money.
B
Currently they just print money at a steady rate into the funding pool. There's an issuance contract, and at any time someone can push a button, and based on that time and the last time the button was pushed, money will be minted and sent to the fund.
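A minimal sketch of that pattern, assuming a steady linear rate; the class name and the rate are hypothetical stand-ins, not 1Hive's actual contract:

```python
import time

ISSUANCE_PER_SECOND = 0.01  # hypothetical steady rate, tokens per second

class IssuanceContract:
    """Mint tokens proportional to the time elapsed since the last call."""

    def __init__(self):
        self.last_mint_time = time.time()
        self.funding_pool = 0.0

    def mint(self):
        # Anyone can "push the button"; the amount depends only on elapsed time.
        now = time.time()
        elapsed = now - self.last_mint_time
        minted = elapsed * ISSUANCE_PER_SECOND
        self.funding_pool += minted
        self.last_mint_time = now
        return minted
```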
B
It
can't
so
the
the
automated
part
is
the
quantification
of
how
much
can
be
set
at
any
moment,
but
it's,
but
unfortunately
the
evm
requires
an
external
account
to
take
action
on
contracts
for
actions
to
occur.
So
it's
like
the
best
tech
doesn't
always
win.
Let's
just
put
it
that
way.
Ethereum
has
a
lot
of
problems.
B
Man
there's
a
lot
of
really
annoying
technical,
like
hacks
to
make
it
due
to
all
the
cool
stuff
we
wanted
to
do
you
know,
and
one
of
the
big
things
is
you
just
can't:
have
a
contract
just
like
at
a
certain
time,
do
something
you
can
have
a
bot
that
has
an
external
key
and
then
that
bot?
You
know
the
web
2
bot
just
like
triggers.
You
know
it
has
gas
money
and
it
configures
something
you
can
do
that,
but
you
can't
have
the
contract.
A
But yeah, overall Boson's been great, and it's a lot of work to just keep up with everything that's happening, but I think their vision is wonderful and the founders are great. They have good mindsets, and I think there's going to be a lot of collaboration between the TEC and Boson. I see Angela just joined.
A
Nothing happens with the sheet yet; it's just a fun project, really, to see what kind of emojis people enjoy, but then it's available. I think the idea is we could do a lab on it sometime. Well, there's not that much rich data here, other than I guess the social media platforms or operating systems.
A
Maybe
we
could
do
something
with
that,
but
the
idea
was
just
to
generate
a
data
set
of
of
the
tec
lab
that
maybe
one
day
we
can
actually
go
ahead
and
analyze
and
do
some
data
science
on
this
data
set
itself.
A
But nothing's been done yet. It's nice to track this attendance, though, so we could definitely do a collab or something along those lines. Yeah.
A
All right, good. If there are any questions, anyone can just let us know. And the lab worksheet: check the calendar. Today is February 19th.
D
Well, thanks. Or should I, Sean, what do you prefer? Yeah, you can call me that.
D
Yeah, or "ich" in German, okay. Angela, yeah, sure. Today we're going to talk about our reinforcement learning agents in the loop of Energy Web. Last lab I talked about how we got on with the Energy Web submission.
D
As
a
matter
of
fact,
we
are
already
collaborating
on
that,
together
with
lior
from
the
data
guys
and
he
reached
out
to
to
to
join
and
well
basically
join
efforts
in
in
order
to
maybe
get
something
up
and
running
with
energy
web
and
data
down
together
we're
in
the
process
of
setting
up
a
two
page
or
something
which
I
will
share
in
due
time
for
this
lab.
D
I think we need to go a bit again through the intro of the TokenSPICE cadCAD migration, just a short overview of what we discussed last time, and then head over to the reinforcement learning stuff. But maybe, Sean, you can give a quick intro about the reinforcement learning agents you did for us, for the TE Academy, with your Notion page, to get everyone up to speed: what are we talking about when we talk about reinforcement learning agents?
A
I think it's just a Notion thing. Okay, I'm in now.
A
Okay, so in this page from last week (I'll link to that; I'll also just drop it in the TEC labs channel), this was a fun presentation that I gave in the Ocean Protocol study group about the potential of combining reinforcement learning frameworks with token engineering frameworks, particularly TokenSPICE. I gave some background on reinforcement learning; maybe I'll go through this briefly.
A
Briefly,
I
talk
about
this
primary
framework,
slash
library
that
is
available,
it's
produced
by
openai,
and
they
have
a
library
called
openai
gym
which
creates
these
so
back
up
a
little
bit.
What
is
reinforcement
learning
it's
this?
It's
two
standard
data
structures.
We
have
an
agent
and
we
have
an
environment.
A
An
agent
has
an
action
space
that
it
can
operate
in
this
environment
and
when
you're
thinking
about
token
engineering.
This
is
really
useful.
It's
a
nice
matching,
because
the
action
space
are
all
the
contract
functions.
That's
that's!
How,
when
you
think
about
the
evm,
the
state
of
the
the
blockchain
in
the
world,
all
the
things
that
we
can
do
are
essentially
the
functions
that
are
defined
in
the
contracts
or
the
or
the
protocol
itself.
So
we
can
interact
with
the
protocol
itself.
A
We
can
do
things
like
deploy
contracts
or
once
those
contracts
are
deployed,
we
can
interact
with
them
and
so
the
action
space.
I
I
like
to
actually
show
people
this
if
we
go
open
zeppelin,
maybe
here
so
if
you
don't
know
open
zeppelin,
it's
a
standard
audited,
secure
contracts
and
if
we
go
into
contracts
and
token
then
erc20,
then
this
this
is
an
awesome
resource,
and
if
we
take
check
out
the
interface,
then
we
have
all
the
functions
available
on
the
standard
erc20.
A
And
so
if
we
were
making
an
agent
in
a
token
engineering
simulation
where
maybe
there's
only
erc20s
deployed,
then
this
would
actually
be
the
action
space
for
the
agent
to
interact
with
the
contracts,
and
some
of
these
are
sort
of
read-only.
A
You could encode each action with a number, one through six, and you could have a basic agent that randomly samples: a uniform distribution, like rolling a die from one to six. Maybe this agent takes a random action, and you could have thousands of these agents that just continuously take random actions, and then you'd be able to see what all the wallets of these agents do. How many tokens do they have?
A
Are
they
transferring
these
tokens
and
you'd
get
to
see
how
this
system
plays
out?
Maybe
if
every
agent
was
just
acting
randomly
and
that's
usually
your
baseline
when
running
and
deploying
ai
algorithms
is
like
what
would
happen
if
we
had
a
completely
random
policy,
and
once
you
get
that
working,
then
you
can
start
to
actually
use
the
sort
of
reinforcement,
learning,
algorithms
that
will
change
the
policy
over
time
based
on
based
off
feedback
from
the
environment.
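A minimal sketch of that random baseline, assuming the six ERC20 interface functions are encoded as discrete actions (the mapping in the comment is illustrative):

```python
import gym

# Hypothetical encoding: the six ERC20 interface functions as actions 0..5
# (e.g. 0=totalSupply, 1=balanceOf, 2=transfer, 3=allowance, 4=approve, 5=transferFrom).
action_space = gym.spaces.Discrete(6)

def random_policy(observation):
    """Baseline policy: ignore the observation and roll a die."""
    return action_space.sample()

# Thousands of random actions; counts come out roughly uniform.
actions = [random_policy(None) for _ in range(10_000)]
print({a: actions.count(a) for a in range(6)})
```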
A
So that's the action space: each agent has the opportunity to take an action. An action affects the environment; like we said, if you do a transfer, then you're actually changing the state of, in this case, the EVM. And then the agent gets some reward in return. So say the agent takes a series of actions, and the reward function is to maximize the number of tokens that I hold in my wallet: the more tokens you have, the more reward the agent is going to get. And then the agent also gets to observe the environment. This is encoded similarly to the action space: the observation space is going to be maybe a vector or a matrix.
A
The
observation
space
could
simply
be
what's
the
balance
of
my
wallet,
so
we
could
encode
a
very
simple
erc20,
interacting
agent,
where
it's
taking
random
actions
on
an
erc20
token
and
after
every
action
it
takes,
it
gets
to
see
what
what
it's,
what
it's
balance
is.
So
it's
good
and
if
it's
tr
and
if
the
reward
is
to
maximize
its
balance,
then
it's
going
to
learn
quickly
to
stop
transferring
funds.
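A minimal sketch of that toy setup as a custom Gym environment; the token logic here is a stand-in, not a real EVM connection:

```python
import gym
import numpy as np
from gym import spaces

class ERC20Env(gym.Env):
    """Toy environment: the agent holds a token balance and can act on it.
    Action 0 = hold, action 1 = transfer 1 token away. Reward = balance."""

    def __init__(self, start_balance=100.0):
        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Box(low=0.0, high=np.inf,
                                            shape=(1,), dtype=np.float32)
        self.start_balance = start_balance

    def reset(self):
        self.balance = self.start_balance
        self.t = 0
        return np.array([self.balance], dtype=np.float32)

    def step(self, action):
        if action == 1 and self.balance >= 1.0:
            self.balance -= 1.0  # transfer funds away
        self.t += 1
        reward = self.balance   # maximize balance => learn to stop transferring
        done = self.t >= 100
        return np.array([self.balance], dtype=np.float32), reward, done, {}
```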
A
So
this
is
this.
Would
be
a
very
simple
implementation
of
a
reinforcement,
learning
agent
in
a
in
a
in
a
token
engineering
simulation
or
a
token
economy,
and
the
purpose
of
this
presentation
was
to
give
this
background
on
reinforcement,
learning
and
then
also
talk
about
the
opportunities
with
token
spice,
because
it
connects
directly
to
the
ethereum
virtual
machine.
We
can
actually
execute
these.
We
can
do
exactly
what
I
was
just
describing.
A
We
can
define
the
action
space
of
an
agent
to
be
actual
smart
contract
interactions,
and
then
we
can
run
these
these
agents
through
our
simulations
to
test
our
smart
contracts
and
the
token
ecosystem,
and
so
what
I
did
as
an
example
is
I
have
here-
is
all
all
the
agents
defined
in
token
spice
to
test
the
ocean
protocol
ecosystem
and
do
some
tokenomics
analysis
and
verification
and
simulation
to
get
these
sort
of
results
where
we
can
say
well,
what's
the
annual
revenue
of
the
dow
over
time.
A
So
we
have
monthly
ocean,
dow
income
ocean
minted
and
burned,
and
this
is
interesting,
griff
and
I
were
just
talking
about
one
hive
and
the
honey
issuance
before
the
call.
So
I'm
not
sure
exactly
their
issuance
rate,
but
trent
found
that
there's
you
can
do
sort
of
a
linear
issuance.
You
just
print
the
same
amount
of
tokens
over
time
I
say
monthly
and
on
the
opposite
side
of
that,
you
could
do
like
an
exponential
decay
in
your
token
issuance.
So
this
is
what
we
see
in
bitcoin
and
then
trent
found.
A
He
says
by
far
the
optimal
is
a
sort
of
ratcheted
exponential.
So
you
do
this
ratcheting.
He
calls
it
and
I
I
don't.
I
quite
know
the
precise
definition
of
what
ratcheting
is,
but
you
can
clearly
see
it
in
this
graph
here
where
it's
this.
I
think
it's
like
a
pseudo
manual,
token
minting
process
over
time
in
the
early
days
and
then
an
exponential
decay
over
time
and
trent
mentions
that
in
the
simulations
that
they
run
for
ocean
protocol.
A
This
is
by
far
the
best
issuance
policy
and
how
they
modeled
this
in
the
simulator.
Is
that
actually
the
agent
they
have
an
agent?
That
is
a
minting
agent.
So
this
agent
gets
to
decide
how
tokens
are
minted
over
time
and
they
just
had
three
pro
pre-programmed
options.
They
had
the
linear,
the
exponential
decay
and
the
ratcheted,
and
so
they
found
these
results
that
the
ratchet
did
the
best.
But
this
agent
itself
could
be
a
reinforcement,
learning
policy
that
you
could
train
over
simulations.
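A sketch of the three schedule shapes for intuition; the rates and ratchet steps are made-up numbers, not Ocean's actual parameters:

```python
import numpy as np

T = np.arange(120)  # months

# Linear: the same amount minted every month.
linear = np.full_like(T, 100.0, dtype=float)

# Exponential decay: Bitcoin-style, issuance shrinks over time.
exp_decay = 300.0 * np.exp(-0.03 * T)

# Ratcheted: roughly manual step-ups early on, then exponential decay.
ratchet_steps = np.select([T < 12, T < 24, T < 36], [50.0, 150.0, 250.0], default=0.0)
ratcheted = np.where(T < 36, ratchet_steps, 250.0 * np.exp(-0.03 * (T - 36)))

for name, sched in [("linear", linear), ("exp_decay", exp_decay), ("ratcheted", ratcheted)]:
    print(name, "total minted:", round(sched.sum(), 1))
```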
A
Typically in RL we have this idea of an episode, and many RL problems are trained on video games. So one episode is maybe the agent playing through an entire level of a game; at the end it'll have its total reward, and then there will be some credit assignment part of the algorithm that decides, at each step of that game, what the reward was: what the credit attribution was for the action that the agent took in various states at various points in time.
A
So
this
this
is
a
overview
of
this
rl
process
and
then
here
I
have
a
coded
up.
Example.
I
took
one
of
the
agents
and
showcased
how
it
could
how
we
could
modify
this
and
augment
this
to
have
some
reinforcement,
learning
and
I'm
just
wondering
mark.
Do
you
want
to?
How
am
I
doing
do
you
want
to
jump
in
at
this
point
or.
D
Yeah,
sorry
yeah.
This
is
a
good
point
to
to
jump
in
yeah.
As
a
matter
of
fact,
I
looked
at
your
example
first
because
they're
not
not
so
many
examples
about
reinforcement
environments,
besides
the
let's
say,
the
the
standard
gym
environments
like
the
carpo
and
this
kind
of
stuff,
so
it
was
really
a
journey
in
in
how
to
set
up
custom
environments,
because
this
is
what
we
need
for
for
our
for
our
reinforcement.
D
Learning
agents,
as
you
can
see
here
in
this
picture,
sean
is
showing
here
that
he
makes
use
of
a
model
it's
given
into
the
into
the
initialization
function
of
the
of
the
agent
of
the
protocol.
Speculator
agent,
you
give
him
a
model
and
this
model
is
being
trained
beforehand.
I
guess,
and
this
model
will
predict
the
next
action
to
take
and
if
you
scroll
down
a
bit
further,
as
shown,
I.
A
Don't
see
that
I
import
did
I
forget
to
import
the
model,
the
actual
yeah.
D
The model has to be somewhere, but there's no problem there, because we suppose that a trained model is available for this agent. And you see then, in the takeStep function (that's a normal TokenSPICE simulation step function), that you first calculate the reward, you append this reward, and then you do an action, and this action is predicted by the model. So the model itself has a function; it's called predict.
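A rough sketch of that pattern, assuming a TokenSPICE-style agent class and a pre-trained Stable Baselines model passed in at construction; the class and helper names are illustrative, not the exact code on screen:

```python
class RLStakerAgent:
    """Illustrative TokenSPICE-style agent driven by a pre-trained RL model."""

    def __init__(self, name, model):
        self.name = name
        self.model = model      # e.g. a trained stable_baselines PPO2 instance
        self.rewards = []

    def takeStep(self, state):
        # 1. Compute and record the reward for the previous action.
        reward = self._compute_reward(state)
        self.rewards.append(reward)
        # 2. Ask the model which action to take next.
        obs = self._observe(state)
        action, _ = self.model.predict(obs)  # predict() returns (action, hidden_state)
        # 3. Execute the chosen action against the simulated protocol.
        self._act(state, action)

    def _compute_reward(self, state):
        return 0.0          # placeholder: e.g. change in OCEAN staked

    def _observe(self, state):
        return [0.0] * 5    # placeholder: e.g. OCEAN staked per pool

    def _act(self, state, action):
        pass                # placeholder: stake/unstake on a pool
```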
D
This
is
a
function
I
really
needed
to
flesh
out
and
to
see.
Where
does
this
predict
function
is
coming
from?
I
think
this
is
a
nice
opportunity
to
share
my
screen
in
how
my
journey
went.
This
far.
D
Okay,
now
you
should
see
the
code
screen
correct,
yeah,
no.
D
All
right,
yeah
so
and
to
pick
up
where
we
left.
Last
week
we
talked
about
the
the
energy
web
agents,
quick,
refresher,
energy,
web
agents,
tokenized
power,
balancing.
We
need
to
power
the
balance
within
a
microgrid
of
energy
consumers
and
producers.
D
So
we
use
the
energy
web
origin
toolkit
for
that,
and
we
need
some
sort
of
external
interaction
and
also
the
ocean
market
to
to
do
that.
Well,
basically,
if
you,
this
is
a
high
level
overview
of
the
the
synergy
I
I
presumed
for
my
submission.
On
the.
On
the
one
hand,
you
have
a
energy
web
with
where
you
can
register
power
devices,
and
you
have
some
sort
of
marketplace
over
there-
that
you
can
sell
and
buy
power.
D
So
we
can
predict
what
the
outcome
will
be,
what
the
power
production
will
be
in
the
next
hour
or
the
next
day.
Maybe-
and
of
course,
you
can
have
all
sorts
of
staker
agents
around
that
that
are
going
to
predict
how
that
power
producing
device
will
behave
in
the
future
for
the
next
hour
or
the
next
day.
So
it's
quite
a
dynamic
system.
We
talked
about
that
last
week.
D
Okay, that's the business side of it. Now the technical side. Last week I said: okay, we have some Energy Web agents; I fleshed them out in a separate directory. So first of all, in the cadCAD migration, I just copied and pasted all the agents of the TokenSPICE environment into this directory.
D
So this is basically the agent that is really interacting with the Ocean marketplace, and after its action of publishing a data token pool, we have an Energy Web pool agent, which is basically the manager of the data token pool, and we can interact with the pool agent when we want to stake on it. That is done by the Energy Web staker agent.
D
I
tweaked
a
bit
about
his
staking
behavior
at
this.
At
this
moment,
I'm
I
am
going
to
simulate
a
randomized
staker.
That
means
that,
once
in
a
while
it
stakes
on
the
data,
token
pool
and
the
other,
and
in
50
percent
of
the
cases,
the
mistake
and
in
the
other
50
percent
of
the
cases
it
will
unstay
and
it's
there
is
some
sort
of
sophistication
in
the
how
he
will
stake
or
unstake.
D
Let's say it stakes on the pool with the fewest OCEAN staked on it, and the other way around: it will unstake from the pool that has the most OCEAN staked on it. So basically it's a randomized behavior, but we need some sort of dynamics in the simulation. This is what I modeled; you can tweak around and, let's say, alter the magic numbers. Maybe, Sean, you can elaborate a bit on what a magic number is.
A
Yeah,
so
I
mentioned
this
last
time.
This
is
a
nice
aspect.
Of
the
token
spice
framework
is
that
we
leave
these.
So
in
general
software
engineering,
a
magic
number
is
a
bad
thing.
You
don't
want
to
have
any
magic
numbers,
it's
when
you
just
have
a
floating,
int
or
or
float
in
your
code
somewhere,
and
usually
people
don't
know
what
it
means
or
why
it's
there.
A
So
you
want
to
abstract
those
out
into
sort
of
a
configuration
space
where
things
are,
they
have
variable
names
and
you
can
get
all
these
magic
numbers
all
in
one
place
in
this
configuration
file
that
then
gets
loaded
into
your
system.
But
trent
inverted
this
principle.
He
actually
leaves
all
these
magic
numbers
in
the
engine
in
the
system
itself
and
so
to
really
augment
your
experiments.
A
You
need
to
go
through
the
source
code
and
look
find
all
these
magic
numbers
and
get
to
tweak
them
appropriately,
and
this
is
kind
of
unorthodox
from
like
a
software
engineering
practice,
but
from
a
simulations
practice.
A
It
makes
a
lot
of
sense
because
it
reduces
the
abstractions
we're
not
abstracting
everything,
away,
we're
actually
leaving
the
the
knobs
in
the
system
where
they
make
the
most
sense
right
directly
on
the
policies
themselves,
on
the
agents
themselves
on
the
environment
themselves,
and
it
ends
up
working
quite
nicely
because
it
does
reduce
these
like
abstractions,
that
you
have
to
follow
with
all
these
imports
and
sort
of
separating
the
components.
It
leaves
the
components
together
with
their
parameters
that
can
be
tuned
and
at
any
point
you
could
you
could
take
these
out.
A
You
could
find
all
the
magic
numbers
and
you
could
put
them
in
one
configuration
file
and
then
import
them,
and
maybe
I
could
see
that
happening
when
people
are
applying
token
spice
to
very
large
scale
simulations
large-scale
environments,
but
as
of
now
it
works
great.
It
makes
it
really
easy
to
just
modify
the
code
on
the
fly.
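To make the contrast concrete, here is a small, hypothetical illustration of the two styles; these numbers and names are placeholders, not TokenSPICE's actual values:

```python
# Conventional style: magic numbers extracted to a named configuration.
CONFIG = {"stake_probability": 0.5, "stake_fraction": 0.1}

def take_step_configured(agent, config=CONFIG):
    if agent.rng.random() < config["stake_probability"]:
        agent.stake(agent.ocean_balance * config["stake_fraction"])

# TokenSPICE style: the tunable knobs live inline, right where they act.
def take_step_inline(agent):
    if agent.rng.random() < 0.5:                 # tune me: how often to stake
        agent.stake(agent.ocean_balance * 0.1)   # tune me: fraction staked
```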
D
Yeah
so,
as
you
said,
as
you
see
here,
I
put
some
print
statements.
Just
you
know,
for
for
debugging
purposes
or
just
to
to
show
what's
happening.
I
will
start
out
a
a
simulation
now,
so
you
can
see
what's
happening.
D
Simulation
so
when
the
simulation
is
starting
up,
it
starts
to
publish
two
data,
token
pools
to
stake
on,
and
you
see
that
we
have
some
energy
web
publishers
staking
and
unstaking.
Also
and
later
on.
We
can
see
that
energy
web
stakers
are
also
going
to
stake
on
it.
So
we
have
an
energy
web,
staker
david
who
is
going
to
stake
on
that
and
they
yeah
an
energy
webster
in
brazil
and
once
in
a
while
every
three
hours,
new
data,
token
pools
are
going
to
be
published.
D
So
basically,
this
is
the
setup
for
the
reinforcement
learning
agent
to
have
some
sort
of
a
couple
of
data.
Token
pools
to
stake
on
according
to
some
sort
of
strategic
behavior,
of
course,
and
I
will
let
it
run
for
a
while-
you
can
play
with
it
as
I
will.
I
will
commit
this
this
also
to
to
the
repo,
so
he
can
fork
it
or
clone
it,
and
you
can
play
around
with
it
with
it
yourself.
D
So
what
you
need
to
remind
is:
okay,
we
have
some
some
sort
of
pool
staked
on.
We
have
the
energy
red
pool
zero
until
three
four,
maybe,
and
we
have
some
oceans
staked
on
and
you
see
we
had
some
strategic
behavior
of
the
staker
agent,
but
also
the
publisher
agent.
I
forgot
to
mention
that
is
showing
some
sort
of
dynamics
within
the
ocean
being
staked.
So
this
number
is
going
to
to
change
okay,
so
not
to
keep
you
waiting
too
long.
D
We need what I call the Energy Web optimizer agent. He is going to be the reinforcement learning agent that is staking, or is showing some sort of strategic staking behavior. I tweaked around a bit with the action spaces, and here you can already see that it's getting quite complicated. Sean, you mentioned the action space is a tuple; I played around with that, but I actually couldn't get it working, so I resorted to a multi-discrete action space, which is also fine.
D
So
basically,
these
two
things
the
action
space
in
the
observation
space
are
the
thing
things
you
need
to
take
care
of
in
the
first
place.
So
what
do
I
mean
by
a
multi-discrete
space?
We
have
multi-discrete
means
you
have
an
array
of
discrete
spaces,
so
we
have,
for
instance,
three
distinct
action
types,
four,
distinct
energy
web
pools
and
four
distinct,
maybe
staking
person
percentages.
D
We
could
model
this
differently,
but
for
for
the
sake
of
simplicity,
I
kept
it
for
four
discrete
values
and,
as
sean
mentioned,
the
action
types
are
just
plain:
integers
zero,
for
I
do
nothing
one.
I
stake
two
I
unstake
and
where
do
I
stake
or
unstake?
These
are
the
is
the
second
that's
the
second
action
space
and
parameter
one
of
the
five
tools.
Okay,
I
I
presume
we
have
five
tools
to
stake
on
or
to
unstage
from
the
observation
space.
On
the
other
end,
that's
what
I
showed
why
I
showed
this
simulation
is.
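A sketch of that action space in Gym terms; the sizes follow what's described here (3 action types, 5 pools, 4 staking-percentage buckets), but treat them as illustrative:

```python
from gym import spaces

# [action_type, pool_index, stake_percentage_bucket]
# action_type: 0 = do nothing, 1 = stake, 2 = unstake
# pool_index:  which of the 5 Energy Web pools to act on
# percentage:  4 buckets, e.g. 25%, 50%, 75%, 100%
action_space = spaces.MultiDiscrete([3, 5, 4])

sample = action_space.sample()
print(sample)  # e.g. array([1, 3, 2]) -> stake on pool 3 at the 75% bucket
```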
D
We
need
to
take
care
of
how
how
on
what
kind
of
pool
will
the
optimizer
agent
stay,
calm
or
unstake
from,
and
that
is
dependent
on
how
good
or
how
bad
the
pool
is
behaving
or
the
pool
is
being
staked
upon.
So
it
needs
basically
the
amounts
of
ocean
being
staked
on
the
pool
to
get
some
sense
of.
D
I
really
spent
a
lot
of
time
setting
this
up.
As
I
mentioned,
there
are
not
a
lot
of
examples
in
the
stable
baselines
of
let's
say,
custom
environments,
because
you
need
for
this.
This
optimizer
agent.
You
need
some
sort
of
customized
environment
and
you
need
that.
You
need
to
set
that
up.
Quite
by
yourself.
D
So this is the setup. We have an episode length; Sean talked about how you train in episodes. Basically, this kind of stuff is quite standard. I also set a balance, because in order to stake you need to have some OCEAN, of course, or another cryptocurrency. And you need to have a reset function; you can look this up in the Stable Baselines documentation, in the custom environments section. And what I did is:
D
I
also
played
around
with
some
sort
of
a
chaos
monkey.
I
have
some
sort
of
a
randomized
action
that
is,
let's
say,
is
influencing
influencing
the
state,
but
first
of
all,
let's
see
how
we
define
the
state
after
a
step
function,
and
basically
I
say:
well,
I
take
the
action
space.
The
first
parameter
of
the
action
space
is,
of
course,
do
I
do
something
or
not:
do
nothing
stake
or
unstack?
That's
the
first
parameter
the
zero,
the
second
parameter.
The
one
is,
of
course,
which
pool
I
need
to
stake
on.
D
So,
if
you
imagine
that
the
state,
the
observation
space
is
actually
the
state-
and
these
are,
this
is
a
box
space
and
we
say
it
has
a
shape
of
five.
It
means
that
we
have
five
values
in
it
and
they're
continuously
growing
from
zero
to
a
hundred,
but
I
noticed
they're,
not
that's,
not
a
real
threshold
in
this
stable
base
lines
and
what
I
also
did
is
put
a
gradient
in
it.
What
do
I
mean
by
that?
I
need
to
have
some
sort
of
idea
of
okay.
D
How
should
that
strategic,
behavior
look
like
and
how
I
modeled
it
is
to
put
the
to
have
some
sort
of
a
gradient
of
the
observation
state
of
all
the
stake
tools.
So
basically
the
state
or
the
observation
space
is
being
modeled
by
five
pools.
D
Five
energy
wet
pools,
and
we
fill
this
up
with
an
amount
of
10
ocean
to
start
and,
of
course,
if
you
can,
if
you
are
going
to
stake
on
it
or
unstake,
these
values
are
going
to
change
and
the
the
manner
or
the
the
the
let's
say
the
rate
of
change
I
put
in
the
gradient.
So
basically,
if
this
is
this
state
is
growing
from
10
to
11.
Let's
say
we
have
10
ocean
and
in
the
next
step
it's
11
ocean.
D
So
if
the
gradient
is
big,
so
if
there's
a
big
change
in
it
and
that
would
that
could
be,
you
could
see
that
as
a
signal
of
hey,
this
pool
is
behaving
correctly
because
the
staking
is
really
increasing
fast.
D
So
the
reward
is
a
function
of
this
gradient
and
also,
if
you
unstack
from
it,
of
course,
if
the
gradient
is
negative,
so
people
are
taking
their
stake
away.
They're
unstaking,
it's
a
signal
of
hope.
This
pool
is
behaving
badly.
I
need
to
get
out
of
it.
So
basically,
it's
the
other
way
around.
The
reward
is
a
negative
if
you
unstake
on
it,
but
the
gradient
is
positive.
So
then
you
have
a
negative
reward,
but
of
course,
if
the
gradient
is
negative,
you
have
a
positive
reward.
So
you
do
the
right
thing.
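Putting the pieces together, here is a condensed sketch of such an environment; it's a hypothetical reconstruction from the description, not Mark's exact code, and the background pool dynamics are a stand-in:

```python
import gym
import numpy as np
from gym import spaces

class EnergyWebStakingEnv(gym.Env):
    """Five pools; reward follows the gradient (change) in OCEAN staked."""

    def __init__(self, episode_length=20):
        self.action_space = spaces.MultiDiscrete([3, 5, 4])  # type, pool, pct
        self.observation_space = spaces.Box(0.0, 100.0, shape=(5,), dtype=np.float32)
        self.episode_length = episode_length

    def reset(self):
        self.pools = np.full(5, 10.0, dtype=np.float32)  # 10 OCEAN per pool
        self.prev_pools = self.pools.copy()
        self.t = 0
        return self.pools.copy()

    def step(self, action):
        action_type, pool, _pct = action
        self.prev_pools = self.pools.copy()
        # Background dynamics: stand-in for the other stakers and publishers.
        self.pools += np.random.uniform(-1.0, 1.0, size=5).astype(np.float32)
        self.pools = np.clip(self.pools, 0.0, 100.0)

        gradient = self.pools[pool] - self.prev_pools[pool]
        if action_type == 1:      # stake: rewarded when the pool is growing
            reward = float(gradient)
        elif action_type == 2:    # unstake: rewarded when the pool is shrinking
            reward = float(-gradient)
        else:                     # do nothing
            reward = 0.0

        self.t += 1
        done = self.t >= self.episode_length
        return self.pools.copy(), reward, done, {}
```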
A
Yeah, I'm definitely following the intuition, and it seems to make a lot of sense to me, and I'm sure, I mean, this is really a space for creativity in modeling the reward function. But for me, yeah, I would just run this. It's always good to have a baseline.
A
Yeah,
well,
you
notice
this
about
reinforcement,
learning,
there's
a
lot
of
sort
of
rationalizing.
That
happens
because
there
is
in
the
very
middle
of
what's
going
on,
there's
a
black
box
right
at
the
end
of
the
day,
we
have
an
rl
algorithm
and
it
would
takes
a
lot
of
time
and
energy
to
sit
down
and
read
the
paper
behind
it
and
understand
all
the
math
and
and
really
think
about
what's
happening.
A
So
we
have
this
sort
of
black
box
in
the
middle
and
on
one
end
we
put
our
intuition
behind
the
reward
function
and
then,
on
the
other
end,
we
run
the
experiment
and
get
the
results
and
then
there's
this
sort
of
rationalizing
that
happens
between
what
we
expected
given
the
reward
function
and
then
the
actual
results
that
are
output
and
often
there's
surprises
so.
D
Yeah,
so
I'm
so
I
noticed
because
I
played
around
this
with
this
all
day.
I
I
should
mention
that,
because
I
couldn't
get
it
right
and
maybe
that's
maybe
that's
the
whole
idea
of
playing
around
with
it.
You
know
just
I
will
just
start
it
up.
You
can
see
it
here
in
the
control.
What's.
D
Actually, this is the outcome of the training step: a mean reward that's negative, so it doesn't inspire too much confidence. But let me say something about what kind of algorithm I used for training this: I used the PPO2 model. Don't ask me exactly what that means, but I took the PPO algorithm from your Notion page, Sean.
D
So
I
guessed
this
was
a
some
sort
of
a
right
algorithm
to
take,
but
there
are
several,
and
so
you
can
you
can
do
as
you
like.
I
had
some
difficulties
in
saving
and
loading
the
model,
the
complaints
about
vectorizing
environments.
This
is
the
stuff.
I
really
need
to
dig
into
further,
but
then,
when
I
I
did,
I
could
use
the
model
without
saving
it.
Just
you
know
to
go
through
it
and,
let's
see,
do
some
10
steps
of
predictions.
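For reference, the usual Stable Baselines (v2) pattern looks roughly like this; the vectorization complaint typically comes from skipping the DummyVecEnv wrapper. The environment class is the sketch from above, and the save-file name is made up:

```python
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

# Stable Baselines expects a vectorized environment, even for a single env.
env = DummyVecEnv([lambda: EnergyWebStakingEnv()])

model = PPO2("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=25000)

model.save("ppo2_staker")                   # persist the trained policy
model = PPO2.load("ppo2_staker", env=env)   # reload it later

# A few steps of predictions with the trained model:
obs = env.reset()
for _ in range(10):
    action, _states = model.predict(obs)
    obs, reward, done, info = env.step(action)
    print(action, reward)
```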
D
So,
as
you
can
look
here,
this
is
something
you
might
have
noticed
out
of
the
presentation
sean
gave.
This
is
the
actual
prediction
of
the
action.
What
to
take
for
what
observation.
So
the
observations
are
here:
that's
the
state
and
it
takes
some
actions
according
to
the
gradient,
and
if
the
gradient
is
positive
for
some
pool,
he
needs
to
stay
calm,
then
he
will
stay
calm.
So
if
you
see
here
the
accident
space
in
step,
one
means
zero
is
no
action
taken
on
pool
one
with
a
fifty
percent
staking
amount.
D
So basically this is how I put the simulation together, and what you can see is that I try to accumulate the reward and see what's happening with it, and it's really not going anywhere over 10 steps. But maybe we can increase this. And sorry, I would have to train it again, because I didn't flesh that out.
A
Yeah
I've
seen
this
I've
seen
this
before
it's
somewhat
familiar,
there's
like
a
wrapping
and
unwrapping
I'm
just
looking
at
an
old
example
that
I
had
ran
on
a
project,
I'm
just
trying
to
get
it
running
kind
of
in
the
background
here.
A
So
I
don't
mean
to
take
away
too
much
attention,
but
I
do
remember,
there's
some
key
concepts
in
here
that
I
had
dealt
with
so
see.
If
I
can
get
this
running.
Oh
that
might
work.
D
Yeah. I also said: okay, if you are out of balance, if your amount of OCEAN is getting too low, you're also penalized, by subtracting one from the reward. So basically, I did most of the modeling within the reward function, but you can also do something, of course, about the observation space. I have a really simple observation space now: I'm looking at five Energy Web pools, but in reality there could be hundreds of thousands of
D
These
things
and
the
reinforcement
learning
agent
just
need
to
pick
one
in
order
to
maximize
its
reward,
so
you
can
model.
Basically,
these
these
are
the.
These
are
the
tuning
parameters,
I
would
say,
of
your
reinforcement,
learning
agent,
so
the
action
space
and
the
the
observation
space
and
the
reward
function.
Of
course,
I
think
you
need
to
spend
a
lot
of
time
tweaking
around
the
reward
function,
but
you
don't,
as
I
recall,
sean
you,
you
shouldn't
be
too
specific
about
the
reward
function.
D
I
recall
because
otherwise
you
are
going
to
give
the
agents
too
many
clues
and
you
you
need
to
to
have
him
figure
it
out
by
himself
right.
A
Yeah
yeah,
exactly
because
essentially
a
reinforcement,
learning
algorithm
is
a
search
algorithm
through
a
very
large
search
space.
It's
like
a.
We
have
we're
generating
a
function
that
maps
actions
to
no,
it
maps,
observations
to
actions,
and
these
we
want
to.
A
We
want
to
do
a
comprehensive
search
of
that
search,
space
of
that
possible
function
and
if
we
pre-program
in
the
the
reward
function
to
be
too
specific,
then
we're
sort
of
constraining
the
search
of
all
possible
behaviors
of
all
possible
policies
to
something
that
we're
kind
of
imposing
our
own
bias
on,
and
you
know
maybe
sometimes
we
do
want
to
do
that-
to
search
like
a
local
area
or
a
specific
area
of
policies,
but
in
general,
reinforcement
learning
is
going
to
shine
when
it
has
this
sort
of
organic
search
through
diverse
policy,
a
diverse
policy
space.
A
So
it's
it
is
a
yeah
it's.
This
is
one
of
the
key
aspects
that
we
have
at
our
disposal
is
like
how
we
engineer
the
reward
function.
So
there
is
this
balance
of
like
not
being
too
specific,
but
and-
and
so
one
way
you
want
to
might
you
might
want
to
do?
This?
Is
you
could
code
up?
You
could
kind
of
save
this.
You
could
have
a
whole,
maybe
even
a
whole
file,
or
you
could
just
make
various
reward
functions.
A
You
could
have
a
whole
collection
of
them
and
then
in
the
actual
reward
function
here
you
could
just
sort
of
call
one
of
your
collection
of
reward
functions,
so
we
could
only
do
like
this
a
b
kind
of
testing,
but
what
I'm
thinking
about
here
off
the
bat
is
the
training,
because
how
much
training
are
we
running
on
this?
How
like
how
many
episodes,
for
example,.
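A small sketch of that idea: keep a named collection of candidate reward functions and select one per experiment. All of the names and signatures here are hypothetical:

```python
def reward_gradient(gradient, action_type, balance):
    """Follow the gradient: stake into growth, unstake out of decline."""
    if action_type == 1:
        return gradient
    if action_type == 2:
        return -gradient
    return 0.0

def reward_balance(gradient, action_type, balance):
    """Alternative variant: just reward the agent's OCEAN balance."""
    return balance

REWARD_FUNCTIONS = {"gradient": reward_gradient, "balance": reward_balance}

# Inside the environment's step(), pick the variant for this experiment:
reward_fn = REWARD_FUNCTIONS["gradient"]
print(reward_fn(1.5, 1, 10.0))  # -> 1.5
```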
A
Yeah, so that's an interesting point to consider. I think the TokenSPICE modeling comes with this idea that one time step is one day. Was that right, or was it a month?
A
I
think
it's
one
day,
so
we
can
think
we're
running
this
simulation
for
a
hundred
days,
so
we're
theorizing
that
we
have
these
agents
and
they
can
stake
and
unstake
on
these
data
pools
and
we're
giving
them
100
days
to
do
so
and
getting
some
summation
of
rewards
over
those
100
days
and
then
we're
running
them
on.
Was
it
25,
000
episodes.
D
So basically I'm taking the episode length here, in the implementation of the class. Let's see: we have a standard episode length of 20, and the episode is done once we've reached the episode length, of course. So do you just train for 100 time steps each episode? I think 20. No, okay, yeah: if I don't pass this parameter, it gets overridden.
A
So that'll be 20. So what we want to do is run it with 20, and then 100, and then 500, and then 1000, and sort of plot the performance, or maybe the final reward, against how many time steps we take, and then you can start to get an idea of: okay, are more time steps increasing the performance?
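A sketch of that sweep, reusing the training pattern from above; the step counts match the ones suggested here, and evaluation is simplified to a single 20-step rollout:

```python
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

def final_reward_after_training(total_timesteps):
    env = DummyVecEnv([lambda: EnergyWebStakingEnv()])
    model = PPO2("MlpPolicy", env, verbose=0)
    model.learn(total_timesteps=total_timesteps)
    # Evaluate: one episode with the trained policy.
    obs, total = env.reset(), 0.0
    for _ in range(20):
        action, _ = model.predict(obs)
        obs, reward, done, _ = env.step(action)
        total += float(reward)
    return total

for steps in [20, 100, 500, 1000]:
    print(steps, final_reward_after_training(steps))
# Plot total reward vs. steps to see whether more training helps.
```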
C
I was wondering, since you're starting from scratch here, without that many heuristics on the pools, the pool sizes, the number of actors, the number of agents, and so on: is there a best practice in the process of optimization for tracking all the various parameters you're changing and then comparing the results? I think there's a danger of going all over the place and no longer understanding the results, or not being that sharp on interpretations.
A
What's the observation space of the agents? We could run experiment after experiment; we could tweak things; we could go down this rabbit hole of running so many experiments, and then, at the end of the day, how do we track all of our progress? How do we remember that it was important for the agents to track the number of other agents, or the number of pools, or the total staked, all these different ways?
C
Or do it vice versa, I don't know: create a roadmap and, as a best practice, follow your roadmap, at least for 10 variations, until you change your roadmap. Whatever. I don't know if there are any kinds of frameworks or best practices established.
A
Yeah,
I
don't
know
of
any
particular
not
like
a
framework
that
I
can
point
to,
but
I
think
it's,
the
general
scientific
method
and
I
think
what
we've
been
seeing
in
the
the
git
coin.
With
the
get
coin
methodology
with
danilo,
I
really
enjoy
how
he
starts
all
his
sessions
with
the
hackmd
file,
with
everything
laid
out.
A
You
know
the
experiment
is
designed
and
then
coded
and
then
ran
and
then
there's
this
review
process,
so
yeah
sort
of
taking
a
step
back
and
outlining
the
expectations
of
the
observation,
space
and
the
reward
function
and
the
agents
and
then
and
then
running
those
experiments
and
comparing
the
results.
So
I
think
we
do
see
this
when
you
look
at
the
literature
in
reinforcement
learning
there
is
this
process
there's
also
sort
of
a
more
programmatic
way,
which
is
like
a
grid
search.
A
You
know,
so
there
are
algorithms
like
hyper
or
or
meta
algorithms,
that
you
can
throw
on
top
of
your
reinforcement,
learning
algorithm
that
will
strategically
search
the
parameter
space.
You
can
define
how
this
is
a
lot
like
what
cad
cad
does
with
a
b
testing
in
the
system
parameters,
but
instead
of
just
a
b
testing,
there's
algorithms
that
will
actually
search
through
the
hyper
parameter
space
and
there's
often
the
class
of
algorithms
that
get
applied
to
this
is
evolutionary
inspired
by
evolution.
A
So
you
can
set
up
multiple
reinforcement,
learning
experiments
with
differing
hyper
parameters
and
then
combine
the
results
in
a
way
that
is
similar
to
that
of
like
evolutionary
traits.
So
you
can
randomly
you
can
randomly
scramble.
A
So
you
could
take
your
observation
space
or
you
could
have
your
set
of
parameters
that
you
can
can
be
included
in
your
observation
space
and
you
can
do
a
random
selection
over
them
or
maybe
you
could
do
100
random
selections
and
then
you
could
run
all
of
these
reinforcement,
learning,
simulations
and
sort
of
select
for
the
top
10
and
then
take
maybe
an
average
between
their
hyper
parameters.
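A toy sketch of that evolutionary-style loop; the candidate field names and the scoring stub are made up for illustration:

```python
import random

CANDIDATE_OBS_FIELDS = ["ocean_per_pool", "num_agents", "num_pools", "total_staked"]

def random_config(rng):
    # Randomly select which fields go into the observation space,
    # plus a hypothetical numeric hyperparameter.
    fields = [f for f in CANDIDATE_OBS_FIELDS if rng.random() < 0.5]
    fields = fields or CANDIDATE_OBS_FIELDS[:1]  # keep at least one field
    return {"obs_fields": fields, "learning_rate": 10 ** rng.uniform(-5, -3)}

def evaluate(config):
    # Stand-in for "train an RL agent with this config, return final reward".
    return random.random()

rng = random.Random(0)
population = [random_config(rng) for _ in range(100)]
survivors = sorted(population, key=evaluate, reverse=True)[:10]  # keep top 10

# Combine survivors, e.g. average their numeric hyperparameters:
avg_lr = sum(c["learning_rate"] for c in survivors) / len(survivors)
print("next-generation learning rate:", avg_lr)
```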
A
So
there's
this
sort
of
automated
approach
and
then
there's
the
methodology
approach
and
I
think
both
are
very
important
because
of
the
complexity,
the
like
the
nature
of
the
complexity
of
these
problems.
It's
so
important
to
document
along
the
way.
I
I
just
this
is
speaking
from
experience,
how
many
thousands
of
hours,
I've
you
know,
had
the
most
beautiful
experiment
results
and
then,
after
all
night
of
hacking,
I
completely
forget
how
I
set
it
up
or
or
what
I
was
what
you
know
what
I
had
to
begin
with
so
yeah.
A
Sure, yeah, I'll drop a link. Good stuff, Mark. It's awesome that you got this running. You've got this reinforcement learning,
A
Cad
cad
inside
of
token
spice.
This
is
quite
the
the
feat
for
for
a
hacker,
very
inspiring
and
well.
A
Yeah, I'm wondering if we should continue these lab sessions, the series. We can chat offline; there's nothing rigidly blocked. I have a couple of ideas for slots to be filled with the labs, like some more work with 1Hive, but it's pretty flexible.
A
So
I
think
maybe
now
that
we
have
this
foundation
and
that
you
have
here,
maybe
we
could
do
a
whole
lab
session
where
we
actually
just
program
on
this
and
try
different
reward
functions
and
maybe
do
some
of
this
experiment
methodology
that
we've
been
learning
from
the
git
coin
track.
D
Yeah, it would be nice to be an observer.
A
Yeah, yep, that's important too. Okay, everyone, that's the hour. So, Mark, thank you so much for this amazing two-part series. I think you've blown everyone's minds: you've put TokenSPICE inside of cadCAD inside of Stable Baselines. It's quite amazing, and I think it would be awesome to continue this work in the labs, so we'll figure out how we can set that up. Thanks, everyone, for coming, and have a wonderful weekend.