From YouTube: TEC Co-Lab - Praise System Data Analysis Week 2
Description
🙏 Thank you for watching! Hit 👍 and subscribe 🚩 to support this work
🌱Join the Community🌱
on Discord https://discord.gg/uM4ZWDjNfK
or say hello on Telegram https://t.me/tecommons
Join the conversation https://forum.tecommons.org/
Follow us on Twitter: https://twitter.com/tecmns
Learn more http://tecommons.org/
A
So I can start. I'm a little distracted because I'm not in my normal home situation, so I'm trying to figure things out, and I'll have to eat a little bit during this call, because I haven't had breakfast yet — and also my girlfriend...
A
And my intentions for the call are just to understand the impact hour analysis, and to invite everyone to a params debate after this two-hour session, if anyone has time. It's really fun to debate the parameters and debate the choices that we have to make to initialize our economy. So that's super cool. And really, just to follow the plan for the rest of the groups. I'll pass it to Jeff.
B
Cool, yeah. My intentions for this call: I'm really interested to understand and see what has been done on the data science side so far, and to have, I guess, a focused session that gets these data science whizzes anything they need to continue doing the work — and maybe see what we're, you know, looking for as outcomes of the process. Distractions: as I mentioned, I'm still in the car for another five minutes. And I will pass to Jessica, because she's right here next to me.
D
Hi everybody. Yeah — my intentions are just to, yeah, not get into too much discussion, and just really focus on work. And welcome, Angela — thanks for coming. Hello, hello. So yeah — thanks, everyone, for coming, and I'll pass it on.
B
I didn't hear who it just got passed to, so maybe I'll just take it. My intentions are mainly to listen in — hear what's going on, how people are thinking about things, and how the data scientists approach things. And my distractions are just a million things around the home, so I'll mostly be listening in. And I'll pass to Angela.
C
Hello everyone — I just dropped in. Thanks for having me; thanks, Jess, for inviting me. I haven't been in the sessions for a while. I just looked at the praise quantification at the moment, and I thought: okay, if I can help in any way, I'm happy to do so. Distractions: all things around TE Academy, with many activities at the moment that are also leading in absolutely the right direction for token engineering and TE Commons — it's just, let's say, another area of working. And I pass on to Zeptimus.
B
Thanks, Zeptimus. Yeah — my intention is to learn, and also to follow the process, and to facilitate the dialogue and the expression of everyone. So I am super excited about having this opportunity. And I will pass to Bitter. No distractions.
E
Okay, yeah — my intentions are to see how this praise configuration is going, and maybe try to...
A
Sure. And I'll just say that the normal parameters working group work that usually happens is basically on pause this weekend, so we can do a deep dive into the impact hour analysis. So with that, Kalia, I'll pass it to Negan.
B
Yeah, my intentions are also just listening and learning as much as I can. I was hoping to catch up a bit on the params, because I was off a bit this week and I kind of feel like I fell out of the loop, but...
E
Hey everyone, nice to see so many people here. I think me and Stem have one hour where we're going to work on bonding curve modeling today, and the rest is for praise analysis — or maybe it'll all be praise analysis. But I really want to help people get started — anyone who wants to look at that data and attempt some different things. I know Octopus has put in a lot of work in cleaning it.
E
I think we should go through the notes that we have from last week and help people get inspired — maybe breaking off pieces that they can focus on. And then personally, I want to use a library called D3, which has a force-directed graph, and I want to see if we can get some network structure from the data by putting it into that format.
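D3's force-directed graph expects a JSON object with "nodes" and "links" arrays. As a rough sketch of the reshaping being described — assuming the cleaned praise table has giver/receiver columns named "From" and "To", which may not match the real sheet — the conversion could look like this:

```python
import json
import pandas as pd

# Hypothetical praise table; the real sheet's column names may differ.
praise = pd.DataFrame({
    "From": ["alice", "bob", "alice"],
    "To":   ["bob",   "carol", "carol"],
})

def to_d3_graph(df, src="From", dst="To"):
    """Build the {nodes, links} dict that a D3 force simulation consumes."""
    nodes = [{"id": n} for n in sorted(set(df[src]) | set(df[dst]))]
    # One link per giver/receiver pair, weighted by how often it occurs.
    links = [
        {"source": s, "target": t, "value": int(n)}
        for (s, t), n in df.groupby([src, dst]).size().items()
    ]
    return {"nodes": nodes, "links": links}

graph = to_d3_graph(praise)
print(json.dumps(graph, indent=2))
```

The resulting JSON can then be loaded on the D3 side and passed to a force simulation with a link force over graph.links.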
E
Yeah — and I'll pass it over to Angela, if she hasn't gone yet.
A
Yeah — well, Octopus joined. Hey, Octopus — very curious if you have any intentions or distractions to call out for joining the call.
A
Oh — hey, Dan. I think you might be last: intentions, distractions?
A
Back to you. Cool — well, actually, you know, I feel like I'm not really leading this call; I'm just buying time so that Jeff and Jess can land, but maybe that's...
E
Alright — okay, thanks, Jess. So yeah, just a little background as well: these are the Sunday hack sessions, so they're usually pretty casual; some people drop in and out. We kind of hold it open for four hours, which is a pretty long stretch, so sometimes people can only make it for an hour, or they'll come at different times. It's pretty casual.
E
We try to have it like a laboratory, or a kind of research environment — make the scientists comfortable. So yeah, I'll just get started looking at this praise data, and probably jump into the parameters group. Here I see some HackMD files. Let's see — what's this one?
F
Yeah, that's the param runoff post.
E
Okay — so, just that HackMD file. Do you know — maybe you sent it to me directly, or...?
E
So yeah, I haven't read this yet. I should read this — I'll drop it in the params chat.
E
Okay, so we have the commons build praise analysis, and I'll check out the forks. And we have Dr. Penland here — or Dr. Octopus — and so I'll link the original commons build repository in the chat, so everyone can find it, and from there you can check out the forks at the top right.
E
There have been a couple of forks, so that's interesting. So I think what I would like to do is definitely check out Octopus's work and see what we've got, to get started. Yeah — and if anyone just wants to start hacking, feel free to mute me. Or yeah — if anyone has any questions, or anything.
C
Yeah — can you provide just a couple of sentences? I don't know exactly how long you have been working on this analysis — what's the currently most important question to explore? That would be really helpful.
E
Yeah, good. So we have some notes — I think there are just a couple of things I want to open up. We took a lot of notes last time, directly in a notebook. In terms of how far along we are: I think we've just had one session. One week ago, on Sunday, we all came together and opened up the data, and there are a few background resources that we went through. There are, I think, two forum posts — one by Griff and one by me — that explain the background.
E
These are from around November, so they explain the background on how the impact hours have been collected and maintained, how praise has been converted into impact hours, and how the volunteers who quantify the impact hours get compensated. And then there's the other topic, which is the fact that some people are staff with the Commons Stack or with the Token Engineering Commons.
E
So they have this sort of consistent compensation, and the community has tried to balance that: there's a percentage reduction in quantified impact hours based on someone's salary — or income rate, basically. So we went over all of that last time, and we went through the data set, which I'll see if I can bring up here — the praise quantification. I think I'll link that. This is the data source.
E
And so what I did offline is: I downloaded this worksheet and then just opened up a Jupyter notebook that could load it. And I was attempting to — because it's in so many sheets, it basically had to be concatenated into one dataset — and we got mostly through that last time, other than a few outliers that just weren't playing nice in Jupyter.
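For reference, pandas.read_excel can load every sheet at once by passing sheet_name=None, which returns a dict of DataFrames keyed by sheet name — a convenient starting point for the concatenation step described here. A minimal sketch, using a stand-in dict since the real workbook isn't reproduced here (the "period" column name is an assumption):

```python
import pandas as pd

# pd.read_excel("data/tec_praisequantification.xlsx", sheet_name=None)
# returns {sheet_name: DataFrame}; here we use a stand-in dict instead.
sheets = {
    "#7 Dec 18": pd.DataFrame({"To": ["alice"], "Praise": ["great docs"]}),
    "#8 Jan 1":  pd.DataFrame({"To": ["bob"],   "Praise": ["fixed the bot"]}),
}

frames = []
for name, df in sheets.items():
    df = df.copy()
    df["period"] = name          # remember which round each row came from
    frames.append(df)

# One row per praise across all rounds.
all_praise = pd.concat(frames, ignore_index=True)
print(len(all_praise))
```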
E
And so — oh, I think that's where we left off last time, and we'll see how far Octopus has gotten in terms of consolidating. And yeah — when I tried to start doing some analysis, I noticed this is pretty dirty data: I tried to plot things on a time axis, and there were lots of weird inputs in the dates. So I think there's a good case for data cleaning. And yeah — I think we'll have a good launch point from what Octopus has put together.
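One common way to deal with weird date inputs in pandas is to_datetime with errors="coerce", which maps anything unparseable to NaT instead of crashing the plot, so the bad rows can be inspected or dropped. A small sketch with invented inputs:

```python
import pandas as pd

# Some raw date entries are malformed: blanks, or notes instead of dates.
raw = pd.Series(["2020-12-18", "dupe", None, "2021-01-03"])

# errors="coerce" turns unparseable entries into NaT rather than raising.
dates = pd.to_datetime(raw, errors="coerce")

# The rows that failed to parse can now be inspected before cleaning.
bad_rows = raw[dates.isna()]
print(bad_rows.tolist())
```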
C
Awesome, that's amazing. And as far as I understood — but please correct me if I'm wrong — there are two areas of discussion at the moment. One is to better understand the data: to see if dishing praise, resulting in impact hours, and eventually resulting in the hatch — let's say the starting point for everyone — is what we want to see. This is number one: understanding the existing praise and impact hours. And then the second area, which seemed to me somehow mixed into this, is for the hatch phase.
C
I mean, of course, anything that happened in praise was based on contributions to building TE Commons, but there might be other important token engineering contributions that aren't covered there at all, and the question might be whether those should be reflected in the hatch as well. Is this correct or wrong?
E
I think that it was, you know — in hindsight, it was kind of arbitrary, so I think it could use some analysis. Maybe we could actually quantify, based on how much people were paid, whether that equates to the reduction in wrapped xDAI equivalent.
E
You know — TEC-converted impact hours, TEC tokens. So if we could quantify that, that would be really interesting.
E
So I think that's number three. And actually, Jeff Emmett proposed this concept of an intervention: some sort of filter, basically, or transformation, applied to the impact hours set. And what he showed is that when he simply looked at the data — and I don't know if I have his spreadsheet here — he found that the mean was very far from the mode of the data, and he showed that the difference between those two could be reduced by actually just applying a standard rate.
E
So basically a base rate of impact hours for everyone — kind of a base level underneath the impact hour distribution that would be applied to everyone.
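The intervention idea can be illustrated with toy numbers (invented here, not taken from the actual dataset): in a skewed distribution, the mean sits far above the mode, and adding a flat base rate to everyone shrinks that gap in relative terms.

```python
import pandas as pd

# Toy impact-hour distribution: many small earners, a few large ones.
hours = pd.Series([1, 1, 1, 2, 2, 3, 40, 60])

mean, mode = hours.mean(), hours.mode()[0]

# "Intervention": add a flat base rate of impact hours to every contributor.
base_rate = 10
adjusted = hours + base_rate

# Relative gap between mean and mode, before and after the intervention.
gap_before = abs(mean - mode) / mean
gap_after = abs(adjusted.mean() - adjusted.mode()[0]) / adjusted.mean()
print(gap_before, gap_after)
```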
E
So that was from some interesting results that Jeff found, and I just like that idea — he called it an intervention — that can be applied. And I'm thinking: okay, that's really interesting; what other interventions could we apply to make this maybe a fairer distribution? So I think those are four points. One is general visibility of the data.
E
Let's see what it looks like. Number two is the focus on building the TE Commons — but what about all the effort that was contributed to token engineering prior? How do we see that, and how do we recognize it? Number three was possible imbalances from the discount on people who accept salaries or UBIs from the Commons Stack or the TEC. And number four is this idea of applying filters, or interventions, or transformations to the impact hour distribution.
A
And a number five that has come up a lot is understanding what kind of work gets a lot of reward — what the natural, inherent laws are in the subjective processing of this data during the analysis. So, for instance: are meetings being over-rewarded? Is Twitter being over-rewarded, versus maybe one praise for a large body of work?
A
You know — what's being under-rewarded, what's being over-rewarded. And I think that's more about understanding the system, and maybe not necessarily going to result in interventions, but for the long term: how do we take the lessons learned from this data set and apply them to potential future applications of the system?
D
And yeah, Sean, I posted the link to the HackMD in the params channel — I'm on my laptop now. Are you looking for the top slices? Because we had the list here, double...
B
Oh yeah, yeah — and this is a very sensitive topic, so any possibility of change that we are studying should also be voted on through the community.
B
We have a great feeling about praise, and we feel that, yeah, it can be enhanced and improved — but there are people who still strongly defend the work that's been done, and who are very attentive to the changes that may be made.
B
So one of the things that we said is that it would be good if we have certain boundaries around what changes are possible, so that there is less lack of information and miscommunication around this topic. And another thing was mentioned that day.
B
Another thing that was mentioned that day is that, yeah, maybe impact hours can be given to people who have less, but it would be really bad if we think of taking impact hours away from people who already have them. So it's better to add than to subtract, because that would help respect the sense of ownership that people have built around their contributions. And yeah — it also respects sensitivities around payment, because there were people who formed a certain set of expectations...
B
...according to the payment. So those are some variables that are not so much present in the data set, but that are in the community. This is a very sensitive topic, and yeah — it would be really great to take into account all the points from these discussions.
B
Yeah — sorry, I was just trying to find the mouse to unmute here. Definitely agreed, and I don't think there's anything in the analysis that is inherently saying we are going to change anything at all. That's, you know, the beauty of the scientific process: we can analyze this and then choose to do what we want with the things that we uncover in the analysis process.
B
So yeah — definitely no proposed action just for the sake of the analysis itself. But I imagine once we have more information, it will give us a better idea of how to align the outputs of the system with the goals that we had for it in the first place. And I just have to go rescue my dog — I don't know if you can hear that in the background — but I'll be right back.
A
Yeah, and I'll just add a little bit to that, because every two weeks we do an impact hour assessment and quantification. We just did one yesterday — the results are out; I can forward those — well, they're in the praise quantification. And actually, we had a really interesting discussion about whether or not we should divert from how we normally do things, given this new analysis that has happened, and we actually decided not to. It was yesterday.
A
It was Pam, Eduardo, and Juan quantifying, and they said that, you know, in two weeks from now we have another quantification, and it was better not to divert, because we don't have the results of this analysis yet. So even though we thought that certain things are flawed, it was better just to continue with the status quo for this round — but consider that, once we have some data, we can change.
A
But it was hard, I'll be honest. It was pretty hard, because, you know, the big one for me, at least, is this 85% deduction that's given to people, and it's like...
B
Yeah, and to add to that, since I was there yesterday too: it's true, the question came up about whether we should modify those percentages for one or more people, and Eduardo voiced his sort of opposition to doing it.
B
Inside this one praise quantification, I sort of voiced my agreement with Eduardo that if we were to make that decision, it would make more sense to do it outside of this quantification with just four people, and to have it be more of a community decision. And I guess I also questioned whether we could do enough, in one praise quantification, for the kind of multi-month disparity that may or may not have been caused by this discount. So yeah.
B
I think it'll be interesting for us to see, and then to sort of think about how we can — I don't know — make an adjustment if we want to, or decide not to make an adjustment, between now and the hatch.
E
So yeah, this is super valuable. Any data scientist or any token engineer will learn from experience that hours spent with stakeholders at the beginning have the highest leverage — more than any number of hours spent in the lab hacking on data. Because, as Jeff mentioned: how do we align the outputs of the system with our goals? And if we don't have this kind of dialogue and discussion, then we don't really know what our goals are in the first place. So yeah — there can never be too much.
E
So this is a good document. It's been shared in the params channel, so anyone can actually open it up and hack on it. There's a lot of really rough notes from last time.
E
If someone wanted to, I think there's a case to jump in here and maybe summarize these, because they're just kind of rough bullet points. Maybe they could be left at the end as notes, you know, for historical reasons, but summarized into a few paragraphs — and there might be redundancies, or things that are repeated from last week and this week. So yeah, that's one place that people can jump in, if they feel...
E
...inclined. Let's see — otherwise, I've got the data here and I'm going to start jumping in and taking a look. Okay, yeah: praise to and from — interesting; cleaning praise data. Octopus, is there any particular place that I should start here?
F
And also, just as a style thing — not everyone knows this about me: I teach a lot of classes, so I tend to leave lots of notes, as if I were preparing this for my students. So it may be overly noted, I don't know, but I tried to explain absolutely everything I was trying to do, as text, before I did it.
E
So this is just iterating over — so I just manually listed out each of the worksheets, like you see here. So this is number seven, December 18th.
E
We have this whole worksheet as an xlsx here — tec_praisequantification.xlsx — and then we use .read_excel, and we just pass the location of that. So that's in the data directory: data/tec_praisequantification. And then we just add a few of these.
E
So this is different in each sheet, so I just yank those out. Actually, we call them validators, so I substitute them for v1, v2, and v3, and then I add three more columns at the end of the data.
E
So you can see here the period — we could group by period, for example, to see the variations. And there's so much we can do with this data: you can cross-correlate any of these columns. So we could grab all the unique validators and do a count of who did how many validations, or during which months; we could see the average number of praise, and how it differs over different periods — just a few things as examples.
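A couple of the groupings just mentioned, sketched with a stand-in table — the column names "period" and "v1" are assumptions about the concatenated sheet:

```python
import pandas as pd

# Stand-in for the concatenated praise table.
df = pd.DataFrame({
    "period": ["#7", "#7", "#7", "#8", "#8"],
    "v1":     ["ann", "ann", "ben", "ann", "cyd"],
})

# Praise count per round.
per_period = df.groupby("period").size()
print(per_period)

# How many validations each unique validator performed.
per_validator = df["v1"].value_counts()
print(per_validator)
```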
E
Okay, so now we have receivers. Oh — I think this is left over. I was just doing — what do we have here? Combine date. I haven't combined it, so I think this is left over from last week; I was just giving an example. And I should have documented this, like Octopus was saying — I think I could get better at that. But this is filtering out receivers.
E
I'm going to skip that for now and come back to it. "Cleaning and standardizing some of the data set", by Octopus and Courier.
C
It seems to be pretty generic, right? So this is the comment that was made in Telegram, or in the channel, and the bot adds that to this data set. And I wonder — was there any step (probably there was, and I just don't know) where we have a certain classification or tagging, so that we know, for this exact question: is sending a tweet overvalued, versus spending hours coding undervalued? Do you know what I mean?
F
So I have definitely not gotten that far — the things we were trying to do were even lower-level than that, so yeah.
C
But probably this would be — I just want to throw that into the discussion — this seems to be a step that is required to gain a better understanding.
C
And then our assumptions are: what are the big clusters of contributions, and which contributions do we want to separate from one another? And then try to reverse-engineer and see. Okay — this is a manual step: we would need to go over the contributions, or a subset, to better understand whether certain contributions have been over- or undervalued.
A
I did that for one outlier — someone who was flagged as an outlier — and maybe that could be a good starting point, but it's hard to do cleanly with data at that scale, I can imagine, yeah.
C
Okay, yeah — just an idea, I think.
B
I think, actually, when you group things at that kind of sub-population level of praise, then you can say: okay, we want to value this type of work for 33% of the overall, and this type of work for 20% of the overall, and this type of work for 10% of the overall, and then weight accordingly — without having sort of praise inflation. You know, there might only be 20 praises for one type of work but 300 praises for another type of work, so the 20 shouldn't get lost in the 300.
B
If they were in separate buckets, with those buckets allocated a percentage determined by the community, then I think that would be a better way to do it, at least moving forward.
E
I'm going to repeat — I heard you, Juan, but your audio was cutting in and out a little bit, so I'm going to repeat it. Juan is talking about the actual praise quantification process, which is subjective, but in a way that tries to be very fair, in that there are different praise quantifiers every week, and what they do is —
E
These are people who are embedded in the community, and they take each instance of praise and do their best to actually assign a weight for it. So they know who's coding versus which one is a tweet, and that corresponds to how the praise gets translated into impact hours. And it's somewhat quantified here by the normalization score, which I still don't understand — Griff explained it last time, and I'm hoping that through this data science process it will fully click for me. But —
D
We try to weight it higher, but we don't really see how much that impacts — we don't really see what the output of that is. So maybe if we see some of the overall numbers split out, or — I don't know — there was discussion of kind of chunking things out into types of work. Is it harder to do that now, versus if we had done it with hashtags and things? But I think we could still try to group some things. Or — Sean?
E
I think it's a great project. It's a very standard kind of data science thing, which is really cool: it's called unsupervised learning — it's actually an AI approach. And we've seen some of this in the research: Johan did some great work on this with spatial embeddings of network structures for the Gitcoin research, and this is very similar. It's even more standard than that: spatial embedding of language — natural language processing — or clustering based on language.
E
So there are very, very standard toolkits to do this, and I think it's a really great project that someone might want to pick off. Maybe Johan.
C
So this would mean we would design some — let's say — examples of translating a reason for dishing praise into a cluster, or a box, or a category, and then the algorithm would use these embeddings to translate all the 8,000 — or however many praises — into categories, right?
E
Yeah — so we could do something semi-supervised. So we could just pick out a random 25 of these — or 30, or 50 — and label them, and then, based on that information, an algorithm will — you know, maybe we make 10 categories that we put these 50 into, and then the algorithm already has the buckets, so it just has to go through all the data and put the text into the best buckets that it can find. And yeah, that would be really interesting.
E
And there are even — so that's like a spatial embedding, a language spatial embedding — but there are probably easier ways; there are direct ways, without AI. First of all, we could filter everything that has the word twitter, or tweet, or tweets, or tweeting in it, and see how many we get. Maybe that catches all the tweet ones, or maybe it doesn't, but we could try simple techniques like filtering by keywords. And we could do a word cloud, kind of — we could just see —
E
Okay: what are all the words that show up here? Take out all the standard stop words — "for", "is", "of", "the" — take out all those filler words, and just see what the distribution of topics occurring in these reasons is.
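Both of these simple techniques — keyword filtering and a stop-word-filtered word count — are easy to sketch in pandas, with invented example reasons standing in for the real column:

```python
import re
from collections import Counter
import pandas as pd

# Invented stand-ins for the praise "reason" text column.
reasons = pd.Series([
    "for tweeting about the hatch",
    "great Twitter thread on token engineering",
    "for reviewing the params notebook",
])

# Simple keyword filter: anything mentioning twitter / tweet(s) / tweeting.
is_tweet = reasons.str.contains(r"\btweet|\btwitter", case=False, regex=True)
print(int(is_tweet.sum()))

# Crude topic distribution: count words, dropping a few stop words.
stop = {"for", "the", "on", "about", "a", "of", "is"}
words = Counter(
    w for text in reasons
    for w in re.findall(r"[a-z]+", text.lower())
    if w not in stop
)
print(words.most_common(3))
```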
A
And of course, sense-making afterwards, because there are people who tweet — or who are just retweeting, or whatever — and then there are people who are running the Twitter account, and obviously, you know, it's always going to be that game. So yeah, we just have to be careful: looking for outliers, and understanding maybe also some words that — yeah.
E
Okay, so let's see what Octopus has here. Cleaning and standardizing: in our explorations of the dataset, we noticed some opportunities to standardize and clean various aspects. Oh yeah — I'll just explain what this receivers part is, what I did here. So there are basically two types of praise, and I thought this was super cool — it's just like a Bitcoin block, because in each sheet you have all the praise that's given, and then at the very bottom you have the validators.
E
This is so cool — it's like a human blockchain; human machine learning. Okay, so the validators get paid a certain amount of praise for doing the validation, and so we can analyze that as well, but I think we can analyze that separately and then compare it afterwards. So I actually just filtered all of those out. So in the data set that we have here, we have basically only the transactions from the blocks — not the validators.
E
The receivers data set — and let's see what we have here for cleaning. We check the columns; that's good, we can see all the columns here. Entries involving quantifiers — oh, here we go: there are two types of rows. I think Octopus did the same thing that I had done before. Okay: two types of rows that we see in the data frame now.
E
Cool — oh, I see, so there's a "from quantifiers". Okay. And yeah, I'll be interested to read this, because I'm curious what the general — oh, this is all they get: the "gets paid" filtering. Okay, so there are people who get paid, and then there's a filter applied. So these are the quantifiers, and then we have the non-quantifiers, where the "from" is not a quantifier. So, for those of you who don't know, this tilde is a negation — like a boolean negation. So we say: not a quantifier.
E
And we have — let's see how many we have here: seven thousand six hundred and thirty-one.
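For readers unfamiliar with the tilde, a minimal example of the boolean-negation filter just described (the column names are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "From":       ["alice", "quantifier-bob", "carol"],
    "quantifier": [False, True, False],
})

# "~" is boolean NOT on a pandas mask: keep rows that are NOT quantifiers.
not_quantifiers = df[~df["quantifier"]]
print(len(not_quantifiers))
```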
A
Yeah, I just want to say that the reason it's negative is because that's where we deduct for the people who get paid hourly — who get paid for their hours. That's the deduction. Makes sense. And now, actually, to keep the data set cleaner, in this last round we added another row for when we make adjustments at the end. Because after we quantify every praise — like, okay, three people read every single praise and give it a score — at the end we do some sense-making, to just be like...
A
"Oh, why is this person getting so much, when clearly this other person had more impact and put in more time than that person? Well, let's go and adjust the data to reflect that — whatever went wrong, let's just fix it." You know — without trying to understand what went wrong, we just go in and fix it. But before, we were actually going in and changing random praises and pumping them up, so you might find some outliers.
A
Let's say everyone else got 0.5 impact hours for retweeting, but then there was this one retweet, for this one person, that ended up getting six impact hours. Well, that's because someone went in — doing sense-making at the end — and wanted to pump their score, and just changed it for that one praise. And so in this last data set, to avoid that — to keep the data cleaner — we added a new row that's just adjustments.
A
So now there's another quantifier row, to avoid making the data dirty. You can see it in round 18.
A
No, it's not at the bottom — it's within, because we sort it by people's names. So — I think Vivi: if you go find Vivi's name, just scroll up to the middle and you'll start seeing these pink rows. Yeah, just keep scrolling. So that's one: these are adjustments for people who get paid. And you can keep scrolling, keep scrolling.
A
Adjustments at the end: for Vivi, we thought Vivi didn't get enough praise, so — you can see that, for that adjustment at the end, we actually just manually added another praise, by the quantifiers, that gave Vivi an extra seven hours, right? But even then, because Vivi gets paid, his total impact hours per person — which is column L — he only gets four, despite that, right? And this is the effect of the 85 percent.
A
So if you scroll a little bit more, for Vivi's praise — scroll down a little — there'll be another pink one that says: oh, he gets paid, so he only gets 15% of his impact hours, which ended up subtracting 26 impact hours. So he would have gotten 30 — well, he would have gotten 23 if no adjustments were made — but then at the end we compared everyone and made an adjustment, and effectively each quantifier thought that he needed to be pumped up a little, so they gave him a score.
A
He got an extra seven impact hours from that adjustment, but then after that we took away 85 percent in this other adjustment, because Vivi receives payment.
E
Yeah, that's awesome — I think that's a clear upgrade. I mean, there are two points here. I think it's good that you're now not adjusting the original data; you're augmenting it by adding a row, which can later be filtered out, so the original data can be analyzed and the augmented data can be analyzed.
E
I mean, if we're doing analysis, we could take this extra seven hours and add it to any random praise instance if we wanted to — if we wanted to transform this data exactly into the prior data. Yeah, there's a lot of messiness, but this is a huge experiment, right? And I think it's just really neat that this is all happening. But this is a clear upgrade — adding these extra rows instead of mutating the data, yeah.
A
I mean, I just wanted to call it out, so that when you guys are analyzing the data, you realize this change. There's also a major change in the data from round five to round six, and so now there are effectively three data sets. There's this one round that has one style; then there's the next — round six through round 17 — which has another style that's relatively consistent; and then there are rounds —
A
Well, I guess there are four. There are rounds one through five, which don't have any of these deductions and had a tiered system, and then there's round zero, which was a historic round — it lasted like two months or something, because it was the first round and we just scored all the praise that had accumulated to that point.
E
Okay, that makes sense. So we're taking out those quantifier rows; we're just going to work with the no-quantifier data frame for this session. Our goal is to produce a CSV file that can then be analyzed effectively and further cleaned if needed. Where does praise happen? We first want to understand the various locations where praise happens. We learn the following: 'Unnamed: 3' represents a server.
E
Praise is from Telegram; the other names are Discord servers. ROOM represents a specific room inside the Telegram channel or Discord server, and there are rooms that may have the same name. We made four changes. First, we dropped emojis from ROOM, since they can cause confusion in the algorithms or the programs. Number two, we renamed the 'Unnamed: 3' column as SERVER. Number three.
E
Just a warning here; it looks like that's fine. So we use this strip-emoji function octopus made, a regex to grab the emojis, and we apply that to the column, so we simply strip the emojis from the room. So now we have a new column: you can see here the ROOM column with the praise emoji, and now we have just 'praise'. Number two, renaming the unnamed column as SERVER.
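Purely as an illustration, here is a minimal sketch of the emoji-stripping step described above; the function name, the regex choice (dropping all non-ASCII characters, which covers the praise emoji), and the column names are assumptions, not the actual notebook's code.

```python
import re
import pandas as pd

def strip_emojis(text):
    """Remove non-ASCII characters (emojis) from a string, leave non-strings alone."""
    if not isinstance(text, str):
        return text
    return re.sub(r"[^\x00-\x7F]+", "", text).strip()

# Tiny illustrative frame standing in for the praise data's ROOM column.
df = pd.DataFrame({"ROOM": ["🙏praise", "governance", "🌱soft-gov"]})
df["room_clean"] = df["ROOM"].apply(strip_emojis)
print(df["room_clean"].tolist())  # ['praise', 'governance', 'soft-gov']
```

Dropping all non-ASCII is a blunt filter; a real pass might target only the emoji Unicode ranges so that accented names survive.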
E
D
A
E
No, we should get more interns in the TEC.
D
F
E
F
I'm not sure why that is, either. We can definitely flag that to look at. If you look at how the datetime library gets used and see whether that is a workable process, we could just apply that process and double-check to make sure it's doing the right thing.
E
Okay, yeah. I see you import this dateutil parser, which I haven't used before, so I'm really excited to see this workflow. So, let's see: we turn all the dates into strings rather than date objects or whatever they may be. Some dates are NaN, which means an empty value, and others just have the word 'dupe' instead of a date. We may want to get rid of these in the future, but for now we will keep them, since they may contain important information.
E
Now we write a function that turns all the dates into a standard format. So the normalize-dates function takes a date, and, just like you see here, date strings: we've explicitly made sure that they're all strings, so even the NaNs and the dupes, everything's a string, so we can pass those in here one at a time. This function takes in a string that contains a date and converts it to the format year-month-day; date_string is the string containing the date, and it returns the date in the format year-month-day.
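A minimal sketch of what such a function could look like, assuming the dateutil library mentioned above; the function name and docstring wording are hypothetical, not the notebook's actual code.

```python
from dateutil import parser

def normalize_date(date_string):
    """Convert a string containing a date to the format year-month-day.

    :param date_string: the string containing the date
    :return: the date in the format YYYY-MM-DD
    """
    return parser.parse(date_string).strftime("%Y-%m-%d")

print(normalize_date("Jan 5, 2021"))  # 2021-01-05
print(normalize_date("2021/01/05"))   # 2021-01-05
```

Note that placeholder strings like 'dupe' or 'nan' would make `parser.parse` raise, which matches the missing-catch issue raised later in the call: those rows need to be filtered before parsing.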
E
You can just type it with a question mark, and what you get is the docstring. This is called a docstring, and so whoever is using this function in the future can now just print out exactly what the parameter is that it takes, what it returns, and a little description. And you can also do awesome stuff: there's the Sphinx library, which, if you have a whole library of code, like if we built this praise analysis into a standard praise-analysis library, then we could auto-export all of our docstrings to get free documentation.
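As a small illustration of the docstring point (the function here is a hypothetical stub, not the notebook's):

```python
def normalize_date(date_string):
    """Convert a string containing a date to the format YYYY-MM-DD.

    :param date_string: the string containing the date
    :return: the date in the format YYYY-MM-DD
    """

# In Jupyter/IPython, typing `normalize_date?` shows this docstring;
# plain Python exposes the same text via help() or the __doc__ attribute.
print(normalize_date.__doc__)
```

Sphinx's autodoc extension can harvest docstrings written in this style into rendered documentation pages.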
E
Yeah, it's good for multi-editing, as far as my experience goes, although it can be glitchy. Maybe you're asking something else. I noticed once, me and Johan, I think, were hacking on a HackMD and I started control-Z'ing, and it started undoing all of Johan's changes. That's just really weird.
A
E
A quick sanity check to ensure that we have the correct results after this function has been applied. Awesome. So this is super easy: it just uses parser.parse on the date string and returns it. So we get a parsed date and we return it as a string. This function is strftime, string-from-time, and we give it our standard format, and we apply that function.
E
Okay, I think I ran something out of order, but it's now working. Looks good so far. Yep: name, date, length; the dtype is object.
E
A
E
D
A
It's a product of the inclusion of Discord praise. It became pretty messy without that.
A
B
A
So that's why it's a do-not-touch sheet; there's another sheet that we use to actually add data. The names get added every round during the quantification round, while the three quantifiers are quantifying. Basically, I'm updating that sheet.
E
Awesome, yeah, very resourceful, carrier; good job finding that. And so we read that sheet with the pandas read_excel.
C
E
Awesome, and yep, so we read that in. This is so cool: we create a dictionary that matches each non-null entry (not these NaNs, remember, our empty values in the spreadsheet) to its impact hours and CSTK handle. Sweet. So, just a couple of for loops here: for each row in the data frame, and for each column in that row, we take the name and then we get the canonical name, the ground-truth CSTK handle, and if the CSTK handle is empty, then we grab the name to consider.
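A purely illustrative sketch of that dictionary-building loop; the column names and handles here are made up, not from the actual sheet.

```python
import pandas as pd

# Toy stand-in for the name-mapping sheet: prefer the CSTK handle as the
# canonical name; if it is missing (NaN), fall back to the other handle.
df = pd.DataFrame({
    "CSTK handle": ["griff", None, "sem"],
    "Telegram handle": ["griffgreen", "octopus_tg", None],
})

names = {}
for _, row in df.iterrows():
    canonical = row["CSTK handle"]
    if pd.isna(canonical):
        canonical = row["Telegram handle"]   # fallback identifier
    for col in ["CSTK handle", "Telegram handle"]:
        alias = row[col]
        if pd.notna(alias):
            names[alias] = canonical         # every alias maps to the canonical name

print(names["griffgreen"])  # griff
```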
E
F
Yeah, so the other thing is some people have a Telegram handle but don't have a CSTK handle. So in that case, I just picked whatever they have as their standard identifier, but if they have a CSTK identifier, I'm using that as their standard identifier for now.
A
I would be careful with that. I would check to see, just because what usually happens a lot of the time is there is duplicate data, so there might be, you know, they might not have a CSTK handle.
A
I would only use CSTK handles, basically, long story short, and if you find some errors, please just tell me, because it's very possible. I mean, earlier in the data set you might need to do something like that for people who haven't been around in the last three or four months, but you might also create a duplication.
F
We need to do something with it, so I'm happy to have any suggestions about what else we do with them. It's an easy logic fix.
F
Relatively. There are actually a fair number of people who don't have CSTK handles.
A
F
Yeah, once it's clean and everybody has a CSTK handle, I'll just kill that if-isnan CSTK-handle catch, because if we're sure that everybody has a CSTK handle, then I'll just make it so they automatically get their CSTK handle as their default identifier.
F
A
One thing, for instance, with that kind of stuff: I have you as having a CSTK handle in the data set on row 766. But for impact hours, you will only be able to get impact hours if you are a member of the Trusted Seed, and so if you're not a member of the Trusted Seed, you won't get them; not all impact hours will be distributed.
F
A
I guess the other place to check is the 'Total Impact Hours so far' sheet, and that has all 400-and-something people that have ever received an impact hour, so that's a complete set of names.
A
F
Very good! Well, that's not a hard change; that will just work.
E
Griff, so are you saying we can use this 'Total Impact Hours so far' sheet as a ground truth, I mean, for a tag?
A
Yes. In the end, that is an aggregation of everyone, everything. I mean, before we launch the hatch, there needs to be a major audit to ensure that there aren't duplicate names.
A
I try to do a small little audit every round, but we need to do a major audit to make sure people aren't listed here twice, and really that's not because of any kind of fear.
A
F
That is actually the next part of what I'm trying to do, because there are a bunch of duplicates: issues of capitalization, like ygg had a lot of impact hours where the Y was capitalized, people who capitalized their first and last names but had all lowercase in their praise; there were people with five or six different names.
F
Obviously we want to get this into a format that everyone wants to work with. I'm happy to do whatever the process should be, but this is kind of a proof of concept of how I caught as much as I could, so it didn't go uncaught. I really needed to see it; that's why I wanted to go ahead and show it now.
E
Uncaught: a list from this whole combined users set, if the user is not in our names dictionary here and it's not NaN, not empty in the data frame. So it's saying, okay, cool. So this is like a double check: we have our whole names mapping here, where octopus has done his best to get the CSTK column, and if there's no CSTK column, then he gets the next best thing, whether it's Telegram or Discord, and then he's doing this check to see.
E
Okay: jump back to our original data set and just get all possible names, everything in the From column and everything in the To column, and then check for names from our original data set that are not in our new dictionary that we just created and where the name isn't empty. So basically he says: is this name just some empty name, not a value? If it is, skip it.
E
But if it's a value, and it's not in our big giant name data set that we just created, then we call it an uncaught name, and then we list all the uncaught names. Okay.
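The uncaught-names check described in this turn might look roughly like the following sketch; the column names, the toy frame, and the dictionary contents are hypothetical.

```python
import pandas as pd

# Collect every name that appears in From or To, then flag names that
# are real values but missing from the names-mapping dictionary.
df = pd.DataFrame({"From": ["griff", "YGG"], "To": ["ygg", None]})
names = {"griff": "griff", "ygg": "ygg"}

all_users = set(df["From"]).union(df["To"])
uncaught = [u for u in all_users
            if u not in names and not pd.isna(u)]  # skip empty values
print(uncaught)  # ['YGG']
```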
E
So there are 105 names in our data set that lack canonical representations: we do not have canonical representations for the following. And so now we can definitely go back and cross-reference with this 'Total Impact Hours so far' sheet; I think it'll be a good way to validate what we've done here.
E
Yeah, so we have 106 missing names, which is a lot. Some of the missing ones are fairly obvious fixes. There's inconsistent case with capitalization, so zeptimus q versus Zeptimus capital-Q, ygg versus capital YGG; usernames with punctuation that gets dropped when praising; and dropping the Discord four-digit identifier, amwfund versus amwfund#0979. The ones that are typos of these types can probably be fixed by casting the names to bare-bones representations and checking to see if there is a reasonable key, though this has potential pitfalls.
E
If two unrelated usernames are similar enough. Still, worth a shot: we're going to write a function to clean a name. So clean_name takes a name and generates a new name. It casts that name to a string first of all; it's probably a string already, but let's make sure, since I guess it could be a NaN value. Then for the new name we strip it, we put the whole thing to lowercase, and we substitute out... what are these \d's in regex?
E
Cool, and we return the name. So this is our clean function. Below are some sanity checks that the function works as intended, so yeah: lowercase.
B
E
Names get turned to lowercase, underscores get removed, Discord numbers get removed, and non-alphanumerics get removed, like a time zone.
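A hedged reconstruction of the clean_name idea just described: cast to string, strip, lowercase, drop digits (the Discord #1234 discriminator) and any remaining non-alphanumerics. The exact regexes in the notebook may differ; this is a sketch under those assumptions.

```python
import re

def clean_name(name):
    """Reduce a username to a bare-bones lowercase alphanumeric key."""
    new_name = str(name).strip().lower()
    new_name = re.sub(r"\d", "", new_name)         # drop digits (Discord #0979 etc.)
    new_name = re.sub(r"[^a-z0-9]", "", new_name)  # drop punctuation and underscores
    return new_name

print(clean_name("AmwFund#0979"))  # amwfund
print(clean_name("Zeptimus_Q"))    # zeptimusq
```

As noted above, collapsing names this aggressively risks merging two genuinely different users whose bare-bones forms coincide.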
E
Now we use this function to check if a cleaned version of an unknown name has any good representation, and if so, we borrow it. So: cleaned keys. We have our name data set, which was this dictionary, and we're going to clean the keys and clean the names. I think there's a lot of code in this one.
E
B
E
Yeah, so we clean all these names and we find matches, cool. And how many are still missing? Well, there's still 47.
E
Missing. So there are still names in our original data set that we loaded that aren't in our names dictionary, and there are 47 of those: 47 users that we don't have a true name for. How often do they appear? We create this total-appearances function and we count them, so 'oops fix me' has 11.
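A sketch of what a total-appearances count could look like: how often a name shows up across the From and To columns. The frame and names here are illustrative, not the real data.

```python
import pandas as pd

df = pd.DataFrame({"From": ["oops fix me", "griff", "oops fix me"],
                   "To":   ["ygg", "oops fix me", "griff"]})

# Stack both columns into one series and count each name's occurrences.
counts = pd.concat([df["From"], df["To"]]).value_counts()
print(counts["oops fix me"])  # 3
```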
E
We're unsure if 'oops fix me' is actually a user or an artifact of notes from the recording sessions. In any event, we will just give each of these users their own record in the dictionary. Reality check: every name in the data set has a key now. There's one remaining user with no representation: NaN. We have a NaN that somehow became a user; we're not sure how this happened, but we can live with it. All users have been processed, and now we're ready to use the dictionary to substitute these names into the data frame.
E
Bam, cool, so clean.
E
Yeah, looks awesome. So I guess all of the To and From have been... right, yeah, I think, yeah. So we map all rows in the To and From columns through the names dictionary, and a dictionary is naturally a mapping, which means you give them... so, sorry.
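The substitution step could be sketched as below; the use of `dict.get` with the original value as the default is an assumption on my part (it keeps unmapped entries such as NaN unchanged, whereas plain `.map(dict)` would turn them into NaN).

```python
import pandas as pd

names = {"griffgreen": "griff", "YGG": "ygg", "ygg": "ygg"}
df = pd.DataFrame({"From": ["griffgreen", "YGG"], "To": ["ygg", None]})

# Replace every alias in From/To with its canonical name; leave
# anything not in the dictionary (e.g. missing values) as-is.
for col in ["From", "To"]:
    df[col] = df[col].map(lambda n: names.get(n, n))

print(df["From"].tolist())  # ['griff', 'ygg']
```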
E
We're just getting to the end. Octopus, I'm not sure if you could hear: we got the whole names-mapping dictionary and we're just applying it to the data frame.
E
No, this has gone really smoothly; I think this is amazing, I like this. And the cleaning worked well, it looks like. I just scanned through it, and from a quick scan I don't see any mistakes here that I could pick out.
E
And then there's only one remaining, which is the NaN, and so it seems as though we've successfully resolved all of the usernames, which is incredible. And then we map it back to our dictionary and we have a nice clean data set. With a datetime... did we apply the date? I'm just looking at the date now. Do we have a clean date column?
F
E
It's simple, I think it's simple; carrier is referencing, yeah, he was just referencing a different data frame. So let's see if we convert all of these to our...
F
Okay, there's something you can do with that; there's a way to fix that by having it ignore NAs or something.
B
F
Returning it: in his notebook, he dropped 'dupe' and NA before he used this function, and I copied and pasted stuff and did not drop 'dupe' and NA. So I think that's the issue: the function doesn't have a catch for 'dupe' or NA, I don't think.
E
F
I don't know it right away, but I can find it, because it's in his notebook that I copied and pasted from.
A
That would be helpful. That's just us, the quantifiers, saying: oh, you know what, this was already dish-praised for, and two people might dish praise for the same thing. And so at some praise sessions we remove those; at other praise sessions we decide, in an informal way, sort of, to just be like: well, if they get praised twice, that's like extra praise, you know.
A
They deserve extra points for that, and then we'll just give it a lower score. And some praise quantifications remove them completely, and we'd say: oh, this is a duplicate, we already gave praise for that action, so we're just going to put 'dupe' in here to signify that we're not going to quantify it. So all the dupes should be getting zero impact hours.
F
More, but yeah, I just wanted to put together kind of a sketch of a direction towards cleaning the qualitative data, the non-numerical data, and getting feedback, like from Chris, that we should enforce CSTK handles being the default ID, and if somebody doesn't have a CSTK handle, then that's a data-cleaning issue, not a coding issue; that's really helpful. So whatever feedback people have about how we should take this in the future... I really wanted Sean to look at it, because he's a better data scientist than I am by a lot.
E
Yeah, I think it's great that we have this opportunity to tune in through this process and see the cleaning step, especially with people like Griff on the line, who has so much experience through the quant process and has the domain expertise, you know, that can shed light on a lot of this stuff.
E
So I feel like this is a really valuable session, and it looks to me like the data set at the end is great: it's primed for people to start jumping in and doing analysis, which is a good jumping-off point. We have people here; I think we're going to have a session this week, maybe on Tuesday, and so when we start the next session, we can basically say: hey everyone, here's the cleaned data set, and here's all the analysis that the community is looking for.
E
F
A
E
Okay, yeah, so there were just two strings, called 'dupe' and 'None', so I removed those. It looks like everything's working nicely with this date formatting. So we now have 7625.
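The removal of those two placeholder strings might be sketched as follows; the column name and values are illustrative.

```python
import pandas as pd

# Drop rows whose date field holds the literal placeholders 'dupe' or
# 'None' so the remaining values can be parsed as real dates.
df = pd.DataFrame({"date": ["2021-01-05", "dupe", "None", "2021-02-01"]})
df = df[~df["date"].isin(["dupe", "None"])]
print(len(df))  # 2
```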
E
Okay, so at the very end we drop the cleaned data onto disk as a CSV. So I'm actually just going to restart this and run everything, to see if we haven't broken anything.
E
git commit -m "small updates to..."
B
E
And, you know, this isn't a big deal, because it'll have all the updates, but I really like to get my git status to be blue, so I'm going to add that again, and I should be able to do this: git commit dash dash amend, so this appends the staged change to the previous commit.
E
F
E
F
A
F
E
So maybe I'll just take a moment to pause here and open up the...
B
E
We have the voice of the community captured here in this document, in this HackMD file, and the best outcome, the best-case scenario, is: here are the questions, right? So if we could address all of these questions; if we basically got volunteers, say we had seven people that wanted to jump into analysis, and each one took one of these questions and saw if they could address it through the data.
E
I think that would be the ideal outcome. I don't know if we have seven people, you know, that we can rally who are comfortable diving into the data and trying to answer some of these questions. We very well might; we might have three or five. But I think that's kind of an ideal outcome, if we distribute this. This is the voice of the community here, and if we can address the majority of the points here through data analysis, I think that's an ideal outcome.
E
I get kind of obsessed with things, and I have this vision of something I want to create, and my ideal outcome is that seven data scientists volunteer and each pick one of these questions, and then I just want to go make a force-directed graph. That's all, a force-directed graph. I'll show you what I mean by that.
E
So this is a force-directed graph. We can get a network of our community where each node is a person. This is actually Les Misérables, and each edge is an interaction between characters that happens in the story; I forget what the colors represent. But we can do this for the TEC: each color could be a working group, or something. I don't know exactly how I would do this, but I'm so excited to see the network topology of the TEC and to get to play with it.
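To make the force-directed-graph idea concrete, here is a minimal self-contained sketch in plain Python (libraries like networkx or D3 do this far better in practice): nodes are community members, edges are praise given, and a few iterations of pairwise repulsion plus spring attraction along edges spread the nodes into a layout. The names, edge list, and force constants are all made up for illustration.

```python
import random

edges = [("griff", "ygg"), ("griff", "sem"), ("sem", "ygg")]
nodes = sorted({n for e in edges for n in e})

random.seed(42)
pos = {n: [random.random(), random.random()] for n in nodes}

for _ in range(50):
    forces = {n: [0.0, 0.0] for n in nodes}
    # repulsion between every pair of nodes
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            dist2 = dx * dx + dy * dy + 1e-9
            forces[a][0] += 0.01 * dx / dist2
            forces[a][1] += 0.01 * dy / dist2
            forces[b][0] -= 0.01 * dx / dist2
            forces[b][1] -= 0.01 * dy / dist2
    # spring attraction along praise edges
    for a, b in edges:
        dx = pos[b][0] - pos[a][0]
        dy = pos[b][1] - pos[a][1]
        forces[a][0] += 0.05 * dx
        forces[a][1] += 0.05 * dy
        forces[b][0] -= 0.05 * dx
        forces[b][1] -= 0.05 * dy
    for n in nodes:
        pos[n][0] += forces[n][0]
        pos[n][1] += forces[n][1]

print(sorted(pos))  # ['griff', 'sem', 'ygg']
```

Edge weights (how often one member praised another) could feed into the spring strength, and node colors could encode working groups, as suggested above.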
D
E
C
I think the next step, or one of the next steps, and if I can support it I'd be happy to, is to create a list of categories, so that instead of 7,600 reasons to dish praise we have 10 to 20 categories. Maybe, Johan, you can support there: translate the categories, or again translate this data set to, okay, we have 12 categories, and then we could take next steps from there, to see not just what the reason was but then also to see.
C
How this then is translated into impact hours. So, for example, tweets get a lot of impact hours and another category is getting way less, and then step by step we have a better understanding of what happened over the last couple of months, and whether there should be an intervention or not.
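One simple way to start collapsing the free-text praise reasons into categories is keyword matching, as a first pass before anything fancier; the keywords and category names below are entirely made up for illustration.

```python
# Map keyword fragments to hypothetical category labels.
categories = {
    "twitter": "outreach",
    "call": "meetings",
    "quantif": "praise-ops",
}

def categorize(reason):
    """Return the first matching category for a praise reason, else 'other'."""
    reason = reason.lower()
    for keyword, category in categories.items():
        if keyword in reason:
            return category
    return "other"

print(categorize("for the great Twitter thread"))  # outreach
```

With labels like these in place, a groupby over category versus impact hours would give the per-category comparison described above.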
D
Perhaps. That's a big project; I feel like that's one group. So we could do that on Tuesday: Sean does a network graph, maybe we can see what octopus could be interested in, and then anyone else, we could look at these other questions and see if people are up for taking one. I'm happy to work with you on categories.
B
Yeah, same. Also, the points that Angela raised, I think, are very valid in constructing a well-designed value flow.
C
E
Hello? Maybe he's AFK. I'm actually in the same place; I can go check on him physically. But I would happily work with Johan on that.
E
And yeah, so is there a session on Tuesday? That's...
D
Well, we were discussing the same time on Tuesday. I don't know, Angela, if that still works for you, and Yuan, and I don't know who else we can rally.
A
Yeah, I think the categories are huge, especially for the non-Python rockstars. But then also, I think it'll be really easy to pull out any negative numbers from here and just see the impact of the adjustment hours, and what you were saying before, ygg, about how much governance people will be able to buy with their payment versus the impact hours that were taken, and doing an analysis of how that is.
A
That would be really interesting, and if anyone wants it, I would be very excited to support that.
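Pulling out the negative adjustments suggested above is a one-line filter once the data is clean; the column name "IH per Praise" and the numbers here are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"IH per Praise": [2.5, -26.0, 7.0, -0.85]})

# Keep only rows where the impact-hour adjustment is negative.
negatives = df[df["IH per Praise"] < 0]
print(len(negatives))  # 2
```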
E
B
Hey, Sean, was there a question in particular for me earlier? Sorry, I was a bit away from the computer.
B
It
but
I'm
at
my
own
up
top
middle.
B
Just
didn't
know
if
there
was
a
question
for
me
earlier.
I
thought
I
heard
you
calling
my
name,
but
I
was
a
bit
away
from
the
computer
to
respond.
B
E
Yeah, yeah, I'm really happy; it's amazing how things work out. I mean, if we didn't have octopus, where would we be right now? We'd be just rolling around in messy data. So I just think it's awesome how, in these decentralized communities, the timing always works out; you know, people always step up at the right moment to make things happen.
B
E
Yeah, maybe we just take a few more minutes, till the hour, to continue the discussions and allow anyone to bring up anything that they want to point out, and then maybe we do a couple hours if we can get people started. I'm going to check in with Johan here. Johan, are you on here? Anyway, I'm going to check in with him, and then, yeah.
E
If
I
think,
we've
identified,
maybe
some
people
want
to
hack
on
the
hackmd
file
and
just
continue
to
flesh
out
some
of
the
discussion
that
has
come
up
so
far.
E
Maybe
some
people
want
to
get
started
on
the
category
and
maybe
some
people
want
to
get
started
on
the
balancing
negative
scores
from
payments
first,
how
many
impact
hour
tokens
that
could
purchase?
I
think
these
are
two
really
interesting
areas
that
are
both.
E
They
both
have
some
low-hanging
fruit,
like.
I
think
it's
not
too
hard
to
get
started
and
see
some
results,
and
but
they
also
both
have
the
ability
to
really
like
be
fleshed
out
deeply
and
be
a
very,
very
deep
data
science
project.
So
I
think
that's
good
easy
to
get
started,
but
a
long
ways
to
go
to
in
terms
of
insights
and
I'm
happy
to
help
anyone
who
is
working
on
either
of
those
and
I'm
happy
to
jump
in
and
make
contributions,
and
I
also
want
to
see
if
I
can
get
this
fdg
going.
B
Say, I changed my Discord handle a few times, and I'm wondering if there's an easy way for me to check that list and verify that I'm not one of the nameless, so to speak.
A
I definitely got you, Dan; you were a very prolific name changer, but I'm pretty sure that we nailed you. Thank you, Chris, yeah. But you can check, just in the praise data sheet.
A
You can look through the total impact hours dished, and what's more likely is that you would have two names on there. But if anyone wants to independently verify, just make sure that you only see one of your handles on that list.
A
B
Hello, yeah, sorry, I was just going to say, on the directed graph: it would be really interesting, if that comes together, to look at the number of times you have multiple people giving clusters of praise, and thinking around consensus, whether it should just be fully individualized, you know, or if there should be a consensus before the praise is sort of validated by the community, meaning more than one praise giver. I was just thinking that might be some interesting tidbit when you make that graph.
D
C
Yeah, we can. I mean, particularly the praise quantifiers would be super important stakeholders there, because I guess they went through every single one of those seven thousand at some point and have a pretty good overview of potential categories, and from there we could run iterations on the data to add labels and then have a first analysis.
D
C
A
Cool. Well, I want to hijack the call, because there's another call happening right now in this same room, if that's okay. And I want to give people space to leave, but at the same time invite everyone to stay, because it is kind of a pretty momentous occasion: that, with that, we find our...