From YouTube: .NET Design Review: DataFrame
Description
We're looking at a new API for DataFrame. https://github.com/dotnet/apireviews/blob/master/2019/10-08-dataframe/README.md
So if people are familiar with pandas, then this is like that; and for those who are not familiar with pandas, that's what the DataFrame is. It's basically tabular data that is in memory, but you can execute operations on it. So, like a binary operation: you can do, like, column A plus column B, and that would give you the sum column, right.

The DataFrame has a backing store in the Apache Arrow format, which means it just stores bytes at the back, and there's a standard representation for float and int and so on. Which means that if you have a column of, say, ints, and then you add a column of floats, the result, the compiler knows, is a float; so you need to make a new column. You cannot, like, update the int column itself in place, because the data type would change, right.

Related to the data space, it's a way for people to explore a data set that they have. So if you have, like, a CSV or something, you can import the CSV in a notebook and you can explore the data. So, for example, say that you have a column that you want to investigate: you can do a correlation of column A and column B, and that would be an API that gives you access to a column's data, so you can write the correlation function.

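In pandas terms, the exploration workflow just described looks roughly like this (a sketch; the CSV content and column names are made up):

```python
import io
import pandas as pd

# Stand-in for a CSV file you would load into a notebook.
csv = io.StringIO("a,b\n1,2\n2,4\n3,6\n")
df = pd.read_csv(csv)

# Column-level access is what lets you write things like correlation.
r = df["a"].corr(df["b"])   # Pearson correlation of column a and column b
assert abs(r - 1.0) < 1e-9  # b is exactly 2*a, so they correlate perfectly
```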
Whatever computation you want to do, stuff like that. But it doesn't have any dependencies on either ML.NET or System.Data; not on System.Data, and from ML.NET there's no dependency. Right now it only implements IDataView, which is its own... yeah. So, like, it's okay, so it's, yeah, it's independent! It's related.

It's important for two reasons: one is you can do zero-copy wrapping, and two is, we expose, you'll see later, but we expose APIs to get at the actual buffers. And so the Arrow format is a columnar format, right. So all the data is stored in columns, and so the idea is you can do SIMD operations on this data, which you couldn't do in, like, a row-based format, right.

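A small NumPy sketch of why the columnar point matters: with each column stored contiguously, one vectorized (SIMD-friendly) kernel handles the whole column, where a row-based layout forces per-row scalar work (the data here is hypothetical):

```python
import numpy as np

# Columnar layout: each column is one contiguous buffer.
col_a = np.array([1, 2, 3, 4], dtype=np.int32)
col_b = np.array([10, 20, 30, 40], dtype=np.int32)
col_sum = col_a + col_b              # one vectorized pass over contiguous memory

# Row-based layout: the same data as tuples forces scalar, per-row work.
rows = [(1, 10), (2, 20), (3, 30), (4, 40)]
row_sum = [a + b for a, b in rows]

assert col_sum.tolist() == row_sum == [11, 22, 33, 44]
```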
Question, going back to IDataView: so pandas allows random access, right, and IDataView, I think, is built only for forward, kind of streaming, access to data. And so, were you able to, even though you based your store on IDataView, were you able to overcome this, so you can do random access?

And the primitive types that the column can hold are, like, the primitive types that we have in C#, like int and float, and those are mutable. And then strings have two columns, a StringColumn and an ArrowStringColumn. The ArrowStringColumn is immutable; the StringColumn is mutable. So that's the whole point: the data we can represent in Arrow.

So when you say Head, that would mean the first five rows is what Head gives; Tail would give the last five rows. So, basically, from lines 170 to 180 you're extracting data; those are all the indexers, or the get-value indexers. Line 183 is a set-value indexer; this means change the value at that index to 1000.

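The pandas equivalents of the Head/Tail and get/set indexers being walked through (a sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

assert len(df.head()) == 5     # Head: the first five rows by default
assert len(df.tail()) == 5     # Tail: the last five rows

value = df["x"][3]             # get-value indexer: read the value at an index
df.loc[3, "x"] = 1000          # set-value indexer: change the value at that index

assert value == 3
assert df["x"][3] == 1000
```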
The derived type is PrimitiveColumn<T>; so that's what I'm doing. The reason that is important is that PrimitiveColumn<int> knows that it holds integer data, so in the APIs that you use, IntelliSense will tell you that the return type is an int, and the data is an int there. If I index into the DataFrame itself, everything would be object, because that's the base column.

We're doing that because they're different target use cases, right. That IDataView is, it's extremely, like, the least common denominator of what tabular data could possibly do, right. By the least common denominator, I mean it describes to you what the columns are, and it allows you to get a cursor, a forward-only cursor, over the rows, right. That's all you can do with IDataView; those are the only operations it can possibly do.

Like, that's its only goal in life, because the idea being you can implement that with all sorts of different data, right. Some could come from Azure blob storage, from a file, from a database, all sorts of different places, right. Whereas a DataFrame is very opinionated about how its data is actually stored, and since it's opinionated about how its data is stored, it can allow a much greater range of operations on the data.

I think it would be good if we ended up with these two technologies being rationalized, so that, for somebody who wants to deal with tabular data, they look like a single, coherent feature, with, maybe, you know, different APIs and abstractions optimized for slightly different scenarios, but one feature. I worry whether we are not, instead, on track to come say, well, IDataView has these limitations, and in ML.NET we're gonna create almost, like, you know, another one.

There is also another rationalization point: IDataView doesn't provide random access, and the target goal here is to mimic the pandas DataFrame, which allows you random access, so you can access any data in any order you want. Whereas with IDataView you have to go, like, scanning from the beginning to the end all the time.

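The forward-only versus random-access distinction can be sketched in plain Python: an IDataView-style cursor is like a one-pass generator, while a DataFrame column is like a list you can index in any order (the names here are illustrative only):

```python
def forward_only_cursor(rows):
    # IDataView-style: the cursor only ever moves forward.
    for row in rows:
        yield row

cursor = forward_only_cursor([10, 20, 30])
assert next(cursor) == 10
assert next(cursor) == 20   # revisiting 10 would require rescanning from the start

# DataFrame-style random access: any row, in any order, as often as you like.
column = [10, 20, 30]
assert column[2] == 30
assert column[0] == 10
```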
No, I don't think that that is the goal, and I understand you can add all sorts of things on top of the APIs. I'm just saying: can we end up in a situation where, for users who want to use any of these technologies, they basically appear as different features in a single technology, not two competing technologies?

But DataFrame is already implementing IDataView. What it's already implementing is, for example, it can be treated as an IDataView, but it's not the other way around; you apply operations to the tables differently. Not if you only have it as an IDataView. Yes, but that direction is one way; the other direction doesn't seem to be good. I mean, it's a specialized type that has specialized operations.

A specialized type is good; you could think of it as an array versus a span, right. I mean, they're not completely orthogonal, where one is a strict subset of the other, and so you end up with a world where certain operations don't make sense; you have to layer it in terms of, like, a concentric-circles kind of thing, right. So.

You said it was very important, the format that it's exposed in; the format, it's an input type. It is very transparent over the Apache Spark stuff. I mean, this is Spark, and... no, the Apache Arrow stuff, which is different, sorry, from the Spark stuff. Sorry, Apache Arrow, okay, yeah. There's a data format that this is entirely dependent on; this is just a representation of that data format. This is not data in general; this is Arrow, or whatever it is.

I mean, I would separate the particular implementations from the use cases, right. In the same way that, you know, DataSet had hard dependencies on certain XML formats. But the question is: what interaction model are these things used in, how are we promoting this, and what is our design really promoting?

If you want to do something in an interactive fashion, similar to a workbook-style thing, then, yeah, you basically have to think about what that means for the user experience. But the fact that it also mirrors somebody else's format, I think, is not the high-order bit of it. At that point it just becomes: what's the use case for the API, and does the namespace make sense for that use case? I don't think it should be Microsoft.Data, because I think that is way too...

You know, it simply implies a commonality that just isn't there, right. It's a way more specific thing. But the right understanding is that it is general-purpose from the point of view that you can load any data into it, you can look at any data, you can transform it in any way you see fit, and then, you know, over time you'll probably also have multiple ways you can export this data into some shape, be it a CSV file or some, you know, other shape. But I think they...

I just mentioned the namespace, kind of, you know, as an illustration. I think external customers will see these two features as related; usually we communicate that by, hey, you know, they belong together, maybe not in exactly the same namespace, but kind of, like, you know, the same area, operations that look, you know, similar. I don't know, like to ML.NET. There is no API to convert one to the other, correct me, but, yeah.

No, it's not at odds, because, yeah, there's another aspect: in a notebook, at least in the examples that I've seen, you always want a copy in all these places. So you did something, you don't like it, you can just go back, like press up-arrow, and you get the previous state that you were in. If you were mutating in place, then it's gone. So it's more an exploration style; you don't know if that's the end product you want.

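The copy-not-mutate exploration style being described is the same one pandas encourages in notebooks; a sketch:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# A non-mutating operation hands back a new frame...
df2 = df.assign(x=df["x"] * 10)

# ...so if you don't like the result, the previous state is still there.
assert df["x"].tolist() == [1, 2, 3]      # original untouched
assert df2["x"].tolist() == [10, 20, 30]  # new state
```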
At the moment it's completely separate. The way I understand it is: you take the data that you want, I mean, you take your CSV file, your source, and you do all the exploration you want. So you see if this column is actually correlated to the other one or not; do all the operations you want, like add columns, do projections and stuff like that; and come up with a set of operations that you like, so that you know that this is now...

And this is, I mean, one of those, I guess, concerns I have: if we don't have multiple consumers, if you only have one consumer, then you're always running the risk that the thing you're building is very specific to that one scenario. Versus when you try to create something that is more, you know, general-purpose, you kind of need more than one consumer to really make sure that you actually have something general-purpose, and not something that is, you know, just part of my data plan, right.

Yeah: one is data exploration, specifically in a notebook, right. In a Jupyter notebook I want to load up some data and manipulate it, right, analyze it, chart it, maybe modify it. That's one complete use case. A second major case that we're seeing is in Spark: there's an operation called a UDF, right, a user-defined function, yeah.

The input and output of these UDFs is an in-memory tabular data set, right, which would be this DataFrame. And so, in a user-defined function, right, you have all this data coming from Spark and you want to do an operation on a chunk of it. There you're not, like, exploring the data like you would be in a notebook, right; there you're actually taking that data and manipulating it somehow, whatever the UDF does to the data, right, whether it's adding these two columns or whatever. But...

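A minimal sketch of the UDF shape being described, in pandas rather than Spark (the function and column names are made up): the UDF takes an in-memory tabular chunk and returns another one.

```python
import pandas as pd

def my_udf(chunk: pd.DataFrame) -> pd.DataFrame:
    # A user-defined function: manipulate one chunk of the larger data set.
    out = chunk.copy()
    out["c"] = out["a"] + out["b"]   # e.g. "adding these two columns"
    return out

# The engine (Spark, in the discussion) would invoke this once per chunk.
chunk = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
result = my_udf(chunk)
assert result["c"].tolist() == [11, 22]
```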
I think that's, I mean, it was originally the idea; that's similar to, you know, what IDataView was, it was the phrase we used. So I think this was the same desire. That's why I'm saying: depending on where we land with the number of consumers, that may or may not make sense, right. This is why, I mean...

I'm not sure that we're going to have something that should be general-purpose until we actually sit back, look at what things like Python and other languages do, and start supporting combining localized compute kernels, properly optimized, so that you don't walk memory again and again and again. Because if we're not doing that, we're always going to be light-years behind what other frameworks and languages are doing. And also...

I would say, you know, like, we may say we messed up on IDataView, and the ML.NET pipeline is not it, but I thought that was the idea. So if we think that the usability of ML.NET is not great, that it doesn't support random access, would we think about trying to fix ML.NET to support these additional scenarios? Because I can...

You know, like, ML.NET started as forward-only and lazy. I don't see a reason why, for, you know, smaller data sets, data sets that fit in memory, and various other, you know, scenarios, you would want different kinds of implementations of some very common abstractions, and then the operations. I completely agree that today the ML.NET pipeline is not super easy to use. But that's another thing: would we consider trying to fix it first, if we were working on, actually, you know, the same idea?

I mean, for ML.NET the question is really, like, how much mirroring will actually be necessary. But if you have one model that is all about, you know, creating new instances, and it's lazy, then without duplicating the entire API surface it might be very hard to have a model like that; you almost need different kinds of wiring-up at that point. But I think it's a point well taken, right. I mean, like...

We know that when you try to explore models, and you start with small data sets, the pipeline model is very hard to explore, because you basically have to run the whole thing, debug in one pass, then change code and rinse and repeat. Versus in, you know, other systems that have DataFrame-style APIs, you can set breakpoints, you can introspect intermediate results and things make sense to you; I mean, you can just mess with stuff.

Implementations are lazy, and there are some APIs that, kind of, you know, support the lazy scenarios, which you can implement, right. But the IDataView is an eager API; the forward-only thing is separate. Random access it doesn't support, but it's not fundamentally lazy. Correct me if I'm wrong. I mean...

This direction is great, literally no concerns here. I have a bit of a concern about the operations working differently: operations from ML.NET and operations from this feature will basically not work alike. You said that the reason it was done this way is that ML.NET is not super easy to use, to apply those operations. I agree with that. I wonder whether we kind of gave up on trying to make it easier.

Yes, maybe, all right. However, let's focus on this API here. Because, I mean, maybe a source API is easier to use than this API, but first let's get a handle on what that API is, and then maybe we can later decide: should this API be used in other places? If not, then let's pick a different name; if yes, then let's pick it. Right? Again, like, so what you have here is the modification part of it, right?

Right, yeah, that's right. So, yeah, ignore that; if that is not fixed, is this supposed to work? Yes. What is the intention? The intention was: I did not want two columns with duplicate names in the DataFrame, because then, when I say dataFrame[columnName], if there are two columns with that name, I don't know which one to return. So I was throwing.

So the bug was: if I replace the column with another column of the same name, I was preventing it, but I shouldn't, because that means I'm replacing one column with another column of the same name, so it doesn't matter. I see, but you're passing here index 2, right. If you pass a different index, that would throw, because that means you're creating duplicate column names, yeah.

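The rule being debated (duplicate names are rejected, except when replacing a column at its own index) can be sketched in a few lines of Python; everything here is hypothetical scaffolding, not the actual implementation:

```python
class ColumnCollection:
    """Toy model of a name-unique column collection."""
    def __init__(self):
        self.names = []

    def set_column(self, index, name):
        # A duplicate name is only allowed when replacing at that same index.
        if name in self.names and self.names.index(name) != index:
            raise ValueError(f"duplicate column name: {name}")
        if index == len(self.names):
            self.names.append(name)    # appending a new column
        else:
            self.names[index] = name   # replacing a column in place

cols = ColumnCollection()
cols.set_column(0, "a")
cols.set_column(1, "b")
cols.set_column(1, "b")      # replacing a column with one of the same name: fine
try:
    cols.set_column(0, "b")  # different index: duplicate name, so it throws
    raised = False
except ValueError:
    raised = True
assert raised
```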
Basically because otherwise... the slowest possible thing you can do on a computer, outside of, like, network access, is memory access, yes. And so you're changing something that could be walking 100 megabytes once into walking, you know, 2 gigabytes, right, doing 20 operations. Okay, so...

We don't have any plan to do that, because there's two reasons. The first one is, like, on a notebook, if you say dataFrame.Add(something), and then on the next... so if you say dataFrame.Add().Divide(something), there's no way for me to differentiate which one you did, because I'm not building up the set of operations that you're doing. So, like, I don't know that you're doing an Add followed by a Divide, so I can't combine those two together.

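A NumPy sketch of the trade-off just described: with eager, one-operation-at-a-time evaluation, every step materializes an intermediate column and walks memory again, and because the library never sees the whole expression it cannot fuse the Add followed by the Divide into one pass:

```python
import numpy as np

a = np.ones(1_000_000)
b = np.ones(1_000_000)

# Eager evaluation: each operation is a separate pass over memory.
added = a + b            # first pass; materializes an intermediate buffer
result = added / 2.0     # second pass, over that brand-new buffer

# A fused kernel would compute (a + b) / 2 in a single pass, but that
# requires recording the expression instead of executing it immediately.
assert result[0] == 1.0
```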
That requires people to go and write their own, basically, compute kernels that do add-and-divide combined, and so you've got an explosive number of expression trees that becomes infinite, rather than having a system that understands them inherently and combines them itself, just like Python knows how to do, right. I just had one caveat.

Back to the compute-kernel stuff: another thing in the Apache Arrow project is exactly that. There's a thing called Gandiva, which is, like, basically compute kernels built in C++ on top of Apache Arrow data, right. And so another plan here is, since we're building on top of Apache Arrow, in the future we should be able to take advantage of any new capabilities that come to the Apache Arrow realm, space, whatever you want to say.

Those were good ones, though. Actually, if you go back to that test, right, like DF, DF... I'm on line 317, right: doing the equals-equals on two columns compares each value in the column and then gives you back another column full of booleans, yes, that is true or false, whether those values were equal or not. Yeah.

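This is exactly the pandas behavior the API mirrors; a sketch:

```python
import pandas as pd

c1 = pd.Series([1, 2, 3])
c2 = pd.Series([1, 0, 3])

# Elementwise ==: the result is a whole column of booleans, not one bool.
mask = c1 == c2
assert mask.tolist() == [True, False, True]
```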
That one is the only one where, I'm, you know, I'm not sure that's such a good idea. I mean, generally speaking, we do plus, and I think it's fine; the return type is within the same domain, that kind of makes sense. But for equality comparisons... the thing is, I mean, it's not impossible to do reference-equality checks, but now doing object or reference equals, or the typical null-check thing... I think people will find that confusing.

Sure, but our general guideline is: don't be cute with operator overloads. The equality operators return a boolean; the inequality operators are all boolean operators. It's just, don't be cute with operators; that's how it's worded. So what would the suggestion be, then? Just a static method that returns a new column that is the equals mask between A and B? Yeah, but then, you know, how will you know the name?

The one thing that everyone remembers from the SIMD stuff we reviewed at some point: it wasn't so much about operators, it was specifically about equality. Like, equality is hard, and people have a very, very pre-canned understanding of what they think that equals-equals does, and I think returning anything but boolean will conflict with that. At the same time, though, you expect people to be able to overload equals, not to mess with the return type, but to just change the way you do equality.

You know, you could look at concat and other operators, right, and I think with plus and minus people kind of have an intuitive understanding already that the types may change. If you multiply an int and a float, you don't get back an int, you get back a float; so people understand that for arithmetic, yeah, the return types may differ, and they don't tend to use those in if-checks, right. Versus every time you use equals-equals, it turns out, you expect a boolean from the operator.

I think there's more to it. I think operators work better in domains where there is an expectation that operators would work. So, in, you know, the domain of numbers: if something claims to be a number, you know that it's going to support plus, minus, multiply, and this and that and everything, right. Here, I also worry, like, how will people know that a column has a plus operator that takes an int? It's really nice, you know, it looks very nice once it's written, but operators are not...

They don't show up in IntelliSense. That's right, and it's a bit surprising that, you know, I can add one to a column. A second thing I think is a bit unfortunate here: when operators work everywhere, it kind of looks nice, like math, you can add two things. But here it's like a mixture with methods that are not really named like methods, like, for example, a column that takes an int.

I think we're worrying about the average user; I buy that it's the same thing. We solved things like object initializers, right: if you look at how people discovered those in our JSON APIs, they usually didn't write them cold; either they had an expectation it's there or it's not. But once you know it's there, it makes your code really nice to read, right. So the thing is, I think it should be generally true that for every operator we would have a method that does the same thing.

So if you just browse IntelliSense, you see those methods, and then that's the way you would do it. And at some point you see somebody's code where you see, oh my god, they used the super-compact notation, and it's super nice. And so I think, in that sense, I don't think we need to take something away from people because, you know, they don't expect it; but at the same time, I think when we give something to people it has to be self-consistent.

I think, to what Eric said, in Python a lot of people do that, and it is part of the reason why the DataFrame is successful there: because it is fairly compact, yeah, and because you can do relatively complicated things in what looks intuitive. You can look at a few lines of code and learn what it does; versus if you look at vector multiplication in C, where you don't have operators, it's very hard for you to visualize what it does.

There are limits, like, what am I getting away from Python? Thank you. I think, to me, it's not important that feature X in .NET is the same as feature X in Python, because that's not how the world works. You don't just move to .NET for this one feature, you know; I have to absorb the rest of it. Do you have to access files?

You have to deal with the compiler, with the IDE; there's a whole different ecosystem. So once you're in that ecosystem, it is super important that things are self-consistent, and so that's why I think equality is the thing you really can't mess with. Because it is already very complicated in .NET: we have value types and reference types, you have different semantics. And so if you now also blend in the fact that, yeah, in this one feature we also completely redesign what the expectations are for equality, I think that would be fairly bad. Yeah.

We would be making a feature for Python developers and not for .NET developers. But, I mean, it comes back to my original question: why would Python developers... yeah. So in our design guidelines we say equal-equal is the same as IEquatable, and the inequality operators are the same as IComparable; that means that they are boolean operations, right.

I get all that. The only pushback I have is: for now, you've only been doing column plus one. But if you chain operations together, then you could have operators everywhere, except where you want equals-equals, where you have to say .Equals, and then you have, again, just operators. It doesn't look...

So don't get me wrong, right, I hear what you're saying, and I think this goes back to Eric's point earlier: should we remove all of the operators? To me this is kind of like throwing out the baby with the bathwater. But, like, the one thing you need to think about is: how often do you do plus, you know, concat, multiplication, versus equality? I would assert you probably don't compare columns that often, compared to the other modifications you make, so is it worth giving up the compact syntax?

The indexer? Yeah, but if that's reducing them, as long as you get back... that's a filter operation, and not, like, not returning a column that contains the answer of the equality call, yeah. Like, these are, again, a different operation; you could say this is what I want less-than to do. And, in fact, the actual rule that I thought I was paraphrasing as "don't be cute with operators" is actually "DO NOT be cute with operators", so good job, Chris.

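The filter operation referred to here, in pandas terms (a sketch): indexing a frame by a boolean mask selects the rows where the mask is true.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 5, 2, 8]})

mask = df["x"] < 4     # elementwise comparison yields a boolean column
filtered = df[mask]    # indexing by that mask is the filter operation

assert filtered["x"].tolist() == [1, 2]
```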
And note that overload resolution depends on the expression ordering: if you're using methods for add, subtract, multiply, divide, you have to be very explicit and very careful about how you order those expressions, yeah. You don't get the compiler just saying, oh, I know multiply comes first, so I'm going to do that, even if it's over here in your expression tree. Yeah, I...

I was not saying that C# is the best language for these scenarios. Well, I mean, and you take what you can get in C# here: it's going to be more than just operators on those tables, and therefore, sooner or later, you end up in a .NET world which is full of functions that have names, and they're all in IntelliSense. What do you mean, you had a...

The thing you have to think about is how this will feel natural even if you're a C# developer: how do you get the same benefits of a compact notation, where you can do fairly complicated, you know, data transformations in a way that feels, you know, both intuitive, easy to read, and also easy to write. I think that everybody...

The remaining thing there is, then: if there is a place where people want and need to be able to do equalities which return a mask, and C# doesn't have support for that today, then is that something that we need to talk with the C# LDM team about, to see if they can add new operators for that? There's already a proposal for a new power operator; so maybe we need something that says, here's equality and here's comparison. The argument that I've heard multiple times during this

session is that this is not intended for typical .NET developers; this is intended for data scientists who aren't otherwise steeped in the .NET world. And that also comes back to the point that we spent a while, you know, beating on earlier, which is: does this belong in CoreFX? If that's the target audience, I would say the answer was no. If that's the target audience, I mean.

One of the things that usually happens a lot, and we hear this from a lot of people, is: there are tons of code and algorithms from papers available in Python, and people try to copy or move this code over and use it. So I expect also that people who use .NET will look at Python code, because there is a rich, large amount of Python code, and try to convert it to C#; and the easier we can make that, the better we can show them how to migrate.

And I don't think that, you know, even though this might be primarily targeted towards people who have some ML background and stuff... one of the reasons people like Python and are able to do it is they might be trying to add some minimal ML to their app or something like that, and Python is very easy: you just open a command line...

You start typing code, like you would math or anything else you already know, and you can get results, because it's familiar to people. Even if they don't have that data or ML background, they just type math and they get math back, and it works. And that's basically all this is: data frames, vectors, tensors, they're all just well-defined mathematical types with well-defined operations. So if you understand math, you can type math and you get math back.

Equal... like, the inequality operators, and the equality operator, which is a special case of an inequality operator, are a single boolean, and we have that. It's actually mostly implicit in the guidelines, but the two specific things we have are, like: IComparable translates to the classic inequalities, and IEquatable translates into equal-equal and not-equal. So...

Those are single booleans, and that is the defined behavior we have in .NET. And the whole point of API review and the Framework Design Guidelines is that .NET feels like .NET, and not like this thing that came out of, inspired by, Python feels like Python. It needs to feel like .NET; that is the number one rule in .NET. We are not...

Don't get me wrong, like, I really agree with what you just said. At the same time, though, I think the goal is not to take a Python concept and import it to .NET, right. I think the goal is to say: there is something that is really popular and successful in the Python world, for a particular sort of characteristics. And so the question is, if you were to build something that has similar characteristics in .NET, what would be the .NET way to do that?

Right, and I think we just discovered that for operators it is harder, because people like the way strongly-typed systems like C# or Java work: there is an expectation for what certain operators do, right. At the same time, though, I think there is a desire to say, can we find a way to do these things in a compact fashion?

I don't know what that would look like, but, for example, one thing I could see you doing is: instead of saying we have these operators on the column type or the DataFrame type, which is the thing that people will pass in and out of methods, and thus probably want to do null checks against, it's different from saying you have a method that takes a lambda; the lambda has some funky arguments, some types, and those have to have certain operators defined for them...

...for you to express what that condition looks like, right. And for those things, maybe we don't care that we have a double-equals, because those are not the things you're actually passing to a lot of methods. But I think the question is: can we create a compact notation to do column-based transforms, and not create a world where people, when they write argument validation, are surprised that double-equals doesn't return a bool anymore, right. So I think that, to me, is more, like, you know...

Okay, so a named method Equals returning a column and stuff: I think it needs to be something like EqualsMask, because, okay, well, you already have them at the name Equals, and anyone who has an IEquatable or IComparable or any other .NET background will, by default, assume GreaterThan returns a bool. So you need something like GreaterThanMask. Even if it does not implement IEquatable, I still think the default assumption is the usual one.

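For comparison, pandas resolves the same tension by pairing each comparison operator with a named elementwise method (eq, ne, gt, and so on), which is the kind of named "mask" method being proposed here:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([1, 0, 3])

# The operator and the named method do the same elementwise comparison.
assert (s1 == s2).tolist() == [True, False, True]
assert s1.eq(s2).tolist() == [True, False, True]   # the named spelling
```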
You can't really understand what the flow of what you're getting is, and, worse, you have an expectation that is wrong, right. Like, when var leaves your computation ambiguous, that's okay; when you have concretely decided on the wrong answer, now that's when you get confused in reading the rest of the code, I guess.

Thank you. Also, like, another argument: we also don't tend to put operators on non-sealed types, because you can override and change the behavior, and the operator is a static function. Well, but the operator is just supposed to defer to the named method; that's our guideline up there, anyway. They're...

A "Base" suffix... but a "Base" prefix would be all sorts of fine, yeah. If it's a usable type by itself, then it should just be DataColumn or something, and then you have specialized types beyond that; that's fine. But, like, we call it Collection<T>; we don't call it BaseCollection, even though it's mostly only ever used as a base.

If somebody is already, you know, in this world... then, yeah, that's another part of the design. Because we do tend to say: if somebody who already has domain-specific knowledge would see that type name and say, oh, I kind of know what it does, then it's a decent type name, as long as somebody who doesn't have the domain knowledge wouldn't believe it means something it isn't. So if it's overly generic, we should... well.

An issue, a direct issue with this, is that Spark .NET has a concept called a DataFrame, right, and it's like a distributed DataFrame, right. Like, it can represent data where, you know, some of it is on this machine, some of it is on that machine, some of it is over here where you're running, and it represents all of the data across all of this. And since we actually want to use this type in the context of Spark .NET, in those UDFs that I talked about earlier, the names directly collide, right.

A
Yeah, I wasn't seeing all this for a second, because this goes back to what our principles are all about: let's make one feature, or one technology with multiple features, rather than competing technologies. But again, I don't know; I don't see enough reason not to pick a different name than DataFrame. I think once you talk about how this fits in with other things, maybe we'd have a different conversation about that, but...
A
G
B
J
I
I
R
B
A
M
E
J
Our general guideline is: if you always return the same type, someone is going to depend on that and they're going to cast it, yeah. So only return the interface from a property or a method if you actually return multiple different things behind that interface; otherwise you will break someone when you change it.
J
In System.Linq we have a .ToList extension method, and it returns a List of T. Okay. Here we could return a read-only list.
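The ToList point, shown with the real BCL APIs: LINQ's ToList hands back the concrete List&lt;T&gt;, not an interface, because it is purely producing data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Demo
{
    static void Main()
    {
        IEnumerable<int> source = new[] { 3, 1, 2 };

        // System.Linq.Enumerable.ToList returns the concrete List<int>,
        // not IList<int> or IReadOnlyList<int>: when you are only
        // producing data, return the most specific collection type.
        List<int> list = source.ToList();
        list.Add(4); // the concrete type makes mutation available
        Console.WriteLine(list.Count);
    }
}
```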
J
J
J
J
B
A
B
J
R
R
J
If what you return could vary, then the IList would make sense. But since you are only producing data and it's a collection, you should return the most specific collection type you can.
B
P
C
J
R
Q
J
I mean, assuming this is called DataFrame and that's called DataFrameColumn, then that seems fine. Okay, these days I tend to make tree-like types like that not point back up, because it gives you a lot more flexibility, right. If it's not important that a column know about its DataFrame, then don't tell it. And especially: what happens after you get one and you remove it from the DataFrame, what universe are you in now? So don't add the back-pointer unless you need it.
Q
J
J
L
J
Q
Q
J
Q
R
J
M
M
J
R
B
J
E
E
I
A
R
R
E
E
L
B
B
K
F
J
Yeah, this doesn't feel right: the indexer is not getting you a smaller type, which really suggests it's not an indexer. I mean, you're getting less data, but I can't think of a type off the top of my head where the indexer returns the same type. Is it a common scenario that you already know the row indices you care about, and you just want to select those very particular indices? I think you guys were going...
B
J
J
I think both of the indexers that are returning a DataFrame should be methods and not indexers. While you are reducing the amount of data that comes out of the operation, you're not getting to a smaller and smaller type, the way List of T indexes down to a T and string indexes down to a char, because a string is really a list of char.
J
P
J
B
Q
B
A
A
M
B
J
J
M
B
The BaseColumn one: if you remember, the equality operators, before we had that conversation, returned a predicate column. You can just take that boolean column that you got back from the comparison operators and pass it into this indexer, right. Again, it's about writing compact code, right. Yes.
E
J
B
A
A
Maybe that's why we react this way. If you have an expectation of what these things do, then having compact code is obviously charming, but I'm wondering whether people would look at those signatures and be able to tell what these things are doing, or whether they'll guess clearly the wrong way, or have a lot of difficulty telling what these things will do. Because...
J
Something like this can always be added later, once there does seem to be a feeling that a lot of people are asking for it, because a lot of people feel it's natural and so on. You can always add things later; you can't ever take things away. So for these two, I look at them and I say: they're not doing the same thing as the other indexers around them, and they're not building on something like Range, where we have the same pattern throughout the entire framework.
J
This is a different concept. One of these would be, like, SelectRows, and another one would be, like, Filter; those are two operations, and they're different. So we shouldn't be using the same syntax for "give me this row", "give me a thing that is logically the set of rows at these indices", and "give me rows based on a predicate value". They should just get different names, so you help clarity; and again, you can add sugar over time.
E
J
E
S
B
So, in this case, what this is doing (can you keep the line breaks? well, whatever): housingData in this case is a DataFrame, right. What this chunk of code is doing is splitting 10% into a test data frame and 90% into the training data frame. The Shuffle method just takes an integer array and randomizes it; basically, don't worry about that. What randomIndices gets you is a random permutation.
B
You know, of all the indices from zero to the count of data that you have, randomized, right. So lines 16 and 17 are saying: take the test size of those random indices and split them up into two arrays. And now the key is lines 19 and 20, right: from housingData I can just say select out the train rows, and on line 20 I can say select out the test rows. So...
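The split being walked through can be sketched as below. Shuffle, SelectRows, RowCount, and the indexer form are stand-ins for the code on screen and the API under review, not confirmed signatures:

```csharp
using System.Linq;

// Randomize all row indices, then carve off 10% for test ("lines 16-17").
int rowCount = (int)housingData.RowCount;
int[] randomIndices = Shuffle(Enumerable.Range(0, rowCount).ToArray());

int testSize = rowCount / 10;
int[] testRows  = randomIndices.Take(testSize).ToArray();
int[] trainRows = randomIndices.Skip(testSize).ToArray();

// "Lines 19-20": select rows out by index. The review argues a named
// method reads better than the square-bracket form for this.
DataFrame train = housingData.SelectRows(trainRows); // vs housingData[trainRows]
DataFrame test  = housingData.SelectRows(testRows);  // vs housingData[testRows]
```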
J
J
Looking through the var here is good, because it's testing what you think the expression does. I'm good up through line 17; I think I have a good idea what it's returning. Line 19, to me: I see housingData indexed by a thing. I don't know why it's taking multiple rows, but to me it's returning, like, one thing. So train must be a row, or maybe it's a column. But, like...
B
J
J
P
J
E
M
J
But I mean, you're designing an API for something that is intended to hold potentially more than two billion rows, and now you're saying that this particular indexer can only access up to two billion. So I just think line 20, if it said housingData.SelectRows(trainRows), would be way more clear than using the square bracket notation.
B
J
For one, this is Range, the type Range that we added. For better or worse, we came up with guidelines for what it does, which is that it does return the T. A lot of people were upset with that, which is why I deleted that fact from my brain. So this is a pattern that we already have, versus inventing a new thing based on IEnumerable that has a similar feel; again, we'd be inventing a concept.
P
J
B
B
J
A
A
Have a named method, or make it very clear in concepts, right. So, for example, we don't represent a Range as two loose ints. It seems reasonable to me to say you have an overload that takes a long, one that takes an int, and an overload that takes a Range, and there's not really a conflict in what they do, because they're completely different concepts. The fact that they are backed by similar types is almost irrelevant.
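The overload set being argued for might look like the sketch below. The type and member names are illustrative (GetRow, Slice, and DataFrameSketch are not proposed API); the point is only that int, long, and Range coexist as distinct concepts:

```csharp
using System;

// int, long, and Range overloads do different jobs, so having all
// three is not a conflict even though Range is "backed by" ints.
public abstract class DataFrameSketch
{
    public abstract object GetRow(int index);   // convenient common case
    public abstract object GetRow(long index);  // frames beyond 2^31 rows

    // Range is its own concept: a slice, e.g. frame.Slice(1..^5),
    // not a pair of loose ints.
    public abstract DataFrameSketch Slice(Range range);
}
```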
P
A
R
J
Make a thing that takes a Range instead of inventing your own pattern for it. Then you can see the thing that it did, like the ".." syntax, and you can yourself look at the Index and see that it's counting five from the end. Great; callers can hook it themselves and convert it.
J
J
Yes, there are things that you wouldn't be able to express, but you're going to hit that when you try to take whatever thing you got that was a long index and build the Range object out of it: either it throws and you deal with that, or you cast the long to int yourself, and that cast is a sign that something might go wrong here. Like, no one ever uses checked blocks, and there's a lot of code that should be in a checked block. Well...
A
A
To me, the one thing is, I would generally say the preview should be based on what the feature team feels is appropriate, right; I don't think we have to resolve all issues in order to ship a preview. Think of how many questions we just had where, you know, it's interesting to see how customers react to that, what feedback you get. Right, I don't...
A
A
B
P
P
L
B
A
J
Yes, that Arrow one should be in the next namespace over; it shouldn't take Microsoft.Data, because it's not the be-all and end-all of data. It is Arrow-compatible data, and maybe everything is Arrow-compatible data, but if a new standard comes along tomorrow and we're like, oh, that one's better than Arrow, we want to be able to do that.
E
J
M
J
J
You could write a conversion on it, but then you're going to get whatever massive performance penalty you get from transparently converting, so you're never going to do that. But you would be okay if I had extension methods that take the Arrow types? So then you would be okay, because then, when you're using this type, it has nothing to do with Arrow. The import is not from Arrow, the export is not to Arrow; that's all in something else, or derived types, or whatever. That's all.
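The shape being suggested, sketched with hypothetical names (ArrowInterop, FromArrow, ToArrow, and RecordBatch as the Arrow-side type are all assumptions): the core DataFrame type stays Arrow-free, and the interop lives in a separate namespace as statics and extensions.

```csharp
using System;

namespace Microsoft.Data.Analysis.Arrow // "the next namespace over"
{
    public static class ArrowInterop
    {
        // Import: build a frame from an Arrow batch. The core DataFrame
        // type never mentions Arrow in its own surface area.
        public static DataFrame FromArrow(RecordBatch batch)
            => throw new NotImplementedException(); // sketch only

        // Export: hand the frame's buffers back out in Arrow form,
        // as an extension so it reads like a member.
        public static RecordBatch ToArrow(this DataFrame frame)
            => throw new NotImplementedException(); // sketch only
    }
}
```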
J
E
B
B
J
I
L
B
E
J
M
J
Okay, so if anything is based on "oh, you can see how to build this by looking at Apache Arrow's website," then that's an Arrow-ism and not a DataFrame-ism. If you're designing it and it happens to look like Arrow, and you're okay with the notion that you've copied their behavior, that's fine; but what if something else came along later that everybody else was doing, and we thought it was good to adopt, not just to bandwagon but because it had value?
J
If we're now going to have "oh, we want to go change an internal implementation detail, we're not using the Arrow string anymore, but we still export as Arrow string": okay, this went from a non-transforming copy to a transforming copy. Or this went from "we gave you a span over the data because it was free" to "we're making a copy to give it to you." These are very big changes, and anything that would ever require them for switching off of Arrow ties this type down.