Description
Speaker: Thomas Pinckney, Senior Director of Engineering at eBay
SlideShare: http://www.slideshare.net/planetcassandra/e-bay-nyc
Recommendation and personalization systems are an important part of many modern websites. Graphs provide a natural way to represent the behavioral data that is the core input to many recommendation algorithms. Thomas Pinckney and his colleagues at Hunch (recently acquired by eBay) built a large scale recommendation system, and then ported the technology to eBay. Thomas will be discussing how his team uses Cassandra to provide the high I/O storage of their fifty billion edge graphs and how they generate new recommendations in real time as users click around the site.
So when we think about a recommendation system, there are a bunch of different parts. There's understanding what the user's intent is: hey, they're looking for shoes. There's understanding their unique context, like they're a size 12, and so any shoe that we show them that isn't a size 12 is an irrelevant recommendation. And then there are matters of taste and aesthetics: what color, what style, what are all the intangibles in those shoes that are going to make the person really want them and like them?
Well, the ideal goal is that we have a list of everything that you like, and a list of everything you don't like. So my taste profile might be that I like hiking to this cabin in New Hampshire, I like Python, I like reading the New York Times; I don't like plaid shirts, I don't like sushi, and that was a terrible romance novel, I don't think it was one of her best. Now the problem is that this is kind of like having a full-scale map of the world.
It's perfectly accurate and not very practical. There's no way that you're really going to get this list from someone of everything they like and everything they don't like, even if you could get them to sit down and tell you these things. Study after study has shown that people aren't very consistent in their own opinions about what they like or don't like.
Just for example, there was a study where people were shown a list of things and asked whether they like them or don't like them, and then, two weeks later, they were shown the same list and asked again. Their answers were only the same about eighty percent of the time. So there's an upper bound on how much you can ask people, even if you could have infinite patience from them.
So our challenge is: how do we build a taste profile for someone? Our core thesis is that the things in your taste profile are not unrelated; your likes and your dislikes are all highly correlated. One of the things that we did previously, at a startup called Hunch that was bought by eBay about a year ago, was survey lots of people. We surveyed tens and tens of thousands of people with all sorts of questions.
One of the questions we would ask is: do you like President Obama, do you support him? And another question we might ask is: do you prefer arugula or iceberg lettuce? And what do you know, the old stereotype of arugula-loving liberals is true. So if you look like the guy at the bottom, like where I grew up in South Carolina, you statistically probably prefer iceberg lettuce, and if you voted for Obama, you have a bias toward actually liking arugula.
So this idea that things are correlated starts giving us a toehold, a way in, to try to build a taste profile for someone. So let's start thinking about this and trying to figure out how to formalize it a little bit. In this drawing, red circles represent users.
We have a user A and another person B, and then we have some things: we have, like, the Republican Party, and we have arugula, and we have all sorts of different things there. The green arrows represent a user liking something, an expressed preference; we've asked them, and they've told us they've liked this thing.
A red arrow represents a dislike; they've said they actually dislike something. So we have here user A, who said they like the Democratic Party and they like arugula, and we have some other user B, who said they like the GOP and they dislike arugula.
So say we have some new user C that we don't know a lot about; all we know is that they like the Democratic Party. Well, we sort of squint and look at this and go: from what I know here in this drawing, they probably would like arugula, because user A, the only other person in this drawing who liked the Democratic Party, also liked arugula. So this is getting at the idea that we're trying to find either somehow similar people, or similar patterns and connections to other people. So let's zoom in on a little bit more detail about how we try to think about that.
So imagine we plotted this in two dimensions, with the x-axis being, say, Obama space and the y-axis being arugula space. If we actually survey people, we see, maybe not exactly this data, but something similar to this, where maybe that user A from the prior drawing is in the upper right corner, where they like Obama a lot and they like arugula a lot; user B from the prior drawing, who liked the GOP, is in the lower left quadrant; and we have some other people in there. And now we have this new user C again, whose arugula preference we're trying to figure out.
Given that they're at a known point on the x-axis, they like Obama a certain amount, the question is: where on the y-axis do we think they fall? We can answer that question by doing something like a linear regression through the data points, and we see that there's really this better axis for understanding this data set. It's really more powerful to understand this as a matter of this concept of, maybe, political orientation, versus trying to think about it in terms of lettuce and specific candidates. We see that this user C falls on this regression line at a certain point, and we can now infer a y-value for them: how much they like arugula.
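The inference step described here can be sketched in a few lines. The survey points below are made up for illustration (the talk doesn't give actual numbers), and the fit is an ordinary least-squares regression.

```python
# Fit a line through surveyed (obama, arugula) preference points, then
# read off a predicted y-value (arugula) for a user whose x-position
# (Obama preference) is known. Data points are hypothetical.
points = [(-2.0, -1.9), (-1.0, -1.1), (0.0, 0.2), (1.0, 0.9), (2.0, 2.1)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n

# Ordinary least-squares slope and intercept.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
         / sum((x - mean_x) ** 2 for x, _ in points))
intercept = mean_y - slope * mean_x

user_c_obama = 1.5                  # user C's known x-position
predicted_arugula = slope * user_c_obama + intercept
print(predicted_arugula)
```

With these made-up points the fitted line has slope close to 1, so user C's inferred arugula preference lands near their Obama preference, which is exactly the "project onto the regression line" idea in the talk.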
We can also do some other interesting things; we're not just plotting people on this line.
We can plot things like the lettuces and the different candidates, and so we can start intuitively thinking that similarity is somehow based on distance along this line. User A is close to arugula, which captures the idea that they like each other, that they're similar. User B is very far apart from user A on this line, capturing the idea that users A and B are very dissimilar from each other. And we don't have to just do this in one dimension, around political affiliation.
We can do this around many, many dimensions, so we can find what are called latent factors that help explain people's behavior. These are factors that are typically predictive but very hard to measure directly. It's very difficult to ask someone, you know, where are you on the introversion-to-extroversion scale? But understanding where they fit on that scale is generally very predictive of what they will like or dislike.
So what we're doing here is finding things that people have clear preferences on, a specific product, a specific concept, taking their known preferences from those very specific cases, and using that to map them onto these more predictive scales. One of the interesting things here is that we didn't pick these dimensions ourselves;
the machine picked them. The algorithms picked the dimensions that were most predictive for either the millions of Hunch users or the hundreds of millions of eBay users. Interestingly, the dimensions that get picked turn out to roughly match the dimensions that sociologists use to describe people. So if you look at, like, the big four personality model, most of those first four factors about people, one of which is introversion versus extroversion, also very much show up in our data.
So in some ways it's kind of an empirical validation that at least the first few of these dimensions are some of the basic ways to think about people and understand people's differences. Now, in our particular models for doing product recommendations, we use about 50 of these different factors. Once you get past the first maybe five or six of them, it becomes very hard to figure out what they are. The algorithms aren't labeling an axis as, say, the masculine-to-feminine axis.
What it is saying is: there's an axis that stretches all the way from lipstick and mascara to cordless drills and a lot of other stereotypically masculine products; these are all, in some ways, stereotypes that the data in some ways confirms.
So now, going back to this original idea of how we build a taste profile, we have a little bit of a simpler problem. Instead of enumerating everything you like and dislike, what we now need to do is plot everyone and everything into this high-dimensional taste space, because then it's all a matter of doing distance calculations to figure out what you like and what you don't like: things that are far away from you are things that go into the don't-like bucket.
This is a little bit quick, so I apologize if I'm skipping over some details here. But going back to that original example of users A, B, and C and the lettuces, what we can do is start adding some numbers here to build a little bit more of a formal model. For every user and for every thing, I'm showing coordinates; that's where they are in taste space. So user A is at coordinate location (1, -2), user B is at (-1, 2), and the arugula is at coordinate (1, -0.5). How did we come up with those coordinates? Well, I won't go into the details, but we have this requirement: we think of every green edge in this graph, every like, as having, somewhat arbitrarily, the value two, and every red edge, representing a dislike, as having the value negative two. Then we say that the user and the item connected by that edge have to have a dot product of their coordinates that equals that edge value.
So this means that, in the case of user A and the lettuce, the dot product means: take the x value of each, multiply them together, and add that to the product of the two y values. In this case, the two x's are 1 and 1, and the two y's are minus two and minus point five; you do the math, and that does indeed equal the edge value of two.
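As a quick check, here is that arithmetic in code, using the coordinates from the worked example above.

```python
# Likes have edge value +2, dislikes -2, and the user and item coordinates
# on each edge must dot-product to the edge value.
def dot(u, v):
    # Dot product in 2-D taste space: x*x' + y*y'
    return u[0] * v[0] + u[1] * v[1]

user_a  = (1.0, -2.0)    # user A in taste space
user_b  = (-1.0, 2.0)    # user B
arugula = (1.0, -0.5)    # the arugula node

print(dot(user_a, arugula))  # 1*1 + (-2)*(-0.5) = 2.0, the "like" edge
print(dot(user_b, arugula))  # -1*1 + 2*(-0.5) = -2.0, the "dislike" edge
```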
Using this, we can figure out what the coordinates are for the Democratic Party node in this drawing. Whatever its coordinates are, they have to satisfy the constraint that the dot product with user A is two, and the dot product with user C also has to be two. If you set up the two simultaneous equations that result from this, there's a solution to those equations: x equals 2, y equals 0.
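Those two constraints form a small linear system. Here is a sketch of solving it by hand; user C's own coordinates aren't given in the talk, so the (1, 0) used below is a hypothetical choice consistent with the stated answer.

```python
def solve_2x2(a11, a12, b1, a21, a22, b2):
    # Cramer's rule for two simultaneous linear equations:
    #   a11*x + a12*y = b1
    #   a21*x + a22*y = b2
    det = a11 * a22 - a12 * a21
    return ((b1 * a22 - b2 * a12) / det,
            (a11 * b2 - a21 * b1) / det)

# user A = (1, -2) likes the node (edge value +2):   1*x + (-2)*y = 2
# user C = (1, 0), assumed, likes it too:            1*x +   0*y = 2
x, y = solve_2x2(1.0, -2.0, 2.0,
                 1.0,  0.0, 2.0)
print(x, y)  # -> 2.0 0.0, matching the solution given in the talk
```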
So that's how, given a static configuration of a known set of preferences, you might come up with the coordinates for everything. But on a site like eBay, you don't just have a bunch of data about people ahead of time, where you take all this transaction history, look at purchases, call every purchase a like, and compute everything. You've got people buying stuff constantly, new users showing up constantly, and new things for sale coming up constantly, so we can't just statically compute everything.
We have to do this incrementally, in real time, and have it work. So imagine you start in a situation, on the left side, where you have some user C who likes this SLR camera, a user A who likes this camera lens, and a user B who doesn't like that camera lens but does like this point-and-shoot camera.
We know that the camera's coordinates are set by the requirement that it dotted with A equals two and it dotted with C is also two. I don't know if you can see it in the slides here, but the coordinates of the camera and of A have changed from the left-hand side to the right-hand side, based on this new piece of information. So this is the process of folding in a new incremental piece of information to change the coordinates of where things are in taste space.
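The talk doesn't spell out the exact update rule used for this fold-in, but one standard way to absorb a single new edge is a few small gradient steps on the squared dot-product error, nudging both endpoints of the edge toward agreement. The sketch below is that assumption, not eBay's actual code, and the camera coordinates are made up.

```python
# Fold a new edge (with target edge_value) into the model by moving both
# the user vector and the item vector a little toward satisfying
# dot(user, item) == edge_value. This is plain SGD on the squared error.
def fold_in(user, item, edge_value, lr=0.1):
    pred = sum(u * v for u, v in zip(user, item))
    err = edge_value - pred
    new_user = [u + lr * err * v for u, v in zip(user, item)]
    new_item = [v + lr * err * u for u, v in zip(user, item)]
    return new_user, new_item

user_b = [-1.0, 2.0]       # user B from the earlier example
camera = [0.5, 0.5]        # hypothetical point-and-shoot coordinates
for _ in range(50):        # repeated small updates converge on the constraint
    user_b, camera = fold_in(user_b, camera, 2.0)

print(sum(u * v for u, v in zip(user_b, camera)))  # close to 2.0
```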
So now I'm going to talk a little bit more about how we actually implement this, so less about the theory and more about how we're actually building it. There are two halves to the system. One half is this asynchronous updating process that happens decoupled from individual page loads on the site. If someone comes in and purchases something, an event is triggered, goes over an event bus, and is eventually received by our updating processes, and they then add this new edge into the graph structure.
We also store all of those updated coordinates into a separate database that is used purely for serving recommendations to customers at runtime. That's the second half of the problem: when a page load comes in, say someone is looking at this compact camera, that makes a call over to our recommendation engine to say, hey, I need, say, five recommended alternative cameras this user might want to look at. We go query this recommendation database; we know the user's coordinates, so we know the coordinates of the person who is looking at the page.
We then go find all of the other cameras that have coordinates very close to this user, because that represents similarity and predicted affinity between the user and those cameras, and then we show those; maybe we take the top five cameras that are closest to the user, and those become that user's recommendations. This side of the problem is optimized basically for just very fast read-only performance, because of the ballpark volume we will probably be serving.
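The read path sketched here amounts to a nearest-neighbor lookup over the item coordinates. A minimal version, with made-up item IDs and coordinates, looks like this:

```python
import math

# Rank candidate cameras by distance to the viewing user in taste space
# and keep the closest k as that user's recommendations.
def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def top_recommendations(user_coord, candidates, k=5):
    # candidates: {item_id: coordinate vector}
    ranked = sorted(candidates, key=lambda i: distance(user_coord, candidates[i]))
    return ranked[:k]

user = (1.0, -2.0)
cameras = {
    "slr-a": (0.9, -1.8),
    "slr-b": (1.1, -2.1),
    "point-and-shoot": (-1.0, 2.0),
}
print(top_recommendations(user, cameras, k=2))  # -> ['slr-b', 'slr-a']
```

A production system at this scale would use an index rather than a full sort, but the ranking criterion is the same.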
So that's why this whole path is just optimized for read-only scalability and making things as fast as possible. Zooming into that Cassandra piece, where we store our taste graph of every user and every thing and all the connections between them: in this first version that we've built, we have about 40 billion different edges in our graph, connecting about 2 billion eBay listings, things that are for sale or have been for sale, to about 200 million different users. That's about five terabytes of data.
We replicate everything twice, so it's about ten terabytes of data, and this is a very small starting piece, since we're only looking at a few key signals right now, things like purchases and bidding on an auction. There are a whole bunch of other signals that we want to bring into the graph to express new types of connections between people and things, and that will probably result in roughly a quadrupling of this. So we estimate we'll eventually be at around nearly 200 billion edges in our graph stored in Cassandra.
We're in the process of upgrading from a 16-machine to a 32-machine cluster. These are beefier machines, so we're looking forward to that beefier-node support in the future. The machines are connected to an SSD disk array over iSCSI, with 10 Gigabit Ethernet. This is not necessarily the ideal set of hardware SKUs for this, but for a variety of reasons it's a kind of preferred SKU within eBay, so it's what we use here. It's got a huge amount of I/O capacity.
The disk array can do about 400 to 500 thousand I/Os a second. There's also a lot of RAM per machine, obviously, so it's a little bit different than maybe the ideal SKU. We're using Cassandra 1.0.8. We're on size-tiered compaction for now, moving, hopefully shortly, to leveled compaction to improve some of the read latency. We want to get more apps, some other apps besides this recommendation app, running on this data store, and some of them have even stricter read-latency requirements.
So we think that leveled compaction will be great for us, and we're also really looking forward to bloom filters and things like that being off-heap. We're running with an eight-gigabyte heap right now; beyond that we start seeing some pauses that, again, affect latency. So we're really looking forward to getting stuff off the heap.
The schema for how we're storing this data is that the edges are basically wide rows. For a given user, if we represent, say, ten edges between them and ten things that they have bought or bid on on the site, we just keep adding new columns to represent each edge. And there are different types of edges, like a purchase versus a bid; we treat them differently in terms of how we weight them in some of our calculations, so we're recording the edge type.
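A plain-Python sketch of that layout (the actual Cassandra column names aren't given in the talk, so the ones here are assumptions): one wide row per user, with one column per edge keyed by item and edge type, so a single row read returns all of a user's edges.

```python
from collections import defaultdict

# Model of the wide-row layout: row key -> {column name: column value}.
# Each new purchase/bid appends a column to the user's row.
edges = defaultdict(dict)

def add_edge(user_id, item_id, edge_type, weight):
    edges[user_id][(item_id, edge_type)] = weight

add_edge("user:42", "item:1001", "purchase", 2.0)
add_edge("user:42", "item:2002", "bid", 1.0)   # bids weighted differently

# A single "read" of the row key gives every edge for that user.
print(edges["user:42"])
```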
Our nodes are not wide; there's basically one taste vector, one coordinate, per node. So our read performance, if you look at the number of keys read per second, is dominated by nodes, because a single read gives us all the edges for a user or an item. Today, at our busiest time, we're doing about fifty thousand reads a second off of the cluster, and this was actually benchmarked against the 16-machine cluster.
So we think that, as we double or quadruple the amount of data, we're going to continue to be able to get very good read scalability from our 32-machine cluster. Writes per second are now around somewhere between three and four thousand, and we're doing a write every time someone does some action on the site that causes us to update their coordinates, their position in taste space; so today, purchases, watches, bids, things like that.
So that's how we're thinking about building, and working on, our next-generation recommendation system, and how we're approaching the taste and aesthetic modeling part of the system at eBay. I'm happy to take some questions, and also happy to have Vulcan here, the engineer who actually built all of the Cassandra side; they can get into tons more details than I probably can.
[Audience question]

That's a great question. We think it's more powerful to understand the user versus understanding item-to-item affinities. It may be that you and I are both looking at a book, but when you think about the other books that you might want to recommend to us, that's partly informed by the book we're looking at, and it's also partly about who we are. If you just do item-to-item similarity, you lose that second dimension of the problem.
There might be many different types of people that are all interested in a particular book, but for slightly different reasons, and we want to be able to capture that, so that the other books we recommend take advantage of that additional information. There's another difference between the eBay recommendation problem and nearly every other retail recommendation problem out there, which is that eBay's listings are all sort of unique.
They have little in common with each other. So you're not going to see, like on Amazon, that 10,000 people bought this book and then a disproportionately large number of those 10,000 went off to buy this other book, because both books have been for sale on Amazon for two years; that's how you can accumulate 10,000 data points about each one, and that's why you can do that item-to-item similarity directly. That's never going to happen on eBay.
[Audience question]

No, the buffer cache still gets it; extra RAM still gets used for the buffer cache. The working set, the data file set, is bigger than physical RAM, and so that buffer cache is very valuable. There also is some ability to store some data off-heap already in the existing versions of Cassandra, so we use some of that. But the simple answer is: buffer cache.
[Audience question]

That's a good point, that there are several parts: you can't just make a recommendation, you need to be able to justify it and explain it to people. You might have the best idea in the world for what someone should buy, but if you can't explain it to them, then they're not going to necessarily understand your brilliance. So partly, one of the things we think about is how you try to justify this, and there are very simple things that people do, like
the old "people who bought this bought that". You're trying to provide an explanation: hey, many other people have done this, maybe you should look into it too. There are other things you can try to do; in the case of books, you can say, hey, more by this author. These are very basic things, but they do help explain why you're making the recommendation. The other part of it, though, is that if we do make a recommendation, we track and log it.
We know that we showed you these five things and you didn't click into any of them, or you clicked in, which is a little bit harder to interpret. If you didn't click into any of them, does that mean you just didn't look at them, and so didn't notice our recommendations, which may be lower on the page, or does it mean they were all really bad? It's hard to figure out. If you click on one of them but not the other four, then that actually tells us something pretty informative.
That means that, of those four or five recommendations we showed you, the one you clicked on was somehow the best one (if you went on to buy it, even better), and that somehow the other ones were less good recommendations, so we can infer some negative feedback against those from you. So over time, our model will start getting a little bit smarter about you not liking those kinds of items.
[Audience question]

It's one of these things: on the Hunch side, before we were part of eBay, we used things like who you followed on Twitter and what you had liked on Facebook, and that was very, very informative. It was less informative exactly about who your friends were on Facebook and more informative about what your interests were, and on Twitter it was very informative about who you followed. In the shopping context, you've got a bunch of different challenges. A lot of times people come to the site and they're not logged in.
The first thing you do when shopping usually isn't log in. Unfortunately, the way a lot of retail works is that you log in only at the end of the process, so even knowing who you are at the start, much less who your friends are and things like that, is not immediately available all the time.
The second thing is just that a lot of people say: look, why are you asking me who my friends are? I want to buy this pair of shoes. There's maybe five percent of people who are excited about that, and they want to say, hey, what do you think about these shoes I think I'm buying, look at the great deal I got; they do want to engage with their friends about it. The other ninety-five percent just want a good price, and they want to find it and get on with their life.
[Audience question]

Yeah, so the question was that dealing with finding latent factors in a large, sparse matrix is very hard. The matrix here is every user versus every thing, and for every cell we have: did you buy it or bid on it? The vast, vast majority of these cells are empty, because most people have not bought or bid on most things. You know, maybe one in a hundred thousand, one in a million, one in ten thousand of these cells are filled in, and all the others are unknown.
They're question marks, and our job is to fill in the rest of those cells with a prediction about whether you would like this or not. It's computationally very difficult to find latent factors in a very large, sparse matrix like this. So the high-level approach is a method called alternating least squares, where in some ways you can think of it as: start with a set of random factorizations, a bunch of random coordinates for people, and then incrementally improve them.
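A toy sketch of that alternation, using one latent factor per node and the arugula example from earlier in the talk: hold the item factors fixed and solve each user's factor exactly by least squares, then swap sides, and repeat. A fixed initialization replaces the random one mentioned above so the run is reproducible; real systems use many factors over a huge sparse matrix.

```python
# Known edges: +2 for a like, -2 for a dislike.
likes = {("a", "arugula"): 2.0, ("a", "dem"): 2.0,
         ("b", "arugula"): -2.0, ("b", "gop"): 2.0}

users = {"a": 1.0, "b": -1.0}                     # 1-D factors per user
items = {"arugula": 1.0, "dem": 1.0, "gop": -1.0}  # 1-D factors per item

def update_users():
    # With items fixed, the 1-D least-squares solution for each user is
    # sum(r * v) / sum(v * v) over that user's known edges.
    for u in users:
        num = sum(r * items[i] for (uu, i), r in likes.items() if uu == u)
        den = sum(items[i] ** 2 for (uu, i), r in likes.items() if uu == u)
        users[u] = num / den

def update_items():
    # Symmetric step: solve each item's factor with users held fixed.
    for i in items:
        num = sum(r * users[u] for (u, ii), r in likes.items() if ii == i)
        den = sum(users[u] ** 2 for (u, ii), r in likes.items() if ii == i)
        if den:
            items[i] = num / den

for _ in range(10):          # alternate the two half-steps
    update_users()
    update_items()

# Every known edge is now reproduced by the (1-D) dot product:
print(users["a"] * items["arugula"], users["b"] * items["arugula"])
```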
[Audience question]

A little bit of both. We did try a bunch of different systems, and generally, because of the scale, one of the things is that even if it doesn't necessarily at first seem to make sense to spend effort optimizing something, every five percent turns out to actually matter a lot here, because the size is so big. So building something ourselves, as long as it wasn't too complicated and too hard, made sense, and in this case it turned out it's actually not that hard to represent a graph in Cassandra; it's pretty straightforward.
It made sense for us to do that. We also already had all the algorithms we were running on top of that graph. One of the reasons to sometimes use an off-the-shelf graph database or graph package is not necessarily the graph representation but the algorithms that run on top of it, and we already had the algorithms we were running on top of it, so that just took away another reason to use one. But I'm not swearing that we made the right decision.
[Audience question]

So this sort of next-generation platform is basically in very limited testing in North America right now. At peak, maybe ten percent of people going to eBay, on certain pages, will see recommendations powered by this system; it's basically in the very, very early stages of being rolled out. I've now been at eBay for about a year.
So I was at this company Hunch before, and I know eBay had spent a number of years looking at both outside vendors for this problem as well as building an internal solution. I think the reason that, frankly, all of the external vendor solutions had failed previously was getting around these problems of how real-time this has to be: people and things for sale appear constantly, they disappear constantly, and then, separately, the idea that there's no catalog, no product catalog that says these are all the same item.
Maybe they're not the same; maybe they're subtly different due to condition, because they're used, and things like that. So a lot of the outside, sort of historic solutions that people had tried and looked at really depended on this item-to-item collaborative filtering, where you had a strong sales history for a SKU and you could see what other SKUs people went on to buy, and that pattern took months, quarters, years to develop and be discoverable. That's just not a situation that exists at eBay; nothing sticks around for months.
[Audience question]

Yep, and those are all really hard problems that I won't say we've solved. On one piece, you asked a couple of different things: for example, maybe I click into one item, but it's not really right for me. There are some simple things we can do, like how quickly you bounce back with the back button. If you click into something but immediately bounce off, that's actually probably more of a dislike than a like.
That gets hard, especially in a multi-device world, because I saw it on my computer, but then I searched for it on my iPhone when I was waiting for the train, and I did both while not logged in, so you don't know that it was Tom in both cases. The cases where it gets easier are where I do it on the same device, or I'm logged in in both cases, and you can sort of stitch sessions together over time and say that, well, actually, two weeks ago Tom was exposed to this.
So we're honestly using only a tiny fraction of the data we could be using, and that's why I call this kind of the v1 version, where we're looking at a couple of very basic factors or behaviors of people, like what they've purchased, what they've bid on, what they've watched. We have a list of literally nearly 100 additional factors that we would like to use.
It's just that we're taking a fairly deliberate process. Every time we consider a new factor we add to the model, we see if the model's predictive power goes up or down, through analyzing things like prior purchase histories, and doing things like crowdsourcing human judgments, where you get a few thousand people to look at the new algorithm versus the old algorithm. So it's a fairly laborious process to add additional factors to the model.
Sure. So this work sort of veers a little bit into all those other parts of making a good recommendation that are not about taste. Going back to, say, a book recommendation: part of the issue is, well, does the user even want books right now? Generally on eBay we sort of solve that by being fairly simplistic: if you're searching for books or looking at books, we take that as your intent, and we show you more things like that. So we can solve the intent part.
This is complicated a little bit by the fact that it's hard to actually figure out whether two books are exactly identical. They may have different ISBNs, but one's a hardback, one's a softback, one's an audiobook, and we don't want to still recommend that same book. So there's clustering and text-feature analysis that we try to do to understand that, still, topic-wise, these books are all too similar to what you've already bought.
Let's try to find other stuff, and then it becomes this taste question: of those other books that are maybe in the same genre that you've recently been looking into, say historical nonfiction, what are the books that taste-wise match you best? So we sort of think of it as generating a recall set first: here are a thousand books that represent all of the kind of objective criteria that we think you're going to want; it's a book, and it's historical nonfiction.
Maybe a thousand books, and that's one thing that, over time, we want to grow more and more, and there's a lot of work where we're trying to figure out how. This can leverage a lot of sort of geospatial indexing, because a lot of this querying becomes, you know, find all of the items in a sphere that are closest to me; so there are better things to do in terms of how to query efficiently.
[Audience question]

Basically every week, Vulcan here buys more machines for Cassandra, though, in fairness, we're also buying more MySQL machines. But to get to your original question about why we chose this: theoretically, we think that this is the right architectural solution for a very high write-scalability problem that inherently has a very data-parallel structure. Our recommendations, even though there's this graph that does connect everyone to everything, can still be kind of divided into nice parallel updates. When you do write updates, you update your immediate neighborhood, and when I do updates to myself, I update my immediate neighborhood, and those can go on concurrently in different parts of Cassandra. So architecturally, that high degree of write concurrency matched our problem, and we thought that Cassandra was architecturally the right solution to it.
[Audience question]

We are using it in other sections of eBay. We're using it here to store this graph of people and things, and, I'm ashamed to admit, I actually don't know in detail some of the other applications. I know one of the other ones is around some social signals: when people do Facebook Connect or tweet things, logging those connections. I think those applications are a little bit more logging-oriented, where the data is written and then queried less and overwritten less.
[Audience question]

I mean, I think one thing that we've thought about is whether we could push computation into the Cassandra cluster itself. Right now, our pattern is basically to read a ton of data from Cassandra into app servers; we generate a series of linear equations, we solve them, and we write the results back.