Description
Presented by Flora Dai, Pandora.
Today, data is produced on a massive scale, and the ability to retrieve it accurately and efficiently in real time is essential to making use of the information we store. We will discuss how to implement an efficient search with open-source technologies and how Pandora applied it to music curation.
About GitHub Universe:
GitHub Universe is a two-day conference dedicated to the creativity and curiosity of the largest software community in the world. Sessions cover topics from team culture to open source software across industries and technologies.
For more information on GitHub Universe, check the website:
https://githubuniverse.com
Hi, my name is Flora Dai, and I'm here to talk to you about searching music efficiently with open source. So, a little about me: that's my face. I'm a software engineer at Pandora, where I work on the content engineering team. I graduated from UC Berkeley, and I have one animal that's very near and dear to my heart: that's Alfred, or Alf for short. He did not willingly put on that costume, but in exchange, he's quite the cat. So, you win some.
So, in order to talk to you about how we built an efficient search system for the curation team at Pandora, I'm going to give you a little background on Pandora's content systems. We'll go over how this produced a search problem and how it informed our design goals. Then we'll cover the implementation and infrastructure, and wrap up with the results and their impact. So let's get started: at a high level, what makes up Pandora's content systems?
Historically, Pandora built its catalog through our curators physically buying CDs and ripping them. As music consumption moved away from CDs and toward MP3s, our curators got content by making purchases through online music retailers such as iTunes and Bandcamp. This music is curated and analyzed, and it forms what we refer to as the library: the content that has powered Pandora Radio and contributed to the Music Genome Project for many years. However, in the world of on-demand content and streaming, the need for music to be immediately available requires a different system for getting music.
So, with that, Pandora inked deals with content partners to provide music in the form of DDEX messages. So what are DDEX messages? DDEX stands for Digital Data Exchange, and these messages are an XML-based standardization of commercial data. For music, this is information like what you would normally see in a CD pamphlet or on a purchase page.
That's information like the title, the artists, the track listing, the producer, the musicians on each track, and so on and so forth. This new content delivery system, which I'll refer to as ingestion, provided content on a much larger scale than how our curators had gotten content into the library before. On average, our ingestion system handles the delivery and storage of thousands of tracks a day. But now we have two separate content systems storing our content.
We have the library, which has all the curation and musical analysis tied to it, as well as the ingestion system, with DDEX-formatted content and no musical analysis tied to it whatsoever. The tools our curators use are built on our legacy frameworks; they're built on older, outdated frameworks, and they're hard to use and clunky. They allow visibility into the library, but they don't have the means to actually see into our ingestion system. So we needed tools to help our curators and to facilitate getting content into Pandora.
So the question becomes: how do we bridge this gap? Well, first, we attempted the simple, naive solution. Why not do it the easy way? Utilizing existing infrastructure, we built a simple UI using existing APIs from our ingestion and library systems. This was the fast and easy way. So, obviously, what went wrong? Well, first off, the ingestion API's response times varied tremendously. If we were pulling a single object, it returned within normal milliseconds, but for search queries,
it would take seconds to minutes only to time out, and our curators would see a visible result showing that we couldn't get them their results because the request itself timed out. This led to our curators not being able to find content and going out to purchase it again, duplicating content within our system.
Another reason is that the search syntax itself was SQL-based. Ingestion used Postgres, and that meant our API requests had to send arguments in the form of SQL syntax.
This isn't the friendliest of user queries, and it added more to our curators' workload, as they had to become familiar with a clunky syntax. Thirdly, the ingestion API didn't have a good scoring system to determine which results mattered, so sorting on basic parameters like the release date, or sorting by artist name alphabetically, prompted expensive database queries, which again led to a delayed response or a timeout.
So, for example, if a curator was searching for Drake and saw all these results, they may actually be looking for the famous Drake that we all know, of Hotline Bling fame, or they might be looking to expand our catalog of underground rap artists and actually be looking for — I think it's pronounced Drakeo the Ruler. I did Google him; I'm not quite sure who he is, but he is a rap artist that you can apparently find on SoundCloud and YouTube.
So, with this, our curators had the problem of trying to find the content they were looking for while having to sift through a lot of noise. And then finally, we relied on asynchronous calls to provide our curators with a full view of content through the pipeline. In order to make these asynchronous calls to retrieve metadata, we rely on identifiers from our ingestion system, and because we weren't getting a response from our ingestion system, or the requests were timing out, this resulted in either delaying the request or not making it at all.
So, with all these issues, our curators still had to use their legacy tools, with no improvement to their workload and barely any visibility into our ingestion system. Clearly we needed a new solution, and we wanted to implement a search system that was actually usable. So what do we need? Well, first off, it needs to be fast — and FYI, these are all just going to be cat GIFs. The response needs to come back quickly. It needs to be reliable, just like this cat defending.
We want our response to actually return in time and be consistent. And leading into that, it also needs to be efficient. Content, as I said, is coming into our ingestion system at a high volume, and we need to be able to handle searching over a large, growing set of objects. It also needs to be up to date.
If we're storing all this information to allow our curators to search it, we want everything we retrieve to reflect what the data is at the time of the search. And then finally, it needs to be user-friendly. It doesn't make sense to build a search system that cannot be used by our users. Creating a search system from scratch is hard, complicated, and time-consuming.
So why reinvent it when there are other solutions out there? Let's dive into what went into building our search system. In deciding between open source and third-party, we considered the design goals mentioned before and what we needed to meet those requirements. We also considered that music is incredibly variable, and we wanted flexibility in our solution to handle any changes in content delivery.
We believed open source would match those design goals as well as provide the flexibility we were looking for, and in looking into open-source solutions, we went with Kafka and Solr. So let's go into more detail on Kafka and Solr and why they met our goals. First off, there's Kafka. For those of you who may not know, Kafka is a distributed streaming platform that uses message queues to support distributed data processing. At a high level, producers write messages to topics.
Consumers consume those messages, recording what they've read by tracking an offset, and they handle the data however they've been designed to. So why Kafka? Well, first off, it's an established open-source project, originally developed at LinkedIn. It's been around for a while, and it's used for multiple use cases, commonly log aggregation and stream processing of data. And leading into that,
Kafka messages are produced in near real time, so our search system gets its data in near real time. The retention time for Kafka messages can also be configured, so messages can be re-delivered and re-consumed; we can rely on Kafka to be reliable and give us fault tolerance. So Kafka delivers content data to us in a fast and reliable way.
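As a rough illustration of that consumer side, here is a minimal sketch, in plain Java with the standard Kafka client, of a consumer that subscribes to a couple of topics and commits its offset after handling each batch. The broker address, consumer group, and topic names are placeholders, not Pandora's actual configuration.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class IngestionConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("group.id", "content-search-indexer");    // hypothetical consumer group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("enable.auto.commit", "false");            // commit offsets only after indexing

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            // Hypothetical topic names; the talk only says "multiple topics".
            consumer.subscribe(List.of("ingestion.albums", "ingestion.tracks"));
            while (true) {
                ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, byte[]> record : records) {
                    // parse the payload and update Solr here
                }
                consumer.commitSync(); // record our offset once the batch is handled
            }
        }
    }
}
```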
So how do we handle that data once we have it? Where do we store it, and how do we surface it back to our users? Well, a big portion of that is due to Solr. Solr is likewise a reliable and scalable search platform, and it provides indexing and full-text search. A big reason we chose Solr — for those of you wondering, "why not Elasticsearch?" — is that we had previous developer experience with it.
Another reason is Solr's schema flexibility. For music, there are actually multiple types of objects that matter to our curators, such as albums and tracks. With a flexible schema, we can represent the multiple types of objects that we needed for our curators.
Also, Solr lets us customize the indexing, so we can normalize however we need using tokenizers and filters. Solr communicates through REST APIs, which allows easy integration into the web-based UI platform that surfaces the data to our curators. And finally, Solr has a set of features called SolrCloud that lets you configure what each node in a cluster does, as well as maintain replicas, so that we get fault tolerance.
So our search system is made up of the Kafka messages, a parser, and the Solr documents. Zooming in, we're going to follow some arrows: Kafka messages come in and are routed and parsed, and when we parse the data, we either update an existing Solr document or create a new one. Then, as users send in search queries, each user query is translated into a Solr query, and the Solr query returns results back to the user.
So let's break it down even further and focus on the Kafka and routing implementation. For our search system to get the data it needs, we listen to multiple Kafka topics. These topics have messages coming in at various rates — whether it's 100 per minute or 100 per second really depends on which system is producing and what it's handling at the moment. Once a new message comes in on any topic, we use Apache Camel and Spring to route messages to corresponding functions that parse the data in the message payload into key-value pairs.
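As a sketch of what that routing could look like with Apache Camel and Spring, the route below sends each topic's messages to a type-specific parser bean. The topic names, bean names, and endpoint options are assumptions for illustration, not the production routes.

```java
import org.apache.camel.builder.RouteBuilder;
import org.springframework.stereotype.Component;

// A minimal sketch of Camel routing from Kafka topics to type-specific parsers.
@Component
public class IngestionRoutes extends RouteBuilder {
    @Override
    public void configure() {
        from("kafka:ingestion.albums?brokers=localhost:9092&groupId=content-search")
            .routeId("album-route")
            .bean("albumParser", "toSolrFields");   // hypothetical parser bean

        from("kafka:ingestion.tracks?brokers=localhost:9092&groupId=content-search")
            .routeId("track-route")
            .bean("trackParser", "toSolrFields");   // hypothetical parser bean
    }
}
```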
Is that better? All right. Ironically, for a music company, we keep having audio issues. It's not my fault, but I am cursed like that. So, we're actually writing two different functions, because again we're trying to represent multiple objects — in this case, we're looking at albums and we're looking at tracks.
What's interesting about the message is that the payload of our Kafka messages is actually the API response for the object, in JSON, compacted with MessagePack. What works about this is that we don't have to make API requests to our ingestion system. As I mentioned before, our ingestion system had issues where we would have problems retrieving results, so this freed us from worrying about timeouts.
In that respect, it's also similar to how other message systems are handled, so we didn't have to ask the ingestion team to create new topics for us. What created some problems is that, even with the message payload compacted, the messages can be huge, and they can still arrive at an extremely high rate. Handling all this information,
or all these messages, our consumers could get pummeled, and we would have to manually intervene. To help monitor that, we use Kafka Manager, an open-source Kafka tool originally developed at Yahoo. In the screenshot, you can see that we can look at our lag, and we can look at the partitions that allow us to parallelize consuming messages from topics.
We also have an internal tool that allows us to grep by the message ID to view the message itself. Together, these let us monitor the lag and figure out why and where it's happening. So let's dive deeper into the Solr aspect — space joke.
For our Solr setup, our schema is thirty-some fields, and this is actually just a portion of the API response that was in the Kafka message. We worked with our curation team to pinpoint the fields that are relevant to them in search. From here, we use identifiers to see whether the document already exists.
If it does, we merge the document values, updating fields that have changed and keeping the fields that haven't. The reason we actually have to update Solr documents is that, with music, there are multiple reasons why we would get new deliveries or why content providers would resend messages. There could be a typo, or some business-logic change, and so they have to resend the message. That can affect values that we store, and so we have to update our documents.
If there are no matching identifiers and no Solr document exists, we create one. Then, finally, each field that we've updated or created goes through filtering and indexing.
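A minimal sketch of that create-or-update step using SolrJ is shown below: it fetches by identifier and either applies an atomic "set" update to the existing document or adds a new one. The collection URL, id scheme, and field handling are assumptions for illustration, not the actual Pandora code.

```java
import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class ContentIndexer {
    private final SolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/content").build();

    public void upsert(String id, Map<String, Object> parsedFields) throws Exception {
        SolrDocument existing = solr.getById(id);
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        if (existing != null) {
            // Atomic update: only "set" the fields that arrived in this delivery;
            // fields not mentioned keep their previous values.
            parsedFields.forEach((name, value) -> doc.addField(name, Map.of("set", value)));
        } else {
            parsedFields.forEach(doc::addField); // brand-new document
        }
        solr.add(doc);
        solr.commit();
    }
}
```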
We use a lot of the filters that come out of the box with Solr, such as the edge n-gram filter, which breaks words into n-grams that let us match queries in a search-as-you-type way. Another important one that we use in conjunction with that is the stop word filter.
For the stop word filter, we actually worked with our curation team to analyze which words are so common that they just add noise to a search result. Paired together, we score highest the results whose indexed n-grams match the search query most closely, against values that have been stripped of unhelpful words.
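To make the combination concrete, here is a toy, plain-Java illustration — not Solr configuration — of what indexing with a stop word filter plus edge n-grams produces: stop words are dropped, and each remaining term is expanded into its prefixes, which is what enables search-as-you-type matching. The stop word list and gram sizes are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class EdgeNGramDemo {
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "of"); // illustrative list

    public static List<String> analyze(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (String term : text.toLowerCase().split("\\s+")) {
            if (STOP_WORDS.contains(term)) continue;      // stop word filter
            int max = Math.min(maxGram, term.length());
            for (int len = minGram; len <= max; len++) {
                grams.add(term.substring(0, len));         // edge n-grams (prefixes)
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints ["ho", "hot", "hotl", ..., "bl", "bli", "blin", "bling"]
        System.out.println(analyze("Hotline Bling", 2, 10));
    }
}
```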
We also designed our Solr schema as a single schema, and what I mean by this is that we represented multiple types of objects within one schema.
Whatever the release type is, we can now look across multiple release types, like albums or tracks — at one type specifically, or at everything in general. What's great about this is that we don't have to do cross-schema searches; just like joins in databases, those are expensive and slow. What's not so great about doing this is that we're overloading fields.
With SolrCloud, we have split our index into four shards, with a replica on each of three servers, which results in twelve shard replicas. This, again, gives us replication and fault tolerance.
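For reference, a layout like that could be created with the SolrCloud Collections API; the sketch below uses SolrJ with placeholder collection, config, and ZooKeeper names and is only meant to show the shard/replica arithmetic.

```java
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateContentCollection {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                java.util.List.of("zk1:2181"), java.util.Optional.empty()).build()) {
            CollectionAdminRequest
                .createCollection("content", "content_config", /*numShards*/ 4, /*numReplicas*/ 3)
                .process(client);   // 4 shards x 3 replicas = 12 shard replicas
        }
    }
}
```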
And lastly, let's look at the parser, which doesn't really map to either Kafka or Solr. The reason for it becomes clear if we look at an example Solr query — this one is from the Solr documentation, and it's, well, quite a query.
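The parser is our own layer between the curators and that raw Solr syntax. As a hedged sketch of the kind of translation involved — the field names, boosts, and filter ranges here are invented for illustration — something simple a curator types, like an artist name and a year, gets expanded into the full Solr query:

```java
import org.apache.solr.client.solrj.SolrQuery;

public class CuratorQueryTranslator {
    public SolrQuery translate(String artist, Integer year) {
        SolrQuery query = new SolrQuery();
        // Boost artist-name matches over title matches (illustrative fields/boosts).
        query.setQuery("artistName:(" + artist + ")^2 OR title:(" + artist + ")");
        if (year != null) {
            query.addFilterQuery("releaseDate:[" + year + "-01-01T00:00:00Z TO "
                                 + year + "-12-31T23:59:59Z]");
        }
        query.setSort("score", SolrQuery.ORDER.desc);
        query.setRows(25);
        return query;
    }
}
```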
So how did we do? Comparing our initial implementation — the UI on top of the existing APIs — versus our new search system, the results included speed: we improved our response times from a range of 3 seconds or longer to roughly 200 to 300 milliseconds. We became more reliable in that we no longer had to make calls to the ingestion API, resulting in no more timeouts.
We were efficient, in that Solr gave us a way to manage searching over a large, growing set of data and to provide those results in an up-to-date manner, because the messages coming in from Kafka arrive in near real time and the data is indexed by Solr. And then, finally, we created a user-friendly syntax that made more sense to our curators and allowed them to actually use this tool. Beyond that,
what our search system helped do is create a new interface with access to both content systems — the ingestion system and the library. This helped facilitate product launches into the on-demand streaming space and, most importantly, it has become the foundation for future internal-tool development for our curators. Currently, the search system is used not only by our curators but by other internal content teams at Pandora. As for improvements we've made on top of our Solr–Kafka search system: for Kafka, to handle errors outside of our system,
we use Apache Camel to implement a retry-and-backoff scheme, exponentially increasing the delay between reprocessing attempts for messages coming in on reprocessing topics.
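A minimal sketch of such a retry route in Camel, with made-up topic and bean names and delay values:

```java
import org.apache.camel.builder.RouteBuilder;

// Exponential retry/backoff for messages that fail while being reprocessed.
public class ReprocessingRoute extends RouteBuilder {
    @Override
    public void configure() {
        errorHandler(defaultErrorHandler()
                .maximumRedeliveries(5)
                .redeliveryDelay(1000)        // start at 1s ...
                .useExponentialBackOff()
                .backOffMultiplier(2));       // ... then 2s, 4s, 8s, 16s

        from("kafka:ingestion.reprocess?brokers=localhost:9092&groupId=reprocessor")
            .bean("reprocessor", "handle");   // hypothetical bean that may throw
    }
}
```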
For Solr, we decreased inconsistencies during reindexing. A reindex happens either because we are adding or changing fields in the schema, or because there is some mass update from an upstream system like ingestion, and this makes our search system inconsistent for a certain period of time.
We have implemented a manual fix to force-update documents as a temporary stopgap, but more could be done. We also need to manage our backups and log rotations: for our search system, our Solr cluster has actually gone down because we ran out of space and memory.
We could also improve our monitoring to watch our memory usage and make sure we avoid those outages before they happen. For Kafka, we could expand our usage into our legacy systems, so that we can move away from them by using Kafka to send those messages, and we could also improve the monitoring there.
As I mentioned, we're using Kafka Manager and an internal tool, but we could do more to monitor when we're getting high bursts of messages and coordinate with internal and external teams to manage when those high streams of messages come in. And there are alternatives: we specifically picked Kafka and Solr, but for your case, in creating a search system, maybe Elasticsearch makes more sense, maybe RabbitMQ makes more sense.
There are a lot of options out there thanks to open source, and that's the cool part of it. So, in summary: efficient search can be implemented with open source, and it's scalable and flexible to your design needs and goals. And lastly, I want to thank all the people at Pandora — my colleagues who have contributed to, and continue to use and improve, our search system — with special thanks to Sam Douglas, Omar Cruz, Josh Proline, Sharda Paige, Damon Wood, and Tom LeClair for their continued work on our search system. Yeah, that was super fast, guys. Thanks.