Experiments with Annoy, Elasticsearch, R, Nearest Neighbour

1100 Words | Approximately 5 Minutes Read | Last Modified on February 18, 2019

Finding “similar” music tracks

I’ve been working with an economist who’s studying the shape of the music industry in Eastern Europe. He’s the brains - I’m wrangling the data. One thing he wanted was some high-level audio features describing the music. We use essentia to calculate low-level audio features, but we also have the Spotify ids for our 2 million tracks, so it seemed pretty straightforward to hit the Spotify features API and retrieve some higher-level features.
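For what it’s worth, fetching those features from R looks roughly like this - a sketch using httr against Spotify’s batch audio-features endpoint. `get_audio_features` is my own illustrative helper, and `token` is assumed to be a valid OAuth access token you’ve obtained elsewhere:

```
library(httr)
library(jsonlite)

# hypothetical helper: fetch audio features for a batch of Spotify track ids
# (the endpoint accepts up to 100 comma-separated ids per request)
get_audio_features <- function(spotify_ids, token) {
  res <- GET(
    "https://api.spotify.com/v1/audio-features",
    query = list(ids = paste(spotify_ids, collapse = ",")),
    add_headers(Authorization = paste("Bearer", token))
  )
  stop_for_status(res)
  fromJSON(content(res, as = "text", encoding = "UTF-8"))$audio_features
}
```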

These are numbers for things like “danceability” and “speechiness”. I’ve recently been reading about Nearest Neighbour for classification and thought I’d experiment with Spotify’s Annoy library for Approximate Nearest Neighbours. I chose to write the code in R and found RcppAnnoy, the R bindings for annoy. Here’s a slidedeck giving you an idea of how annoy works.

Once the index is built, we should be able to query for tracks which have the least “distance” to an original track - which is to say the features are similar. This isn’t really solving a useful problem, but it was fun to experiment.
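As a toy sketch of that idea, here’s a tiny RcppAnnoy index over three 2-dimensional points; the real index further down does exactly the same thing at scale:

```
library(RcppAnnoy)

a <- new(AnnoyEuclidean, 2)   # 2 dimensions per item
a$addItem(0, c(0.10, 0.10))
a$addItem(1, c(0.90, 0.90))
a$addItem(2, c(0.15, 0.12))
a$build(10)                   # 10 trees

# nearest neighbours of item 0: itself first, then item 2,
# because (0.15, 0.12) is much closer to (0.10, 0.10) than (0.90, 0.90) is
a$getNNsByItem(0, 2)
```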

Grab all the data

I polled the Spotify API and pulled down all the features as JSON documents into elasticsearch - I want to build an annoy index for the audio features.

Initially I created an annoy index with space for 2 million dimensions per item (I meant to allocate space for 2 million items) and quickly ran out of memory. The RcppAnnoy docs aren’t great, and I had to refer to the code for things I couldn’t work out from Spotify’s annoy docs.

I also wasn’t selective enough in the features I was using. All your dimensions need to be on the same scale (in this case 0 -> 1); if you have an integer bpm up in the hundreds, that’s going to warp your distances.
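A simple way to deal with that is min-max scaling before insertion. A sketch - `scale01` is my own helper, and the bounds are assumptions you’d pick per feature:

```
# clamp-and-scale a raw feature into 0..1 given assumed lower/upper bounds
scale01 <- function(x, lo, hi) pmin(pmax((x - lo) / (hi - lo), 0), 1)

scale01(c(60, 120, 180), lo = 0, hi = 300)  # 0.2 0.4 0.6
```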

I was also trying to preserve the existing ids. My features relate to audio tracks in the state51 system, and those all have an unsigned MySQL bigint for an id (our max instance id is currently 5263214235). Annoy uses an int32 for its item index, so I was going to overflow it. It was Friday afternoon, so my best idea was to just create 4 annoy indexes. When you insert a record into annoy with a high id, it allocates space for everything up to that id. You see my problem, which I’d managed to multiply by having 4 indexes for 2 million records: I didn’t need all that space allocated, and I ran out of memory pretty quickly. Newer versions of annoy have a “build on disk” option, but that isn’t currently supported in the R bindings - and it’s just as well, since I would have had a huge and very sparsely populated annoy index.

So, the next attempt, which only needs one annoy index. We scroll through the index in ES and add the features from each document into an annoy index with an incrementing id. We keep an additional array of instance ids where the array index is the incrementing id.

library(elastic)
library(RcppAnnoy)

dimensions <- 9
total_records <- 2000000
es_records_at_once <- 10000

connect(es_port = 9200, es_host = "es01")

annoy <- new(AnnoyEuclidean, dimensions)

# for holding our index of instance ids. init to the fullest size
instance_ids <- vector("list", total_records)

# We are going to walk the whole index. Note: the initial Search already
# returns the first page of hits as well as the scroll id, so process each
# page before fetching the next one.
res <- Search(index = 'spotify_audio_features', time_scroll = "1m", size = es_records_at_once)

count <- 0
hits <- length(res$hits$hits)
while (hits != 0) {
  for (query_features in res$hits$hits) {
    count <- count + 1

    v <- c(as.numeric(query_features$`_source`$`api_result`$`speechiness`),
           as.numeric(query_features$`_source`$`api_result`$`energy`),
           as.numeric(query_features$`_source`$`api_result`$`danceability`),
           as.numeric(query_features$`_source`$`api_result`$`liveness`),
           as.numeric(query_features$`_source`$`api_result`$`acousticness`),
           as.numeric(query_features$`_source`$`api_result`$`instrumentalness`),
           as.numeric(query_features$`_source`$`api_result`$`valence`),
           as.numeric(query_features$`_source`$`api_result`$`tempo`) / 300,          # rough rescale to 0..1
           (as.numeric(query_features$`_source`$`api_result`$`loudness`) / 100) * -1 # loudness is negative dB
    )

    # state51 instance ids carry a 4-digit class id suffix; strip it
    typeless_id <- as.numeric(substr(query_features$`_id`, 1, nchar(query_features$`_id`) - 4))

    annoy$addItem(count, v)
    instance_ids[[count]] <- typeless_id
  }
  res <- scroll(res$`_scroll_id`)
  hits <- length(res$hits$hits)
}

print("build annoy")
# number of trees
annoy$build(50)
annoy$save("/tmp/spotify_audio_features.tree")

print("save instance ids")
saveRDS(instance_ids, file = "/tmp/spotify_audio_features_instance_ids.Rdata")

This takes about 7 minutes to run on my MacBook, and the resulting annoy database is 1.4G on disk.

Query the data

Now that we’ve spent the time building the indexes, we can make a separate script which just queries them.

For a given track id, we load the features from elasticsearch and use those to find some neighbours.

library(elastic)
library(RcppAnnoy)

dimensions <- 9

connection <- connect(es_port = 9200, es_host = "es01")

# "Xerrox Monophaser 1"
query_track_id = "33262978360087"

# get the features doc from ES
query_features = docs_get(index='spotify_audio_features', type='spotify_audio_features', id=query_track_id, verbose=F)
spotify_track_id = query_features$`_source`$`api_result`$`id`
spotify_track_url = paste("https://open.spotify.com/track/", spotify_track_id, sep="")

message("Querying track id: ", query_track_id, " ", spotify_track_url)

# get just the bits we need from the ES doc
v <- c(as.numeric(query_features$`_source`$`api_result`$`speechiness`),
       as.numeric(query_features$`_source`$`api_result`$`energy`),
       as.numeric(query_features$`_source`$`api_result`$`danceability`),
       as.numeric(query_features$`_source`$`api_result`$`liveness`),
       as.numeric(query_features$`_source`$`api_result`$`acousticness`),
       as.numeric(query_features$`_source`$`api_result`$`instrumentalness`),
       as.numeric(query_features$`_source`$`api_result`$`valence`),
       as.numeric(query_features$`_source`$`api_result`$`tempo`) / 300,
       (as.numeric(query_features$`_source`$`api_result`$`loudness`) / 100) * -1
)

# load the annoy up
annoy <- new(AnnoyEuclidean, dimensions)
annoy$load("/tmp/spotify_audio_features.tree")

# and the instance ids map
instance_ids <- readRDS("/tmp/spotify_audio_features_instance_ids.Rdata")

# and find some neighbours
# and find some neighbours (ask for 2: the query track itself plus one neighbour)
results <- annoy$getNNsByVector(v, 2)

# the first result seems to be the track we queried with, so treat its
# annoy index as the "original" for distance calculations
query_index <- results[1]

for (result in results) {
  # state51 instance ids have a class id on the end, track is 0087
  typeless_instance_id <- instance_ids[[result]]
  typed_instance_id <- paste(typeless_instance_id, "0087", sep = "")

  # skip the query track itself
  if (typed_instance_id == query_track_id) {
    next
  }

  distance <- annoy$getDistance(query_index, result)

  result_features <- docs_get(index = 'spotify_audio_features', type = 'spotify_audio_features', id = typed_instance_id, verbose = F)
  spotify_track_id <- result_features$`_source`$`api_result`$`id`
  spotify_track_url <- paste("https://open.spotify.com/track/", spotify_track_id, sep = "")

  message(typed_instance_id, " ", distance, " ", spotify_track_url)
}

It’s pretty swift!

Using this Alva Noto track as a query, we get this relaxing track back as its nearest neighbour, sharing these features.

Tradeoffs and tweakables

Initially, with 50 trees and 2 million records of 9 dimensions, the annoy file on disk is 1.4G.

I found myself a test track, and ran a simple query.

Double the number of trees to 100 and the file on disk is now 2.8G; the results are the same as the original.

Going up to 500 trees, the file is now 13G, but the results of the query are still the same.

If we increase search_k, the number of nodes annoy inspects at query time:

annoy$getNNsByVectorList(v, 5, (500 * 5 * 20000), F)

we make the query much slower, but it doesn’t change the results.

It’s not much of a test, but I’ll revert to the defaults as they look fine to me.
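If I wanted a slightly less hand-wavy test, timing the query at different search_k values would be the next step. A sketch, assuming the annoy index and query vector `v` from the script above are already loaded:

```
# compare the default search_k (-1) against a much larger search budget
for (search_k in c(-1, 500 * 5 * 20000)) {
  timing <- system.time(
    res <- annoy$getNNsByVectorList(v, 5, search_k, FALSE)
  )
  message("search_k = ", search_k, " elapsed = ", timing[["elapsed"]], "s")
}
```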