Reworking the Database, Library, and Syncing
After switching to diesel, learning new async tricks, and embracing bare functions, the database and sync code of Recordbox is now better than ever.
In the current release cycle for v0.11.0, one of the two primary focuses has been cleaning up and reorganizing the database: not only to make it more readable and maintainable, but also to attempt to fix issues that can cause various albums and tracks to be categorized incorrectly.
Since Recordbox went public, various issues with the library have been reported; some have been fixed, and solutions for the rest are still ongoing (and hopefully addressed by these changes). Of the ones that have been fixed, several were due to bugs in SQL statements and weird organization problems, which led me to want to throw out all my janky SQL and replace it with the diesel ORM.
The Database #
What ended up being the main advantage of diesel over handwritten SQL is that diesel all but requires a proper database schema in order to be usable. I discovered this quickly during my first attempt to port over to it, when I was trying to reproduce my query to retrieve the albums from the database.
The Old #
Now, for context, the original database schema for Recordbox, which has been used in some form until its upcoming replacement, only really stored tracks. Basically, the parser read the metadata out of a track and stored it directly in the database as-is. So there was no albums table. The only other tables were for artists (and later genres), because there could be more than one artist/genre on a track, and for covers, because their modified times needed to be tracked separately from tracks to know when to update them.
This meant that the query to retrieve albums was a monstrosity of unmitigated proportions:
SELECT
    DISTINCT t.album AS title,
    t.albumartist,
    (SELECT sort FROM artists a WHERE a.name = t.albumartist LIMIT 1) AS albumartist_sort,
    SUM(t.length) AS length,
    t.date,
    (SELECT DISTINCT path FROM covers c WHERE same_dir(c.path, t.path) LIMIT 1) AS cover,
    COUNT(t.path) AS num_tracks,
    (SELECT json_group_array(name)
        FROM (SELECT DISTINCT a.name
            FROM artists a JOIN tracks ON a.path = tracks.path
            WHERE t.album = tracks.album AND t.albumartist = tracks.albumartist)) AS artists,
    (SELECT json_group_array(name)
        FROM (SELECT DISTINCT g.name FROM genres g WHERE g.path = t.path)) AS genres
FROM
    tracks t
{} -- WHERE clause must be filled by format! macro
GROUP BY
    t.album,
    t.albumartist,
    t.date;
I don’t think I have to explain why this is nuts, do I?
Anyway, yeah this was a mess.
Now, reconstructing this query with diesel would have been ridiculous, as I quickly discovered. So I did the only thing I could do—write an actual schema that included albums in it.
Making it Better (By Force) #
With a proper database schema, it meant that the queries to retrieve things from the database could be significantly improved, and since I was using diesel, they were now generated for me. And in general, the data was organized in a more representative way in regard to the actual structure of the content. As a result, I have noticed the size of the database is actually smaller than it used to be, and I’m pretty sure load times are improved.
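To give a sense of what that looks like in diesel, here is a minimal sketch of a first-class albums table and its relationship to tracks, written with diesel's table! macro. The table and column names are illustrative; Recordbox's actual schema differs:

diesel::table! {
    albums (id) {
        id -> Integer,
        title -> Text,
        albumartist -> Text,
        date -> Text,
    }
}

diesel::table! {
    tracks (path) {
        path -> Text,
        title -> Text,
        album_id -> Integer,
        length -> Integer,
    }
}

// Lets diesel generate joins between the two tables.
diesel::joinable!(tracks -> albums (album_id));
diesel::allow_tables_to_appear_in_same_query!(albums, tracks);

With something like this in place, the relationship between tracks and albums is a thing diesel knows about, instead of something reconstructed inside one giant query.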
Bugfixing should be much easier now too, because the overall complexity of interacting with the database is now much smaller.
Another advantage of diesel is that the database model is embedded in the code and queries are built on top of structs. This has helped ensure that the correct types are retrieved. More generally, the fact that the queries are abstracted over by strongly-typed Rust code has meant that a certain class of bugs that used to occur—where some data would come out malformed due to special characters in strings that I didn't handle correctly in a format string somewhere—was eliminated entirely.
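To illustrate what "queries built on top of structs" means, here is a hedged sketch of loading albums as a typed struct, assuming the illustrative schema above lives in a schema module (again, not Recordbox's real code):

use diesel::prelude::*;

// Field order must match the schema's column order for derive(Queryable).
#[derive(Queryable)]
struct Album {
    id: i32,
    title: String,
    albumartist: String,
    date: String,
}

fn load_albums(conn: &mut SqliteConnection) -> QueryResult<Vec<Album>> {
    use crate::schema::albums::dsl::*;
    // No format! strings and no manual escaping: diesel builds and binds
    // the query, and the row types are checked at compile time.
    albums.order(title.asc()).load::<Album>(conn)
}

Compare this to the hand-built query from earlier: a special character in a title can't corrupt anything here, because nothing is ever spliced into a SQL string.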
With this being said, it's not like the change to diesel has revolutionized the codebase and eliminated enormous quantities of bugs; the bugs it would fix and prevent were (for the most part) already caught and fixed. What the switch has done, however, is ensure that these bugs can't happen, so it's one less thing I have to think about. Now if there are bugs in the library code, I can be a lot more confident that it isn't a problem with the queries or database internals, and instead focus on problems with the actual structure and organization of the contents.
The Parsing #
One of the classes of bugs that has not been resolved yet is the one where music is organized incorrectly. There are two potential places where these can occur: the code which loads the music into the UI, and the code for parsing and storing metadata in the database.
Organizing music based on metadata is—shocker—very complicated. This is because metadata is not very standardized, and there are lots of different kinds of music. What I have tried to do in Recordbox is organize music in such a way that all tracks are uniquely identified by their file path and belong to an album. Albums are then uniquely identified by a combination of their title, primary artist, and date. This poses an immediate problem: figuring out what the primary artist is.
You cannot just use the artist tag in the metadata, because some albums have featured artists on some tracks who are not the primary artist, and treating them as such would split the album. The albumartist tag exists for this purpose, but it is not as commonly used and is often not present at all. So what do you do?
What I did, and still do, is use albumartist as the primary artist if it exists. If not, fall back to finding an artist tag that is the same on every track on the album. If one is found, use that. If not, the album is assumed to be a compilation if it has any artist tags at all, and is marked as having an unknown artist otherwise.
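Written out as code, that fallback chain looks roughly like the sketch below. This is an illustration of the logic as described, not Recordbox's actual implementation, and the types and field names are hypothetical:

// Hypothetical metadata type with optional tags, for illustration.
struct Metadata {
    artist: Option<String>,
    albumartist: Option<String>,
}

enum AlbumArtist {
    Artist(String), // an explicit or inferred primary artist
    Compilation,    // artist tags exist, but no single common one
    Unknown,        // no artist tags at all
}

fn primary_artist(tracks: &[Metadata]) -> AlbumArtist {
    // 1. Prefer an explicit albumartist tag if any track has one.
    if let Some(aa) = tracks.iter().find_map(|t| t.albumartist.clone()) {
        return AlbumArtist::Artist(aa);
    }
    // 2. Fall back to an artist tag shared by every track.
    match tracks.iter().find_map(|t| t.artist.clone()) {
        Some(a) if tracks.iter().all(|t| t.artist.as_deref() == Some(a.as_str())) => {
            AlbumArtist::Artist(a)
        }
        // 3. Artist tags exist but differ: assume a compilation.
        Some(_) => AlbumArtist::Compilation,
        // 4. No artist tags anywhere: unknown artist.
        None => AlbumArtist::Unknown,
    }
}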
Another issue arises here: how do you determine which tracks are part of an album in the first place? Since an album is identified by its title, primary artist, and date, you can't identify it without the primary artist. Well, what I used to do was consider all tracks in a directory to be an album. This meant that, in a case where an albumartist tag wasn't present, Recordbox would attempt to determine one from the artist tags of the tracks in the directory currently being parsed.
So Recordbox would parse tracks on a per-directory basis, where a whole directory would be parsed at once, any missing information inferred from there, and then all that would be put in the database.
However, the old per-directory parser was slow, because it was not multithreaded at all. A few updates back, I made the parsing multithreaded, which greatly improved the speed but sacrificed the directory-based parsing. Instead, every track was parsed individually in a task assigned to a thread pool.
How were primary artists determined after this, you ask? Well, with an enormous SQL query.
Why it Was Dropped #
Honestly, if I remember correctly, I removed the directory-based parsing mostly because I thought the SQL approach would be better, and also because I was getting frustrated trying to figure out how to multithread the directory parsing. I feel like I could have just assigned directory-parsing tasks to a thread pool and sent back lists of parsed tracks over a channel to the database thread, but IDK. It was a while back; I don't quite remember.
Either way, up until this upcoming release, the sync code has worked by figuring out which files were modified, and assigning a task to parse each one to a thread pool. Another thread is also spawned to write to the database, and the parsing tasks send the parsed metadata to the database thread using an async channel. This meant that each parsing task only knew the context of the single track it was parsing, and any decisions about how to update and store values like the primary artist could only be made after it was put in the database.
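In sketch form, that architecture looked something like the following. All the names here are illustrative rather than Recordbox's actual code, and I've written it against smol-style APIs for consistency with the rest of this post:

// A sketch of the channel-based flow described above.
async fn sync_with_channels(modified_files: Vec<PathBuf>, mut conn: DbConnection) {
    let (tx, rx) = smol::channel::unbounded::<Metadata>();

    // One parsing task per modified file.
    for path in modified_files {
        let tx = tx.clone();
        smol::spawn(async move {
            // Parse on a separate thread, then send the result over the channel.
            if let Some(meta) = smol::unblock(move || parse_metadata(&path)).await {
                let _ = tx.send(meta).await;
            }
        })
        .detach();
    }
    // Drop the original sender so the channel closes once every task finishes.
    drop(tx);

    // The database side drains the channel as results arrive.
    while let Ok(meta) = rx.recv().await {
        save_metadata(&mut conn, &meta);
    }
}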
What the Problem (Probably) Is #
So there have been some issues where tracks are put in albums they shouldn't be, some albums are merged, etc. I think the problem is that the way albums are inferred when a primary artist can't be found isn't working right, and is including things that shouldn't be part of the same album.
My hope is that by returning to directory-based parsing, these bugs can be fixed, because albums won't be erroneously merged together if they are in separate directories. Hopefully. I'm going to put out a beta release, and I have a feeling the universe is going to be unkind and this wasn't the problem at all. But even then, it should be easier to fix with directory-based parsing, just because of the additional context the directory provides for determining how to organize things.
Fixing It With Cool Async Stuff #
When I decided to go back to directory-based parsing, I was originally planning to do something like assign a task to parse each directory to a thread pool and use channels to send the results to the database thread, like with the track-based parsing. But I didn't really like this, because each track would still be parsed sequentially within its task.
Then I discovered JoinSets. A JoinSet, provided by tokio (or a FutureGroup if using smol, which I would later switch to), is a collection of futures. You can put a bunch of futures that all resolve to the same type T into one, and then do something like:
while let Some(result) = set.join_next().await {
    // do something with the result
}
Or you can collect it into a Vec<T> of the futures' results.
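For example, here is a minimal, self-contained FutureGroup sketch showing the collect pattern (FutureGroup comes from the futures-concurrency crate, used here with futures-lite and smol):

use futures_concurrency::future::FutureGroup;
use futures_lite::StreamExt;

fn main() {
    smol::block_on(async {
        let mut group = FutureGroup::new();
        for n in 0..4u32 {
            // All futures in the group resolve to the same type (u32 here).
            group.insert(async move { n * n });
        }
        // Drive every future and collect the results into a Vec.
        // Note: results arrive in completion order, not insertion order.
        let squares: Vec<u32> = group.collect().await;
        println!("{squares:?}");
    });
}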
These are really cool, because they meant I could easily bring back almost the same structure for the new directory-based parsing, while still having each individual track parsed on its own thread without blocking anything.
So the new parse_dir() function looks like this:1
async fn parse_dir<D>(dir: D) -> (Option<PathBuf>, Vec<Metadata>)
where
    D: IntoIterator<Item = DirEntry>,
{
    let mut cover = None;
    let dir = dir.into_iter().map(|d| d.path());
    let futures = dir.filter_map(|path| {
        // Remember the first cover image found; everything else is
        // treated as a track to parse.
        if cover.is_none() && checks::is_cover!(&path) {
            cover = Some(path.clone());
            None
        } else {
            // Each track's metadata is parsed on a separate thread.
            Some(smol::unblock(move || parse_metadata(&path)))
        }
    });
    let tasks: FutureGroup<_> = futures.collect();
    (cover, tasks.filter_map(|s| s).collect().await)
}
Essentially, it creates a FutureGroup (which is the smol ecosystem's equivalent of tokio's JoinSet; I swapped async runtimes) containing the async tasks smol::unblock(move || parse_metadata(&path)). Each unblock call runs its closure on a thread pool to parse the metadata of the given path, waits for it to complete without blocking, and returns a Metadata type. The FutureGroup contains one of these tasks for each of the tracks in the directory, and when tasks.filter_map(|s| s).collect().await is called, each of the futures in the FutureGroup is run, spawning its parsing thread, and the group resolves to a list of Metadata structs.
Really, the only difference between this code and a purely sequential version is that the sequential one would remove the smol::unblock call, call parse_metadata directly, and collect to a Vec instead of a FutureGroup.
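To make that concrete, here is what that sequential version would look like, derived directly from the function above (it's a sketch, and the same filter_map side-effect caveat from the footnote applies):

fn parse_dir_sequential<D>(dir: D) -> (Option<PathBuf>, Vec<Metadata>)
where
    D: IntoIterator<Item = DirEntry>,
{
    let mut cover = None;
    let metadata = dir
        .into_iter()
        .map(|d| d.path())
        .filter_map(|path| {
            if cover.is_none() && checks::is_cover!(&path) {
                cover = Some(path.clone());
                None
            } else {
                // No unblock: parse inline on the current thread.
                parse_metadata(&path)
            }
        })
        .collect();
    (cover, metadata)
}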
And this doesn’t only apply to the individual directories, each directory that is parsed with parse_dir is done with a FutureGroup as well; since parse_dir is async, it can be put in a FutureGroup.
Overall, this means the code is a lot cleaner, and it has the added benefit of eliminating the overhead of sending the parsed metadata between threads using channels, because the metadata can just be collected out of the FutureGroup.
What this also means is that, other than the unblock calls to spawn threads to parse the tracks (which are necessary because reading files with lofty cannot currently be made async), all the parsing/sync code runs in one thread.
Furthermore, it’s not even spawned on the global async executor. Since the FutureGroup can either be collected to a Vec or iterated over as the contained futures complete, all the parsed values can be returned directly in the sync function, and the entire parsing future can be executed using block_on in the database thread.
That is, this is the top-level of the library sync code:
let tasks: FutureGroup<_> = to_parse.into_iter().map(parser::parse).collect();
transaction!(conn, |conn| {
    smol::block_on(tasks.for_each(|(cover, metadata)| {
        write::save_metadata(conn, cover.as_ref(), &metadata);
        send!(sender, ProgressUpdate::Progress);
    }));
    Ok(())
});
The transaction! macro here just reduces the boilerplate of beginning a transaction on a connection obtained from the deadpool-diesel connection pool for the database. That is, the database connection pool already has a thread for each open database connection, and with block_on, the entire parsing job can be run inside the transaction that is already needed to write to the database, completely eliminating the need for a channel to send the data over.2
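For reference, diesel's own transaction API, which a macro like this would presumably wrap, looks like the following sketch. The real transaction! macro is Recordbox's own, so this is just an assumption about its general shape:

use diesel::prelude::*;

fn save_all(conn: &mut SqliteConnection) -> QueryResult<()> {
    // Everything inside the closure runs in one transaction; returning
    // Err rolls it back, returning Ok commits it.
    conn.transaction::<_, diesel::result::Error, _>(|conn| {
        // ... run all the inserts/updates here ...
        Ok(())
    })
}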
Like with the switch to diesel, this change has greatly simplified the code and made it much easier to read. This time it's because of what I previously mentioned: the async code in this context is structured pretty much exactly the same way the equivalent sequential code would be.
Another, smaller change has contributed to the readability improvements as well, and it's one I have also started to apply elsewhere.
Just Use Functions #
The inspiration for this change was quite literally some random comment somewhere online that I read in passing, about how in Rust you don’t need to be scared of bare functions and that everything doesn’t have to be encapsulated in a struct.
Well, that random comment stuck with me for some reason, and when I was working on applying the changes discussed previously, I ended up changing the sync code to use exclusively bare functions.
Previously, I had a LibraryManager struct which did the parsing and sync operations. It pretty much only stored the root path of the library to scan, the database, and the sender side of an async channel to report progress updates to the UI. It had one public function called sync.
Ever since I created it, I felt like LibraryManager was a stupid name, because it didn't really manage the library; it just did the syncing and parsing. LibraryParser was another idea, but I didn't like that either. Now I realize the reason the name never fit is that the entire struct was pointless. It wasn't even persistent! Every time a sync needed to be run, the MusicLibrary would create a LibraryManager, run the sync function, and then drop it immediately after it completed.
So I deleted the struct and just passed the database/library root/sender into the sync function.
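The shape of that change, sketched with illustrative placeholder types rather than Recordbox's real ones:

use std::path::{Path, PathBuf};

// Illustrative placeholder types for the sketch.
struct DatabasePool;
struct Sender<T>(std::marker::PhantomData<T>);
enum ProgressUpdate { Progress }

// Before: a struct created, used once, and immediately dropped.
struct LibraryManager {
    root: PathBuf,
    db: DatabasePool,
    sender: Sender<ProgressUpdate>,
}

impl LibraryManager {
    fn sync(&self) {
        // scan self.root, parse, write to self.db, report via self.sender...
    }
}

// After: the same data passed straight into a bare function.
fn sync(root: &Path, db: &DatabasePool, sender: &Sender<ProgressUpdate>) {
    // same work, no throwaway state to construct and drop
}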
Now that I’ve done it, it makes a lot of sense to me, as “syncing the library” is a process, not a persistent state. Structs are for states, functions are for processes. If there is no meaningful state that needs to persist, you don’t need a struct. In the case of the syncing, this was clear: the entire LibraryManager struct was being dropped immediately after a sync anyway.
Now this doesn’t bring any real performance improvements, but it does (again, this is a theme now) result in cleaner code. And it led me to apply this strategy to the lyrics saving process as well in my attempts to finalize the synced lyrics code, which in that case did lead to improvements other than cosmetic ones, as it fixed a bunch of bugs that resulted from the overly complicated way I was doing it before.
Tying It Together #
So, to bring this to a close: the database, library sync, and parsing code is now much cleaner and a lot more logically organized. This has brought some minor performance improvements from reduced overhead in some cases, and should hopefully fix at least some of the problems with incorrectly organized music, by making it possible to more accurately determine the full context of the music being parsed. And if it doesn't fix them all, the ones that aren't fixed will be much easier to work on, as there are now significantly fewer potential points of failure to debug.
A common theme among this post and at least the previous one about Recordbox, which astute readers have no doubt noticed, is that a lot of the code was a huge mess—a raging dumpster fire, as I believe I put it in the previous post. While "raging dumpster fire" is a bit of an exaggeration, I was definitely working under a "work fast, write garbage" philosophy for a while, where I didn't really slow down to keep code quality up. For most of the code this actually wasn't a problem; I think the code around the player controls and music playback is in pretty good shape, and I haven't had to touch it for a while now. But the database and play queue were the two places where it was most evident—largely due to how many times they have been revised to attempt to rein in the complexity.
However, with the upcoming release of 0.11.0, I think I can now say that the code is, if not pristine (an unobtainable goal anyway), at least clean. And by clean, I mean that I could probably not look at the code for months, come back, and still pretty much know how it all works. Whether anyone else knows what it's doing, IDK.
Regardless, I hope to be committing more time to Recordbox again soon. With my recent graduation from university and starting a new job, I have not had as much time as I was hoping to work on it, but I should have more room in my schedule soon. Not much is left to do before I can put out a beta release, and the main reason I'm planning a beta is just to make sure the new database works the same as the old one and doesn't cause any new regressions. If nothing comes up, it could be released pretty much as-is as 0.11.0, once I get the synced lyrics polished up. I'm looking forward to getting that out; I think it'll be a real killer feature, as to my knowledge not many other players available on Linux have it.
Anyway, onward to 0.11.0!
1. Ignore the side effect of assigning a value to cover in a filter_map closure. I know this is bad form; I keep forgetting to replace it with a for loop. ↩︎
2. Which is how it used to work before: instead of the tasks.for_each(...), there would have been something like while let Some(metadata) = receiver.recv().await, with the metadata being sent from each of the individual parsing threads. ↩︎