smosa/adam
/code-and-technology
/


Calculating "trending" is hard

My favorite kinds of problems are those that seem so simple to solve, yet show their complexity only when you actually go and try to solve them.

Problem: Pretend we have a simple reddit-like board with many posts. Each post has a certain number of votes and new votes and posts keep coming in as time goes on. You want to show your users, at any given time, what posts have been trending in the last 24 hours based on those votes. For simplicity's sake, let's say there are only upvotes.

This is not a novel problem. Apps do this all the time, and plenty of lightweight statistical models exist to calculate this. So here's the twist.

Twist Your dataset of posts and votes is huge. You don't want to do something like z-scoring or the like because it requires you to interact with every single post and constantly calculate averages and each post's ranking relative to that average.

Instead, you want to use simple sorting and some memoized data that is saved only when a vote is made. Any other loops over the data should be very small and should not increase in performance needs as more data is added to the set.

The information we do have

So let's say we have some very simple model of posts.

class Post {
  constructor() {
    this.votes = [];
    this.id = Math.floor(Math.random() * (10 * 20));
  }
}

And posts get votes.

class Vote {
  constructor() {
    this.time = Date.now();
  }
}
class Post {
  constructor() {
    this.votes = [];
    this.id = Math.floor(Math.random() * (10 * 20));
  }

  vote() {
    this.votes.push(new Vote());
  }
}

Memoization

And to make our lives easier, we'll have the act of voting save some data we can use in place of the query-time calculations we're trying to avoid.

In a more realistic model, we'd never remove votes from the count, but since this is all about ranking trending popularity, we'll keep it simple.

Note we're calling this "static trending" because votes are only pushed out of that 24 hour window when a vote happens. No cron jobs here to update every single post every 24 hours.

class Post {
    constructor() {
        this.votes = [];
        this.last_voted_on = undefined;
        this.id = Math.floor(Math.random() * (10 ** 20));
        this.static_trending = 0;
    }

    vote() {
        this.votes.push(new Vote());
        this.last_voted_on = this.votes[this.votes.length - 1].time;
        this.update_static_trending();
    }

    update_static_trending() {
        this.expire_votes();
        this.static_trending = this.votes.length;
    }

    expire_votes() {
        this.votes = this.votes.filter(vote => !vote.older_than_24_hours);
    }
}
class Vote {
    constructor(time = new Date()) {
        this.time = time;
    }

    get older_than_24_hours() {
        return (new Date() - this.time) > 86400000;
    }
}

Now let's use what we have and keep this as simple as possible.

Attempt #1

Let's just return our static trending data and see what happens.

So we make a tool to help us calculate the top q number of trending posts. This will just sort everything by static trending and shave it down to no more than q items.

class PopularityCalculator {
    constructor(posts) {
        this.posts = [...posts];
    }

    trending_by_quantity(q = 5) {
        return this.posts.sort((a, b) => {
            return b.static_trending - a.static_trending;
        }).slice(0, q);
    }
}

And we just need to make sure we save the post to the popularity calculator whenever they are made.

// Make an array 10 items long of new posts.
let posts = new Array(10).fill(new Post());
let pc = new PopularityChecker(posts);
pc.trending_by_quantity();

This isn't great test data, so let's say we have some fixtures that have ten posts, in the reverse order we want them to be in. Will this function grab the last 5 in reverse order?

test('Top five out of ten trending posts correctly identified', () => {
    let posts = [];
    let x = 0;
    // Make these in the wrong order to start.
    // Least to most popular.
    while (x < 10) {
        posts.push(fixtures.trending_by_level[x]);
        x++;
    }

    const popularity_calculator = new PopularityCalculator(posts);
    const trending_ids = popularity_calculator.trending_by_quantity().map(p => [p.id, p.static_trending]);
    const expected_trending_post_ids = [
        posts[9], posts[8], posts[7], posts[6], posts[5]
    ].map(p => [p.id, p.static_trending]);

    expect(trending_ids).toEqual(expected_trending_post_ids);
});

Test results: 1 passed, 1 total

Problem

Imagine a post from five years ago was so popular, it was getting machine gun fire posts all day.

super_popular_post_5y_ago.static_trending = 999

But after that glory day, there was not a single vote ever again. That means, there was no vote outside of the 24 hour window that would have triggered it to expire its old votes.

Unless we find five other posts with higher static trending scores, this will incorrectly float to the top of posts made popular very recently.

We can see that in this failing test. It's the same as the first one but we added a line in the middle to make the highest rated post extremely old.

test('Top five recently trending posts does not include very old popular post', () => {
    let posts = [];
    let x = 0;
    // Make these in the wrong order to start.
    // Least to most popular.
    while (x < 10) {
        posts.push(fixtures.trending_by_level[x]);
        x++;
    }

    // Take the highest static trending post and make it very old.
    // This should bump down its dynamic trending value abysmally.
    let a_long_time_ago = new Date(0);
    posts[9].last_voted_on = a_long_time_ago;
    posts[9].votes = posts[9].votes.map(v => new Vote(a_long_time_ago));

    const popularity_calculator = new PopularityCalculator(posts);
    const trending_ids = popularity_calculator.trending_by_quantity().map(p => [p.id, p.static_trending]);
    const expected_trending_post_ids = [
        posts[8], posts[7], posts[6], posts[5], posts[4]
    ].map(p => [p.id, p.static_trending]);

    expect(trending_ids).toEqual(expected_trending_post_ids);
});

Test results: 1 passed, 1 failed, 2 total

Attempt #2

It feels like we're almost there. The nice thing about the previous approach is you just have a small amount of data you need to look at which is q or in our case the default 5.

What if each time we computed the top 5 static trending, we ran a small query just on those to see if any were expired? Then for any we found, we updated only those, and kept this going until all top 5 static trending posts were verified nonexpired?

We're looking for any votes that are outside of 24 hours. The votes are also ordered first-come, first-serve so if the oldest vote is outside the window, that's enough to go on.

class Post {
    // ...

    needs_update() {
        const oldest_vote = this.votes[0];
        return this.votes.length && oldest_vote.older_than_24_hours;
    }
}

Now we'll use this function in the popularity calculation to recurse until we have all freshly updated static trending posts.

First, we'll handle the happy path where nothing needs updating.

class PopularityCalculator {
    constructor(posts) {
        this.posts = [...posts];
    }

    trending_by_quantity(q = 5) {
        let posts = this.posts.sort((a, b) => {
            return b.static_trending - a.static_trending;
        }).slice(0, q);

        let posts_needing_updates = posts.filter(p => p.needs_update);
        if (!posts_needing_updates.length) return posts;
    }
}

Now we add the recursion. Also note we added this get_post_by_id(id) function as we're being careful not to mutate the original this.posts array unless it's necessary.

class PopularityCalculator {
    constructor(posts) {
        this.posts = [...posts];
    }

    trending_by_quantity(q = 5) {
        const grab_top_5_trending = () => {
            let posts = [...this.posts].sort((a, b) => {
                return b.static_trending - a.static_trending;
            }).slice(0, q);

            let posts_needing_updates = posts
                .filter(p => p.needs_update)
                .map(p => this.get_post_by_id(p.id));

            if (!posts_needing_updates.length) return posts;

            posts_needing_updates.forEach(p => p.update_static_trending());
            return grab_top_5_trending();
        }
        return grab_top_5_trending();
    }

    get_post_by_id(id) {
        // Return the actual post object by reference
        return this.posts.filter(p => p.id === id)[0];
    }
}

Test results: 2 passed, 2 total

Problem

So far we're being a little arrogant here to think that there's enough traffic to our application that there will be at least 5 posts that have received votes in the last 24 hours. A popular app like Reddit might not have this problem but for us, who's to say the traffic doesn't slump or we have a holiday break that impacts traffic?

So if all our posts haven't been voted on since 1970 (I choose this date because the code is easy), the votes will just be cleared out, making the static trending scores just 0.

test('Top five recently trending posts includes old posts if there are only old posts.', () => {
    let posts = [];
    let x = 0;
    // Make these in the wrong order to start.
    // Least to most popular.
    while (x < 10) {
        posts.push(fixtures.trending_by_level[x]);
        x++;
    }

    // Take the highest static trending post and make it very old.
    // This should bump down its dynamic trending value abysmally.
    let a_long_time_ago = new Date(0);
    posts.forEach(post => {
        post.last_voted_on = a_long_time_ago;
        post.votes = post.votes.map(v => new Vote(a_long_time_ago));
        post.update_static_trending();
    });

    const popularity_calculator = new PopularityCalculator(posts);
    const trending_ids = popularity_calculator.trending_by_quantity().map(p => [p.id, p.static_trending]);
    const expected_trending_post_ids = [
        posts[9], posts[8], posts[7], posts[6], posts[5]
    ].map(p => [p.id, p.static_trending]);

    expect(trending_ids).toEqual(expected_trending_post_ids);
});

Test results: 1 failed, 2 passed, 3 total

Attempt #3

What does it mean for a post to have a static trending score of 0? This should mean it's a new post with no votes on it. When we think about old but previously popular posts, we'll realize we haven't fully defined what it means to be "trending."

My instinct is that votes should never drop out of factoring into trendi status completely, but they should degrade in value without ever getting to zero. So we need to think about logarithmically scaling down the vote value.

So let's take a look at the Vote model and have it compute a weighted value. Currently, it's implicitly weighting itself as 1 if it's in the post's data or 0 if it's not.

To make this logarithmic, we'll use the simplest equation, y = 1/x making sure x > 0; This also works out because it makes sure fresh votes count as 1.

class Vote {
    // For testing, allow me to set this manually.
    constructor(time = new Date()) {
        this.time = time;
    }

    get days_old() {
        return Math.floor((new Date() - this.time) / 86400000);
    }

    get weighted_value() {
        return 1 / (this.days_old + 1);
    }

    get older_than_24_hours() {
        return this.days_old >= 1;
    }
}

Then we need to adjust Post to calculate this value.

class Post {
    // ...

    update_static_trending() {
        this.static_trending = this.votes.reduce((p, n) => p + n.weighted_value, 0);
    }

    // ...
}

We're also removing this.expire_votes(). With this strategy, we need to keep votes forever.

But we're still not done because we need to rethink the recursion we wrote earlier that makes cleverly looks through the top 5 static trending to make sure none of them are in need of an update.

Previously we could just look at the votes as timestamps, so let's just use a basic last_updated property on the post itself.

Here's what Post looks like now

class Post {
    constructor() {
        this.votes = [];
        this.last_updated = Date.now();
        this.last_voted_on = undefined;
        this.id = Math.floor(Math.random() * (10 ** 20));
        this.static_trending = 0;
    }

    vote() {
        this.touch();
        this.votes.push(new Vote());
        this.last_voted_on = this.votes[this.votes.length - 1].time;
        this.update_static_trending();
    }

    touch() {
        this.last_updated = Date.now();
    }

    update_static_trending() {
        this.touch();
        this.static_trending = this.votes.reduce((p, n) => p + n.weighted_value, 0);
    }

    get older_than_24_hours() {
        return this.last_updated < Date.now() - 86400000;
    }

    get needs_update() {
        return this.votes.length && this.older_than_24_hours;
    }
}

We also need to update our last two tests so we're mocking up old data correctly, such as adding this last_updated piece.

    posts.forEach(post => {
        post.last_voted_on = a_long_time_ago;
        post.last_updated = a_long_time_ago;
        post.votes = post.votes.map(v => new Vote(a_long_time_ago));
        post.update_static_trending();
    });

Testing results: 3 passed, 3 total

For good measure, I'll add another test to flatly look at a bunch of posts that are identical in popularity but just aging by a day each.

test("Everything else being equal, posts a day apart degrade in trending status", () => {
    // Make 10 identical posts with three same-day votes each.
    let posts = [];
    let x = 0;
    while (x < 10) {
        posts.push(fixtures.trending_by_level[3]);
        x++;
    }

    const age_date_by_days = (date, days = 0) => {
        const one_day_in_ms = 24 * 60 * 60 * 1000;
        return date - (one_day_in_ms * days);
    }

    // Make each 1 day older than the previous one, including their votes.
    // Makes the array oldest to newest.
    posts.forEach((post, index) => {
        const new_date = age_date_by_days(post.last_voted_on, 10 - index);
        post.last_voted_on = new_date;
        post.last_updated = new_date;
        post.votes = post.votes.map(v => new Vote(new_date));
        post.update_static_trending();
    });

    const expected_trending_post_ids = [
        posts[9], posts[8], posts[7], posts[6], posts[5]
    ].map(p => [p.id, p.static_trending]);

    const popularity_calculator = new PopularityCalculator(posts);
    const trending_ids = popularity_calculator.trending_by_quantity().map(p => [p.id, p.static_trending]);

    expect(trending_ids).toEqual(expected_trending_post_ids);
});

Problem

And what if our site is popular? We are going to use up a lot of database rows if we have to track every vote. Maybe we don't want to do that.

What if instead we stored the vote count as an integer separately from the votes themselves? We could then revisit this idea of expiring votes. However, this time we'll do it based on a specified maximum number rather than expiring based on vote age.

test("Posts have a maximum number of votes to be considered (100) and the oldest are retired first", () => {
    // Make a post with 100 votes.
    const max = 100;
    const post = new Post();
    let x = 0;
    while (x < max) {
        post.vote();
        x++;
    }

    // Make the first vote the oldest one.
    let a_long_time_ago = new Date(0);
    let yesterday = new Date() - (24 * 60 * 60 * 1000);
    post.votes[0] = new Vote(a_long_time_ago);
    post.votes[1] = new Vote(yesterday);
    post.update_static_trending();

    // Now vote one more than the maximum we will hold as raw votes.
    post.vote();

    // Regardless of our efficiency in shoving off old records, the compressed voting
    // score is just a simple number persisted to the record. That should keep incrementing.
    expect(post.compressed_vote_count).toEqual(max + 1);
    expect(post.votes.length).toEqual(max);
    expect(post.votes[0].time).toEqual(yesterday);
});

Testing results: 1 failing, 3 passed, 4 total

Attempt #4

Solving this is easy. Here are the key places that changed

class Post {
    constructor() {
        // ...
        this.compressed_vote_count = 0;
        this.maximum_raw_votes = 100;
    }

    vote() {
        this.touch();
        this.votes.push(new Vote());
        this.last_voted_on = this.votes[this.votes.length - 1].time;
        this.compressed_vote_count += 1;
        this.update_static_trending();
    }

    reduce_votes_to_maximum() {
        if (this.votes.length > this.maximum_raw_votes) {
            this.votes.shift();
        }
    }

    // ...

    update_static_trending() {
        this.touch();
        this.reduce_votes_to_maximum();
        this.static_trending = this.votes.reduce((p, n) => p + n.weighted_value, 0);
    }

    // ...
}

Now each post contains a rolling history of its full raw votes. That makes me feel better knowing that each post has some containment on the amount of memory and computing it will need. 100 is also a decent number to get a handle on how popular it is.

Testing results: 4 passed, 4 total

Problem

We've been so obsessed with votes and degradation that we forgot about good old fashioned updated_at. What happens when we have two posts that are completely equal and both have the same number of votes, even the votes themselves are at the same time but one post is just a little older?

test("Two posts with the same one vote each, both made today but one is slightly newer", () => {
    const older_post = new Post();
    const newer_post = new Post();

    older_post.vote();
    newer_post.vote();

    const now = new Date();
    const five_minutes_in_ms = (5 * 60 * 1000);
    const five_minutes_earlier = now - five_minutes_in_ms;

    newer_post.last_updated = now
    older_post.last_updated = five_minutes_earlier;

    // Set these up in the reverse expected order.
    let posts = [older_post, newer_post];

    const popularity_calculator = new PopularityCalculator(posts);
    const trending_ids = popularity_calculator.trending_by_quantity().map(p => [p.id, p.static_trending]);

    expect(trending_ids).toEqual([newer_post, older_post].map(p => [p.id, p.static_trending]))
});

Testing results: 1 failing, 4 passed, 5 total

Attempt #5

The solution is just to include a secondary sort, in the event comparing the static trending values doesn't show any difference.

class PopularityCalculator {
    constructor(posts) {
        this.posts = [...posts];
    }

    trending_by_quantity(q = 5) {
        const grab_top_5_trending = () => {
            let posts = [...this.posts].sort((a, b) => {
                return (b.static_trending - a.static_trending || b.last_updated - a.last_updated)
            }).slice(0, q);
  // ...

Conclusion

See the full code


Edit  Delete  Reply