This past week I spent time wrestling with a particularly difficult MongoDB aggregation. For my own sake, I figured I should document what I learned.

Imagine you have a set of users working through various workflows. The users’ current workflow state is held in a subdocument whos name relates to the name of their active workflow. For example, the first two users below are currently in workflow “A”, while the third user is in workflow “B”:

/* 0 */
{
  _id: "...",
  workflow: {
    A: {
      pos: 3
    }
  }
}

/* 1 */
{
  _id: "...",
  workflow: {
    A: {
      pos: 1
    }
  }
}

/* 2 */
{
  _id: "...",
  workflow: {
    B: {
      pos: 4
    }
  }
}

My goal is to find out how many users are currently in workflow “A”, and how many are in workflow “B”. Without further ado, here’s the aggregate query I ended up with:

db.users.aggregate([
    {
        $group: {
            _id: null,
            in_workflow_a: {$sum: {$cond: [{$gte: ['$workflow.A, null]}, 1, 0]}},
            in_workflow_b: {$sum: {$cond: [{$gte: ['$workflow.B, null]}, 1, 0]}}
        }
    }
])

Let’s break it down piece by piece.

Since I’m interested the states of all users in the system, I’m not using a $match block. If you wish to you limit your query to a specific subset of users, you could do that filter in an initial $match:

db.users.aggregate([
    {
        $match: {
            _id: {
                $in: [
                    user Ids...
                ]
            }
        }
    },
    ...
])

The first thing you’ll notice about the $group block is that we’re specifying a null _id. This use of _id is fairly well documented, and essentially means we’ll group all of the documents we’re matching over into a single result document.

The next bit is the interesting part. We’re defining two fields that will appear in our result document: in_workflow_a, and in_workflow_b. We build up each of these fields with a $sum. We want to add 1 to these fields if the respective workflow subdocument exists, and 0 if it does not. We accomplish this by using the $cond aggregation operator.

The way we’re using $cond is fairly interesting. My first attempts at writting this aggregation used $exists within the $cond, but Mongo was unhappy about that:

uncaught exception: aggregate failed: {
  "errmsg" : "exception: dotted field names are only allowed at the top level",
  "code" : 16405,
  "ok" : 0
}

David Weldon led me out of the darkness with an interesting comparison trick. By using $gte to compare the potentially non-existant workflow object against null, we can easily determine whether it exists or not. When Mongo compares values of different types, it uses this comparison order. As you can see, aside from the internally used MinKey type, every other BSON type is considered canonically greater than null. This means that an empty or null field will return false, and any other types or values will return true!

The final result of our aggregation looks something like this:

/* 0 */
{
    "result" : [ 
        {
            "_id" : null,
            "in_workflow_a" : 2,
            "in_workflow_b" : 1
        }
    ],
    "ok" : 1
}

Victory! The aggregation loops through all users in the system, incrementing the counter for each users current workflow, and finally returns the total counts.