admin

The Census examples are from the MongoDB Documentation and Jay Runkel of MongoDB. The mongodump of his database may be found in our Examples->MongoDB->dbs.

Motivation for the Aggregation Framework

We have all this data, we would like to do some analyses and ad hoc queries.

Remind folks of the Unix Pipeline concept (Thank you Doug McIlroy!)

Basic idea of the MongoDB Aggregation Framework

SLIDE 2

  • A conveyor belt with three “stations”
  • RED, BLUE, and GREEN squares are coming onto the conveyor
  • $match filters out the GREEN squares
  • $project changes the squares into circles
  • $group counts the blue and red circles to determine the ratio of red to blue

Actual aggregate pipelines look like this: SLIDE 3

db.collection.aggregate([ { stage 1}, { stage 2}, {... stage N} ], {options})

Stages employ aggregation operators to perform actions like match and project, and options include things like explain .

Example RDBMS Table and query

SLIDE 4

dirtBikes

make model cc priceNew
Yamaha TTR50 50 1500
Yamaha TTR110 110 2100
Honda CRF50F 50 1500
Honda CRF125 125 2999
Honda CRF125BW 125 3399
Kawasaki KX65 65 3700
SELECT make, COUNT(*) as numModels
FROM dirtBikes
GROUP BY model;

Sample output

make numModels
Yamaha 2
Honda 3
Kawasaki 1

Let’s create a similar mongoDB collection:

SLIDE 5

db.dirtBikes.insertMany([
  { make: "Yamama", model: "TTR50", cc: 50, priceNew: 1500 },
  { make: "Yamama", model: "TTR110", cc: 110, priceNew: 2100 },
  { make: "Honda", model: "CRF50", cc: 50, priceNew: 1500 },
  { make: "Honda", model: "CRF125", cc: 125, priceNew: 3000 },
  { make: "Honda", model: "CRF125BW", cc: 125, priceNew: 3400 },
  { make: "Kawasaki", model: "KX65", cc: 65, priceNew: 3700 },
]);

How would we do a similar query? Using a single stage aggregate pipeline:

Illustrate on board

db.dirtBikes.aggregate([
  {
    $group: {
      _id: "$make",
      num_models: {
        $sum: 1,
      },
    },
  },
]);

results in

{
    "_id": "Kawasaki",
    "num_models": 1
} {
    "_id": "Honda",
    "num_models": 2
} {
    "_id": "Yamama",
    "num_models": 2
}

Each stage consists of a pipeline operator and an expression. Pipelines are always an array of one or more stages. Stages are composed of one or more aggregation operators or expressions. Expressions may take a single argument or an array of arguments.

Here, $group is an operator in the first stage of this aggregation pipeline. Operators always begin with $ and we can generally have as many stages as we want.

Don’t confuse these aggregation operators with query operators.

The $make is a variable used within the aggregation. All variables begin with a $.

Bookmark the Aggregation Pipeline Quick Reference

We can add more to the stage if we want.

SLIDE 6

db.dirtBikes.aggregate([
  {
    $match: {
      priceNew: { $gt: 2700 },
    },
  },
  {
    $group: {
      _id: "$make",
      num_models: {
        $sum: 1,
      },
      avgCost: {
        $avg: "$priceNew",
      },
    },
  },
]);
{
    "_id": "Kawasaki",
    "num_models": 1,
    "avgCost": 3700
} {
    "_id": "Honda",
    "num_models": 2,
    "avgCost": 3200
}

The aggregation model is a pipeline, similar to Unix pipes.

pipelinepic

collection$project$match$group$sortresult\text{collection} \Rightarrow \underbrace{\$project \Rightarrow \$match \Rightarrow \$group \Rightarrow \$sort } \Rightarrow \text{result}

These pipeline stages can occur more than once and in any order you need.

SLIDE 7

stage action Doc mapping
$project selects values from the documents, even deep inside. 1:11:1
$match filter n:1n:1
$group aggregation, like sum, COUNT n:1n:1
$sort sorting 1:11:1
$skip skips results n:1n:1
$limit limit the results n:1n:1
$unwind flatten or denormalize the data 1:n1:n
$out redirect output to a collection instead of a cursor 1:11:1
$redact limits documemts based on access permission info m:n,mnm:n, m \ge n
$geonear limits documemts based on location info m:n,mnm:n, m \ge n
     

Any number of stages in any order.

Let’s use Visual Studio Code and Compass

Let’s use $match to limit our search:

db.dirtBikes.aggregate([
  {
    $match: {
      priceNew: {
        $lt: 3500,
      },
    },
  },
  {
    $group: {
      _id: "$make",
      num_models: {
        $sum: 1,
      },
      avgCost: {
        $avg: "$priceNew",
      },
    },
  },
]);
{
    "_id": "Honda",
    "num_models": 3,
    "avgCost": 2633.3333333333335
} {
    "_id": "Yamaha",
    "num_models": 2,
    "avgCost": 1800
}

Switch to Census data (courtesy of Jay Runkel of MongoDB)

Let’s see what the general schema is:

db.cData.find()

db.cData.count()

db.cData.distinct("name")

Add a sort:

db.cData.distinct("name").sort()

Now let’s begin exploring aggregate pipelines

$match

db.cData.aggregate([
  {
    $match: {
      region: "South",
    },
  },
]);

db.cData.aggregate([
  {
    $match: {
      region: {
        $ne: "South",
      },
    },
  },
]);

$project - tailoring the pipeline flow

Getting rid of data you don’t need, renaming fields, reshaping the document, and doing calculations.

db.cData.aggregate([
  {
    $project: {
      abv: 1,
    },
  },
]);

And let’s get rid of that annoying _id field.

db.cData.aggregate([
  {
    $project: {
      _id: 0,
      abv: 1,
    },
  },
]);

Selecting subfields (‘.’) force us to use quotes.

db.cData.aggregate([
  {
    $project: {
      "center.coordinates": 1,
    },
  },
]);
db.cData.aggregate([
  {
    $project: {
      "center.coordinates": 1,
      _id: 0,
    },
  },
  {
    $limit: 10,
  },
]);

You can do renaming too using another $

db.cData.aggregate([
  {
    $project: {
      coords: "$center.coordinates",
      _id: 0,
    },
  },
  {
    $limit: 10,
  },
]);

Calculations using an aggregration pipeline

use sample_mflix
db.movies.aggregate(
    [{
        $match: {
            rated: 'UNRATED'
        }
    }, {
        $project: {
            title: 1,
            'tomatoes.viewer.rating': 1
        }
    }, {
        $group: {
            _id: 'avgRating',
            averageViewerRatingOfUNRATED: {
                $avg: '$tomatoes.viewer.rating'
            }
        }
    }]
)

Example from MongoDB:

db.solarSystem.aggregate([
  {
    $project: {
      _id: 0,
      name: 1,
      myWeight: {
        $multiply: [
          {
            $divide: ["$gravity.value", 9.8],
          },
          86,
        ],
      },
    },
  },
]);

Lots of other math operators

$group

The $group aggregate operator defines what field should be used to bundle documents together.

db.cData.aggregate([{ $group: { _id: "$region" } }]);

Ok, but not too interesting. Add an accumulator:

db.cData.aggregate([
  {
    $group: {
      _id: "$region",
      regionCount: { $sum: 1 },
    },
  },
]);

This adds 1 to the new variable region_count for each _id every time a new document matches.

And add a sort stage to add order.

db.cData.aggregate([
  {
    $group: {
      _id: "$region",
      regionCount: { $sum: 1 },
    },
  },
  { $sort: { regionCount: -1 } },
]);

How about total area by region:

db.cData.aggregate([
  {
    $group: {
      _id: "$region",
      regionCount: { $sum: 1 },
      regionArea: { $sum: "$areaKM" },
      avgStateArea: { $avg: "$areaKM" },
    },
  },
  { $sort: { regionCount: -1 } },
]);

The aggregate ignores fields in the $sum or $avg calculation that have an unexpected datatype or are missing entirely.

Suppose we want to look at all the documents rather than grouping them by region.

// (Jay Runkel, MongoDB)
db.cData.aggregate([
  {
    $group: {
      _id: null,
      totalArea: { $sum: "$areaM" },
      avgStateArea: { $avg: "$areaM" },
    },
  },
]);

When you group by null you’re asking MongoDB to group all the documents together as one.

The $group stage has many other handy calculations like $max, $min, $push, and $addToSet.

$project can do accumulations too, HOWEVER they operate on an array in the document NOT on the document set as a whole.

use sample_restaurants
db.restaurants.aggregate([
    { $project: { _id: 0,
                 name: 1,
                 zipCode: "$address.zipcode",
                 maxScore: { $max: "$grades.score"},
                 minScore: { $min: "$grades.score"}}  }
])

end of part 1


Compass Demo

$push

// (Jay Runkel, MongoDB)
db.cData.aggregate([
  {
    $group: {
      _id: "$region",
      totalArea: { $sum: "$areaM" },
      avgArea: {
        $avg: "$areaM",
      },
      numStates: {
        $sum: 1,
      },
      states: {
        $push: "$name",
      },
    },
  },
]);

Here $push is just adding $name to a growing array named states. Note, however, that $push doesn’t ensure that its resulting array is a proper set. That is, it doesn’t guarantee that there will be only one occurrence of each value in the resulting array. If you want a proper set as the result, you must use $addToSet.

Two others aggregation expressions, $first and $last, are available, but they only have meaning after the group is sorted. More on this later.

What if we want to access data that’s embedded in an array?

$unwind

Undoes an array into separate entries and $first to grab the first of a set of results.

// (Jay Runkel, MongoDB)
db.cData.aggregate([
  { $unwind: "$data" },
  { $group: { _id: "$data.year", totalPop: { $sum: "$data.totalPop" } } },
  { $sort: { totalPop: 1 } },
]);

$unwind is used to reach into an embedded array of n elements inside a document and create n documents each with a single field instead of the array, each new document having one of the n elements.

We can use the $match stage to filter the documents for $unwind before continuing to the next stage:

// (Jay Runkel, MongoDB)
db.cData.aggregate([
  { $match: { region: "South" } },
  { $unwind: "$data" },
  { $group: { _id: "$data.year", totalPop: { $sum: "$data.totalPop" } } },
]);

Now we want to show how we can do some calculations using the data the pipeline returns, and leveraging $first and $last.

// find how state populations changed from census to census
// (Jay Runkel, MongoDB)
db.cData.aggregate([
  { $unwind: "$data" },
  { $sort: { "data.year": 1 } },
  {
    $group: {
      _id: "$name",
      pop1990: { $first: "$data.totalPop" },
      pop2010: { $last: "$data.totalPop" },
    },
  },
  {
    $project: {
      _id: 0,
      name: "$_id",
      delta: { $subtract: ["$pop2010", "$pop1990"] },
      pop1990: 1,
      pop2010: 1,
    },
  },
]);

and if we want to sort the output by delta we just add on a $sort:

// find how state populations changed from census to census
// sorted by the delta in populations
// (Jay Runkel, MongoDB)
db.cData.aggregate([
  { $unwind: "$data" },
  { $sort: { "data.year": 1 } },
  {
    $group: {
      _id: "$name",
      pop1990: { $first: "$data.totalPop" },
      pop2010: { $last: "$data.totalPop" },
    },
  },
  {
    $project: {
      _id: 0, // rename to "name"
      name: "$_id",
      delta: { $subtract: ["$pop2010", "$pop1990"] },
      pop1990: 1,
      pop2010: 1,
    },
  },
  { $sort: { delta: 1 } },
]);

$lookup

Roughly equivalent to SQL’s Left Outer Join, it allows you to compare documents in two collections from the same database. To each input document, the $lookup stage adds a new array field whose elements are the matching documents from the “joined” collection.

An example from the documentation:

db.orders.insert([
  { _id: 1, item: "almonds", price: 12, quantity: 2 },
  { _id: 2, item: "pecans", price: 20, quantity: 1 },
  { _id: 3 },
]);

db.inventory.insert([
  { _id: 1, sku: "almonds", description: "product 1", instock: 120 },
  { _id: 2, sku: "bread", description: "product 2", instock: 80 },
  { _id: 3, sku: "cashews", description: "product 3", instock: 60 },
  { _id: 4, sku: "pecans", description: "product 4", instock: 70 },
  { _id: 5, sku: null, description: "Incomplete" },
  { _id: 6 },
]);

// Now we use $lookup
db.orders.aggregate([
  {
    $lookup: {
      from: "inventory",
      localField: "item",
      foreignField: "sku",
      as: "inventory_docs",
    },
  },
]);

results in

{
   "_id" : 1,
   "item" : "almonds",
   "price" : 12,
   "quantity" : 2,
   "inventory_docs" : [
      { "_id" : 1, "sku" : "almonds", "description" : "product 1", "instock" : 120 }
   ]
}
{
   "_id" : 2,
   "item" : "pecans",
   "price" : 20,
   "quantity" : 1,
   "inventory_docs" : [
      { "_id" : 4, "sku" : "pecans", "description" : "product 4", "instock" : 70 }
   ]
}
{
   "_id" : 3,
   "inventory_docs" : [
      { "_id" : 5, "sku" : null, "description" : "Incomplete" },
      { "_id" : 6 }
   ]
}

$geo operators

We’ve mentioned the rich geospatial capabilities of MongoDB . Here’s an example where we want to compare the number of people living with 500Km of Hanover, NH in 1990, 2000, and 2010.

db.cData.aggregate([
  {
    $geoNear: {
      near: { type: "Point", coordinates: [72.2887, 43.7044] },
      distanceField: "dist.calculated",
      maxDistance: 500000,
      includeLocs: "dist.location",
      spherical: true,
    },
  },
  { $unwind: "$data" },
  {
    $group: {
      _id: "$data.year",
      totalPop: { $sum: "$data.totalPop" },
      states: { $addToSet: "$name" },
    },
  },
  { $sort: { _id: 1 } },
  //{$out : "peopleNearDartmouth"}
]);

There are lots of these $geo operators.

$redact

The best way for a document to be protected from unauthorized viewing is to have the document protect itself.

The $redact operator can restrict access to documents based upon access control information stored in the documents.

Here’s an example from the MongoDB docs:

Suppose we have a document:

{
  "_id": 1,
  "title": "123 Department Report",
  "tags": ["G", "STLW"],
  "year": 2014,
  "subsections": [
    {
      "subtitle": "Section 1: Overview",
      "tags": ["SI", "G"],
      "content": "Section 1: This is the content of section 1."
    },
    {
      "subtitle": "Section 2: Analysis",
      "tags": ["STLW"],
      "content": "Section 2: This is the content of section 2."
    },
    {
      "subtitle": "Section 3: Budgeting",
      "tags": ["TK"],
      "content": {
        "text": "Section 3: This is the content of section3.",
        "tags": ["HCS"]
      }
    }
  ]
}

A user with access privileges for tags STLW and G wishes to see all the documents for 2014. We store the user’s privileges into a variable and then invoke a predefined query:

// store the user's privs
var userAccess = ["STLW", "G"];

// run the predefined query
db.forecasts.aggregate([
  { $match: { year: 2014 } },
  {
    $redact: {
      $cond: {
        if: {
          $gt: [{ $size: { $setIntersection: ["$tags", userAccess] } }, 0],
        },
        then: "$$DESCEND",
        else: "$$PRUNE",
      },
    },
  },
]);

and here’s the result:

{
  "_id": 1,
  "title": "123 Department Report",
  "tags": ["G", "STLW"],
  "year": 2014,
  "subsections": [
    {
      "subtitle": "Section 1: Overview",
      "tags": ["SI", "G"],
      "content": "Section 1: This is the content of section 1."
    },
    {
      "subtitle": "Section 2: Analysis",
      "tags": ["STLW"],
      "content": "Section 2: This is the content of section 2."
    }
  ]
}

$out

Use the $out operator as the last stage in a pipeline to create a new collection in the current database. Otherwise, the results of a pipeline are lost (after possibly displaying them).

The value for this operator is the quoted name of the new collection.

Be careful! This operator will happily overwrite an existing collection of the same name.

. . .
  { $out: 'myShinyNewCollection' }
. . .

Aggregation performance considerations

Some recommendations from MongoDB and from “MongoDB in Action”

Try to reduce the flow ASAP

Try to reduce the size and number of documents as quickly as possible in the pipeline.

Leverage indexes

$match and $sort can take advantage of indexes which could speed things up a lot.

After you use any operator other than $match and $sort, indexes will not be used.

So use $match as early in the pipeline as possible.

the aggregation framework’s explain function

The explain option can help developer’s understand how the pipeline is actually getting its work done. It can tell you things like when indexes are being used. You can add the explain after the pipeline definition. See the example below.

Here’s a final example from Jay

Which US division has the fastest growing population density, given

  1. only considering states with more than 1M people
  2. only include divisions with more than 100K square miles

Here is the final aggregate. We’ll look at it here, and then we will look at it in Compass.

// bigquestion.js
db.cData.aggregate(
  [
    { $match: { "data.totalPop": { $gt: 1000000 } } },
    { $unwind: "$data" },
    { $sort: { "data.year": 1 } },
    {
      $group: {
        _id: "$name",
        pop1990: { $first: "$data.totalPop" },
        pop2010: { $last: "$data.totalPop" },
        areaM: { $first: "$areaM" },
        division: { $first: "$division" },
      },
    },
    {
      $group: {
        _id: "$division",
        totalPop1990: { $sum: "$pop1990" },
        totalPop2010: { $sum: "$pop2010" },
        totalAreaM: { $sum: "$areaM" },
      },
    },
    { $match: { totalAreaM: { $gt: 100000 } } },
    {
      $project: {
        _id: 0,
        division: "$_id",
        density1990: { $divide: ["$totalPop1990", "$totalAreaM"] },
        density2010: { $divide: ["$totalPop2010", "$totalAreaM"] },
        denDelta: {
          $subtract: [
            { $divide: ["$totalPop2010", "$totalAreaM"] },
            { $divide: ["$totalPop1990", "$totalAreaM"] },
          ],
        },
        totalAreaM: 1,
        totalPop1990: 1,
        totalPop2010: 1,
      },
    },
    { $sort: { denDelta: -1 } },
    { $limit: 1},
  ],
  //, {"explain": true}    // Just remove the "//" at the beginning of this line
);

{
    "totalPop1990" : 42293785,
    "totalPop2010" : 58277380,
    "totalAreaM" : 290433.4,
    "division" : "South Atlantic",
    "density1990" : 145.623006858027,
    "density2010" : 200.656604922161,
    "denDelta" : 55.0335980641345
}