Schema design

Admin

Many of the following guidelines are taken from the MongoDB Documentation and from MongoDB in Action: Second edition (a.k.a., “MiA”) .

To embed, or not to embed

For Document-oriented databases like MongoDB this is the #1 non-syntax question.

RDBMS designers focus on removing duplicate information. With NoSQL databases, and MongoDB in particular, this isn’t always the best choice.

Like many questions of design and style, the best answer is “it depends”.

When designing a MongoDB schema, you need to start with a question that you’d never consider when using SQL: what is the cardinality of the relationship? Put less formally: you need to characterize your “One-to-N” relationship with a bit more nuance: is it “one-to-few”, “one-to-many”, or “one-to-squillions”? William Zola, MongoDB

SLIDE 8 Here’s a simple guidelines from MiA:

Embed when the child objects never appear outside the context of their parent. Otherwise, store the child objects in a separate collection.

Consider the Blog comments example. If comments only occur with respect to a specific post and the comments for posts only ever appear in chronological (i.e., ObjectId) order with the post, then embedding should be fine.

If your use cases include the ability to display post comments by author name, date ranges, or other comment-specific attribute, then you should consider a reference model, most likely using ObjectIds.

Here’s the guideline offered by the late great William Zola, Lead Technical Support Engineer at MongoDB:

you need to characterize your “One-to-N” relationship with a bit more nuance: is it “one-to-few”, “one-to-many”, or “one-to-squillions”? Depending on which one it is, you’d use a different format to model the relationship.

Here are some examples similar to the ones Will used:

SLIDE 9

one-to-few

db.student.findOne()
{
	fname: 'Lisa',
	lname: 'Simpson',
	sId: 'x0831562',
	addresses: [{
			street: '16 Hilfiger Hall',
			city: 'Cambridge',
			state: 'MA'
		},
		{
			street: '742 Evergreen Terrace',
			city: 'Springfield',
		}
	]
}

Simply embedding multiple addresses works well, even though there is a chance of duplicate storage of address entries if another Simpson is lives at either address at the same time.

Embedding does provide a slight performance advantage over the alternatives but, again, the magnitude of that advantage will vary by the application.

Embedding does make it a little harder to reach these embedded fields, as in queries like “all students with an address in Springfield, MO” or “all post comments by user Alice in the last 30 days”.

SLIDE 10

one-to-many

Situations where one-to-many relationships arise are commonplace

describing a parts list for an assembly
pick list for order fulfillment
manuscripts in a journal

If the “many” is no more than a few hundred, the following scheme works well:

db.items.insertMany([
  {
    _id: ObjectId("59242cee60ae8a3ae6aaeee0"),
    itemNo: "FM191201",
    name: "Nerdit Gamer Backpack",
    qty: 1,
    area: "B427",
    rack: 4,
    bin: "702",
  },
  {
    _id: ObjectId("59242cee60ae8a3ae6aaeee1"),
    itemNo: "FM191203",
    name: "Nerdit Gamer Pack Strap",
    qty: 1,
    area: "B427",
    rack: 4,
    bin: "702",
  },
  {
    _id: ObjectId("59242cee60ae8a3ae6aaeee2"),
    itemNo: "FM191205",
    name: "Nerdit Gamer Pack bottle",
    qty: 1,
    area: "B427",
    rack: 4,
    bin: "704",
  },
]);

Each order has an array of the ObjectId’s of the items in the order.

db.orders.insertOne({
  orderNo: 91269192,
  itemList: [
    // array of references to Part documents
    ObjectId("59242cee60ae8a3ae6aaeee0"), // reference to Backpack
    ObjectId("59242cee60ae8a3ae6aaeee1"), // reference to another item
    ObjectId("59242cee60ae8a3ae6aaeee2"), // reference to another item
  ],
});

SLIDE 11

Then getting the list of items out of the order is easy with a MongoDB join:

// Fetch the order document
order = db.orders.findOne({
  orderNo: 91269192,
});
// Fetch all the Parts that are linked to this Product
orderItems = db.items
  .find({
    _id: {
      $in: order.itemList,
    },
  })
  .toArray(); // return all the elements of the cursor as an array in memory

This scheme also paves the way for many-to-many relationship between orders and items (assuming the items are not “one of a kind”, as in an art auction or real estate transaction).

“one-to-squillions”

SLIDE 12

Suppose your IT department has all its web servers sending their logs to a one system which places them in a MongoDB database.

This situation brings up MongoDB’s 16MB limit on document size. If a web server is particularly busy, then its log entries would overflow a 16MB document rather quickly.

So we turn the model upside down and have the log messages reference the web server they came from.

SLIDE 13

db.webserver.insertOne({
  _id: ObjectId("59242cee60ae8a3ae6aaeee0"),
  name: "www1.beebleford.com",
  ipAddr: "127.66.67.68",
});

db.logmsg.insertOne({
  server: ObjectId("59242cee60ae8a3ae6aaeee0"),
  time: ISODate("2017-05-23T17:42:51.382Z"),
  ipAddr: "182.77.43.137",
  cmd: "GET /~campbell/cs50/programming.css HTTP/1.1",
  codes: "200 8052",
  url: "http://www.cs.dartmouth.edu/~campbell/cs50/programming.html",
  details:
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
});

SLIDE 14

To review the last 1000 transactions after some event we just do :

// find the parent ‘server’ document (assuming a unique IP address)
server = db.webserver.findOne({
  ipAddr: "127.66.67.68",
});

// find the most recent 1000 log message documents linked to that web server

last1000 = db.logmsg
  .find({
    webserver: server._id,
  })
  .sort({
    time: -1,
  })
  .limit(1000)
  .toArray();

SLIDE 15

Will’s $1\text{-to-}N$ guidelines summarized:

Will the entities on the $N$ side of the $1\text{-to-}N$ ever need to stand alone?

What is the cardinality of the relationship: is it one-to-few; one-to-many; or one-to-squillions? then based on those answers, choose:

Embed the $N$ side if the cardinality is one-to-few and there is no need to access the embedded object outside the context of the parent object

Use an array of references to the objects on the $N$ side if the cardinality is one-to-many or if the objects on the $N$ side should ever need to stand alone for any reason

Use a reference to the one-side in the objects on the $N$ side objects if the cardinality is one-to-squillions

Many-to-Many

SLIDES 16-17

Doubly-linked documents

A ToDo List example, where we need to find all the tasks in a project as well as the people assigned to do each one.

SHOW EXTENDED EXAMPLE todo.json from Examples

(Example based on and entry on Will Nola’s blog)

There’s a collection holding person documents, a “tasks” collection holding Task documents, and a One-to-N relationship from Person to Task. The application will need to track all of the Tasks owned by a Person, so we will need to reference Person to Task.

With the array of references to Task documents, a single Person document might look like this:

db.person.findOne()
{
    _id: ObjectID("AAF1"),
    name: "Kate Monster",
    tasks [     // array of references to Task documents
        ObjectID("ADF9"), 
        ObjectID("AE02"),
        ObjectID("AE73") 
        // etc
    ]
}

On the other hand, in some other contexts this application will display a list of Tasks (for example, all of the Tasks in a multi-person Project) and it will need to quickly find which Person is responsible for each Task. You can optimize data retrieval for this purpose by putting an additional reference to the Person in the Task document.

db.tasks.findOne()
{
    _id: ObjectID("ADF9"), 
    description: "Write lesson plan",
    due_date:  ISODate("2014-04-01"),
    owner: ObjectID("AAF1")     // Reference to Person document
}

This enables the two-way searches, but it will require a double update if a project task needs to be reassigned.

IMPORTANT: This will be a non-Atomic update since two documents are being updated serially (i.e., one after the other).

Materialized paths

SLIDE 18

Maintain a path in a document by concatenating _id’s (with ‘:’ separators). Useful for webpage paths, comment or email threads, etc.

//materializedPaths.js
{
	_id: ObjectId("4d692b5d59e212384d95003"),
	depth: 2,
	path: "4d692b5d59e212384d95001:4d692b5d59e212384d951002",
	created: ISODate("2017-02-26T17:20:01.251Z"),
	username: "homer",
	body: "Now where did I put that beer ... DOH!",
	thread_id: ObjectId("4d692b5d59e212384d95223a")
}

and using code or RegEx’s you can dig around in it :

db.comments.find({
  path: /^4d692b5d59e212384d95001/,
});

Denormalization

Denormalization only makes sense when you have a high read to write ratio

Will’s example (fragment):

// code fragment
{
	name: 'left-handed smoke shifter',
	manufacturer: 'Acme Corp',
	catalog_number: 1234,
	parts: [ // array of references to Part documents
		ObjectID('AAAA'), // reference to the #4 grommet above
		ObjectID('F17C'), // reference to a different Part
		ObjectID('D2AA'),
		// etc
	]
}

If you are often doing an app-level join to get the part names, then perhaps denormalization is a good idea:

// code fragment
{
	name: 'left-handed smoke shifter',
	manufacturer: 'Acme Corp',
	catalog_number: 1234,
	parts: [
		// Part name is denormalized
		{
			id: ObjectID('AAAA'),
			name: '#4 grommet'
		},
		{
			id: ObjectID('F17C'),
			name: 'fan blade assembly'
		},
		{
			id: ObjectID('D2AA'),
			name: 'power switch'
		}
	]
}

Then if you need other fields for a part, it’s up to the app to get to it.

// Fetch the product document
>
product = db.products.findOne({
	catalog_number: 1234
});

// Create an array of ObjectID()s containing *just* the part numbers
>
part_ids = product.parts.map(function (doc) {
	return doc.id
});

// Fetch all the Parts that are linked to this Product
>
product_parts = db.parts.find({
	_id: {
		$in: part_ids
	}
}).toArray();

Here map applies the function to each document visited by the cursor and collects the return values from successive application into an array.

Here’s a larger logfile example from Will. A new log message is added to the logmsg :

//  Get log message from monitoring system
logmsg = get_log_msg();
log_message_here = logmsg.msg;
log_ip = logmsg.ipaddr;

// Get current timestamp
now = new Date()

// Find the _id for the host I’m updating
host_doc = db.hosts.findOne({
	ipaddr: log_ip
}, {
	_id: 1
}); // Don’t return the whole document
host_id = host_doc._id;

// Insert the log message, the parent reference, and the denormalized data into the ‘many’ side
db.logmsg.save({
	time: now,
	message: log_message_here,
	ipaddr: log_ip,
	host: host_id)
});

// Push the denormalized log message onto the ‘one’ side
db.hosts.update({
	_id: host_id
}, {
	$push: {
		logmsgs: {
			$each: [{
				time: now,
				message: log_message_here
			}],
			$sort: {
				time: 1
			}, // Only keep the latest ones
			$slice: -1000
		} // Only keep the latest 1000
	}
});

$each appends multiple values to an array field
$push operator appends a specified value to an array
$slice limits the number of array elements.

You should note the use of the {_id:1} specification that avoids the network overhead of sending back the whole document, instead sending back only the ObjectId.

You’ll see the log message is saved as a document twice, once kept forever logmsg and another in hosts where only the last 1000 messages are kept handy.

Detailed example from MiA

describe the app: inventory for ToolsRUs
- gardening tools
- repairs
- hardware
Sketch hierarchy
- Catagory documents
  - ancestors[ {_id, name}… ]
  - parent_ancestor
  - name
  - description
describe base product info
- _id, name, full_description
- details: weight, weight_units, model, manuf, color
- reviews
- …
- price_history [
- { retail, sale, date_start, date_end}
- { retail, sale, date_start, date_end} ]
- catagory_ids: objectid, objectid, …
- tags: {“…”,”…”,”…}
Go back and add a “slug” to avoid having Objectid’s in URLs
. . .