MongoDB - Replica Sets
Stuff Happens
Like any computer system, database systems are vulnerable to bad things. What sorts of things come to mind?
- network problems
- hardware problems
- preventative maintenance downtime (to head off network and hardware problems)
- loss of power (brownouts, weather, natural disaster, etc.)
- business and social issues (strikes, hacking, etc.)
MongoDB Replica Sets
Replication is how MongoDB achieves fault tolerance and high availability.
Replication is asynchronous, via in-memory pages and the journal (the oplog).
SLIDES 1-16
All the mongods mirror each other, and there is always one primary while the others are secondaries. This assignment may change dynamically.
Drivers (like pymongo) connect to the primary for writing.
Generally, Replica Sets are invisible to the user. All the app needs to know is the address of at least one of the replica set servers; the others may be discovered thereafter.
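As a quick illustration (a sketch, not from the slides; it assumes the localhost replica set built in the demo below), pymongo will discover the whole set from a single seed address:

import pymongo

# One seed address is enough; the driver discovers the rest of the set.
c = pymongo.MongoClient("mongodb://localhost:27018/?replicaSet=cs61")
c.admin.command("ping")            # force a connection so discovery happens

print(c.nodes)                     # every member found via discovery
print(c.primary, c.secondaries)    # current primary and the secondaries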
If the primary goes down or is disconnected due to a network problem, MongoDB performs a failover process. During this process, no writes to the DB can be made until a new primary is elected. The remaining systems in the replica set hold an “election” to identify a new primary. Later, if the previous primary comes back online, it will simply wait for the next election.
Write consistency
Your driver will only write to a primary, and typically reads from there too. It can be configured to read from the secondaries, but then there is always a chance you’ll read stale data. rs.secondaryOk() says it’s OK to read from a secondary (since it could be stale).
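In pymongo the same choice is expressed as a read preference. A minimal sketch (assuming the localhost set from the demo below):

import pymongo

# secondaryPreferred: read from a secondary when one is available,
# accepting that the result may be slightly stale.
c = pymongo.MongoClient(
    "mongodb://localhost:27017,localhost:27018,localhost:27019/"
    "?replicaSet=cs61",
    readPreference="secondaryPreferred")

print(c.test.things.find_one())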
Live Demo
Setting up Replica Sets
Example from MongoDB docs:
➜ mkdir -p ~/data/cs61rs1 ~/data/cs61rs2 ~/data/cs61rs3
➜ mongod --replSet cs61 --logpath "cs61rs1.log" --dbpath ~/data/cs61rs1 --port 27017 --oplogSize 64 --fork
➜ mongod --replSet cs61 --logpath "cs61rs2.log" --dbpath ~/data/cs61rs2 --port 27018 --oplogSize 64 --fork
➜ mongod --replSet cs61 --logpath "cs61rs3.log" --dbpath ~/data/cs61rs3 --port 27019 --oplogSize 64 --fork
Typically do this on separate servers on the same port (27017). If running on a single system, use different ports.
If on a single system, you have to use ps -ax to look at all the processes to see them:
➜ ps -ax | grep mongo
30020 ?? 0:05.17 mongod --replSet cs61 --logpath cs61rs1.log --dbpath /Users/ccpalmer/data/cs61rs1 --port 27017 --oplogSize 64 --fork
30024 ?? 0:05.15 mongod --replSet cs61 --logpath cs61rs2.log --dbpath /Users/ccpalmer/data/cs61rs2 --port 27018 --oplogSize 64 --fork
30128 ?? 0:05.12 mongod --replSet cs61 --logpath cs61rs3.log --dbpath /Users/ccpalmer/data/cs61rs3 --port 27019 --oplogSize 64 --fork
36061 ttys001 0:00.00 grep mongo
➜
The replicas are allocated but not initialized. Use a file like this:
// init_cs61replicas.js
// initialize replica set cs61
config = {
    _id: "cs61",
    version: 1,
    members: [
        { _id: 0, host: "localhost:27017" },
        { _id: 1, host: "localhost:27018" },
        { _id: 2, host: "localhost:27019" },
    ],
};
rs.initiate(config);
rs.status();
then
mongosh --port 27018 < init_cs61replicas.js
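Alternatively, here’s a sketch of initiating the same set from Python; rs.initiate() is just a wrapper around the replSetInitiate command:

import pymongo

# Connect directly to one member (the set isn't functioning yet).
c = pymongo.MongoClient("mongodb://localhost:27017", directConnection=True)

config = {
    "_id": "cs61",
    "version": 1,
    "members": [
        {"_id": 0, "host": "localhost:27017"},
        {"_id": 1, "host": "localhost:27018"},
        {"_id": 2, "host": "localhost:27019"},
    ],
}
c.admin.command("replSetInitiate", config)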
Write Concern
w and j are settings made with your driver that essentially define the write paranoia level for the DB. Write concern describes the level of acknowledgment requested from MongoDB for write operations, whether to a standalone mongod, to a replica set, or to a sharded cluster.
The various language drivers provide a way to choose these values:
- w
  - 0: don’t wait for acknowledgement from the server
  - 1: wait for acknowledgement, but don’t wait for secondaries to replicate
  - >=2: wait for one or more secondaries to also acknowledge
- wtimeout: how long to wait for secondaries to acknowledge before failing
  - 0: wait indefinitely
  - >0: time to wait in milliseconds
- j: if true, block until the primary’s journal writes have been committed to disk
These can be set based on a connection, a collection, or as a default across the entire replica set.
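For example, here’s a hedged pymongo sketch showing two of those scopes (the collection name is just illustrative):

import pymongo
from pymongo.write_concern import WriteConcern

# Connection-wide default: wait for 2 members and the journal,
# giving up after 5 seconds.
c = pymongo.MongoClient(
    "mongodb://localhost:27017,localhost:27018,localhost:27019/"
    "?replicaSet=cs61",
    w=2, wTimeoutMS=5000, journal=True)

# Per-collection override: fire-and-forget writes (w=0).
events = c.test.get_collection("events",
                               write_concern=WriteConcern(w=0))
events.insert_one({"kind": "click"})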
Heartbeat and failover
Every replica set member pings all the other members every two seconds by default. The rs.status() command output shows the last timestamp for each of these heartbeats, along with a guess as to that replica’s health (0 or 1). The rs.status() output also identifies the primary and secondary(ies).
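The same information is available from a driver. A small sketch using replSetGetStatus, the command that rs.status() wraps:

import pymongo

c = pymongo.MongoClient("mongodb://localhost:27017", directConnection=True)
status = c.admin.command("replSetGetStatus")

for m in status["members"]:
    # member name, state (PRIMARY/SECONDARY/...), and the 0-or-1 health guess
    print(m["name"], m["stateStr"], m["health"])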
If any node is unresponsive, something must happen. There must always be exactly one primary node among the replicas.
- If the primary is the one no longer responding, then an election takes place amongst the others, provided that at least a majority of the replica set’s nodes are visible (available).
- If a secondary goes offline, then as long as a majority of the replica set is still visible to each other, the replica set will continue as is until the secondary is back online.
The election is pretty straightforward: the most up-to-date secondary will be chosen, and if there’s a tie there are other strategies that can be used.
It’s possible that a majority of the nodes will not be visible after some event. If this occurs, any remaining primary will notice it and demote itself to secondary. This prevents any further writes, and gives the remaining secondaries a chance to catch up with each other if necessary.
The oplog
The oplog is a collection in the local database on every node. It is the “journal” that we’ve been talking about.
Every time a client program writes data to the primary, an entry with enough information to reproduce that write is added to the primary’s oplog. Once that entry is replicated to a secondary, the secondary’s oplog is also updated to store a record of the write.
Some of the fields are:
- ts: timestamp of the oplog entry
- h: a unique identifier for the oplog entry
- op: type of operation performed (usually i/u/d for insert, update or delete)
- ns: database & collection affected
- o: the operation’s document (for an insert, the new document; for an update, the modification applied)
swamp: [25 - Replicas] (master)$ mongosh
MongoDB shell version v3.4.4
connecting to: mongodb://127.0.0.1:27017
MongoDB server version: 3.4.4
Server has startup warnings:
2017-05-24T08:57:48.999-0400 I CONTROL [initandlisten]
2017-05-24T08:57:48.999-0400 I CONTROL [initandlisten] ** WARNING: Access control is not enabled for the database.
2017-05-24T08:57:48.999-0400 I CONTROL [initandlisten] ** Read and write access to data and configuration is unrestricted.
2017-05-24T08:57:48.999-0400 I CONTROL [initandlisten]
cs61:PRIMARY> use local
switched to db local
cs61:PRIMARY> show collections
oplog.rs
replset.election
replset.initialSyncId
replset.minvalid
replset.oplogTruncateAfterPoint
startup_log
system.replset
system.rollback.id
system.tenantMigration.oplogView [view]
system.views
cs61 [direct: secondary] local> db.oplog.rs.findOne()
{
op: 'n',
ns: '',
o: { msg: 'initiating set' },
ts: Timestamp({ t: 1668131037, i: 1 }),
v: Long("2"),
wall: ISODate("2022-11-11T01:43:57.473Z")
}
cs61:PRIMARY> db.oplog.rs.find({"op": "i"})
[
...
{
op: 'i',
ns: 'config.system.sessions',
ui: new UUID("3ae3fec2-2c21-467c-8039-ed4948ef7210"),
o: {
_id: {
id: new UUID("7ba4594b-1fe8-4f01-a201-073fb067f127"),
uid: Binary(Buffer.from("e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855", "hex"), 0)
},
lastUse: ISODate("2022-11-11T02:14:43.588Z")
},
o2: {
_id: {
id: new UUID("7ba4594b-1fe8-4f01-a201-073fb067f127"),
uid: Binary(Buffer.from("e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855", "hex"), 0)
}
},
ts: Timestamp({ t: 1668132883, i: 3 }),
t: Long("1"),
v: Long("2"),
wall: ISODate("2022-11-11T02:14:43.588Z")
},{
"ts": Timestamp(1495630921, 2),
"t": NumberLong(1),
"h": NumberLong("-7754511547043739871"),
"v": 2,
"op": "i",
"ns": "rstest.test",
"o": {
"_id": ObjectId("59258449c045bf8131ff6a2d"),
"dept": "CS"
}
}, {
"ts": Timestamp(1495632469, 1),
"t": NumberLong(1),
"h": NumberLong("5792912823235359366"),
"v": 2,
"op": "i",
"ns": "rstest.test",
"o": {
"_id": ObjectId("59258a55ecb46fb177ec1f14"),
"dept": "EC"
}
}
...
]
cs61:PRIMARY> use rstest
switched to db rstest
cs61:PRIMARY> db.test.update({"dept": "CS"}, {"$set": {"building": "Sudikoff"}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
cs61:PRIMARY> show dbs
admin 0.000GB
local 0.000GB
rstest 0.000GB
cs61:PRIMARY> use local
switched to db local
cs61:PRIMARY> db.oplog.rs.find({"op": "u"})
{ "ts" : Timestamp(1495632997, 2), "t" : NumberLong(1), "h" : NumberLong("3760718468331560622"), "v" : 2, "op" : "u", "ns" : "rstest.test", "o2" : { "_id" : ObjectId("59258449c045bf8131ff6a2d") }, "o" : { "$set" : { "building" : "Sudikoff" } } }
cs61:PRIMARY>
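The same queries work from a driver; here’s a small pymongo sketch that pulls the most recent insert entries out of the oplog:

import pymongo

c = pymongo.MongoClient(
    "mongodb://localhost:27017,localhost:27018,localhost:27019/"
    "?replicaSet=cs61")
oplog = c.local["oplog.rs"]

# The five most recent inserts, newest first.
for entry in oplog.find({"op": "i"}).sort("ts", pymongo.DESCENDING).limit(5):
    print(entry["ts"], entry["ns"], entry["o"])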
Connecting to a replica set
Using the mongosh shell, the same way we’ve been connecting to Atlas:
$ mongosh mongodb://localhost:27017,localhost:27018,localhost:27019/?replicaSet=cs61
Current Mongosh Log ID: 636db3dbf2d54643dc3b4238
Connecting to: mongodb://localhost:27017,localhost:27018,localhost:27019/?replicaSet=cs61&serverSelectionTimeoutMS=2000&appName=mongosh+1.6.0
Using MongoDB: 6.0.1
Using Mongosh: 1.6.0
...
cs61 [primary] cs61> rs.printReplicationInfo()
actual oplog size
'64 MB'
---
configured oplog size
'64 MB'
---
log length start to end
'176776 secs (49.1 hrs)'
---
oplog first event time
'Sun Mar 02 2025 22:11:43 GMT-0500 (Eastern Standard Time)'
---
oplog last event time
'Tue Mar 04 2025 23:17:59 GMT-0500 (Eastern Standard Time)'
---
now
'Tue Mar 04 2025 23:18:01 GMT-0500 (Eastern Standard Time)'
cs61 [primary] test> rs.printSecondaryReplicationInfo()
source: localhost:27017
{
syncedTo: 'Tue Mar 04 2025 23:17:59 GMT-0500 (Eastern Standard Time)',
replLag: '0 secs (0 hrs) behind the primary '
}
---
source: localhost:27019
{
syncedTo: 'Tue Mar 04 2025 23:17:59 GMT-0500 (Eastern Standard Time)',
replLag: '0 secs (0 hrs) behind the primary '
}
cs61 [primary] cs61> use test
switched to db test
cs61 [primary] test> db
test
cs61 [primary] test> show collections
cs61 [primary] test> db.newcoll.insertOne({"name": "Bart Simpson", "age": 10})
{
acknowledged: true,
insertedId: ObjectId("636db3eaf2d54643dc3b4239")
}
cs61 [primary] test> use local
switched to db local
cs61 [primary] local> db.oplog.rs.find({"op" : "i"})
...
{
lsid: {
id: new UUID("5290f1ad-d60e-4dfd-ae5d-26d73b497096"),
uid: Binary(Buffer.from("e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855", "hex"), 0)
},
txnNumber: Long("1"),
op: 'i',
ns: 'test.newcoll',
ui: new UUID("dac9345f-4df4-4802-b311-c11c6a8e4aa3"),
o: {
_id: ObjectId("636db3eaf2d54643dc3b4239"),
name: 'Bart Simpson',
age: 10
},
o2: { _id: ObjectId("636db3eaf2d54643dc3b4239") },
stmtId: 0,
ts: Timestamp({ t: 1668133866, i: 2 }),
t: Long("1"),
v: Long("2"),
wall: ISODate("2022-11-11T02:31:06.417Z"),
prevOpTime: { ts: Timestamp({ t: 0, i: 0 }), t: Long("-1") }
}
]
Here’s a Python example from the MongoDB folks:
import pymongo

c = pymongo.MongoClient(host=["mongodb://localhost:27017",
                              "mongodb://localhost:27018",
                              "mongodb://localhost:27019"],
                        replicaSet="cs61",
                        w=1,
                        journal=True)   # journal is the "j" setting
db = c.test
depts = db.depts
try:
    print("inserting")
    depts.insert_one({"dept": "Economics", "abbrev": "EC"})
    print("inserting")
    depts.insert_one({"dept": "Computer Science", "abbrev": "CS"})
    print("inserting")
    depts.insert_one({"dept": "Electrical Engineering", "abbrev": "EE"})
except Exception as e:
    print("Unexpected error:", type(e), e)
print("completed the inserts")
Run the example and then return to look at the oplog.
When failover occurs
If the primary goes offline, all writes stop. Reads may continue, depending on rs.secondaryOk(), but the default is no.
If you write code that does not catch exceptions, when an anomaly occurs your code will likely abort. This is a bad idea.
Here’s an example that does a bunch of inserts while not watching for exceptions. We’ll disturb the replica set during its execution to see what happens.
#!/usr/bin/env python
"""
Insert documents without handling exceptions properly (NOT RECOMMENDED)
[rep_insert_things.py]
"""
# Code from MongoDB for Developers (Python) class
#
import pymongo
import time

c = pymongo.MongoClient('mongodb://localhost:27017,localhost:27018,localhost:27019/test?replicaSet=cs61')
db = c.cs61
things = db.things
things.delete_many({})            # remove all the docs in the collection

for i in range(0, 500):
    try:
        things.insert_one({'_id': i})
        print("Inserted Document: " + str(i))
        time.sleep(.1)
    except Exception as e:
        # catches everything, prints, and gives up on this document
        print("Exception ", type(e), e)
Let’s try it:
- determine which replica set member is the primary
- connect to that with mongosh
- run the python in another window
- back in the primary’s shell, run db.adminCommand({replSetStepDown: 1, secondaryCatchUpPeriod: 1, force: true})
- then back to the python screen to see if writes slowed or were lost
Example from MongoDB for how to properly handle anomalies during execution
#!/usr/bin/env python
"""
Insert documents in a safe manner, catching expected errors.
"""
# Code from MongoDB for Developers (Python) class
#
import pymongo
import time

c = pymongo.MongoClient('mongodb://localhost:27017,localhost:27018,localhost:27019/test?replicaSet=cs61')
db = c.cs61
things = db.things
things.delete_many({})            # remove all the docs in the collection

for i in range(0, 500):
    for retry in range(3):
        try:
            things.insert_one({'_id': i})
            print("Inserted Document: " + str(i))
            time.sleep(.1)
            break
        except pymongo.errors.AutoReconnect as e:
            # primary failure, network anomaly, etc.
            print("Exception ", type(e), e)
            print("Retrying..")
            time.sleep(5)
        except pymongo.errors.DuplicateKeyError as e:
            # just in case the insert had succeeded when
            # the AutoReconnect came in
            print("duplicate..but it worked")
            break
Let’s try it:
- determine which replica set member is the primary
- connect to that with mongosh
- run the python in another window
- back in the primary’s shell, run db.adminCommand({replSetStepDown: 1, secondaryCatchUpPeriod: 1, force: true})
- then back to the python screen to note that no writes were lost
This is how you should write code to do inserts on Replica Sets!
After the failover
After failover occurs, the original primary may return online. When that primary went down, some of its writes may not have been fully replicated to the others. Thus, those writes aren’t in place on the secondary systems, including the one that is now the primary.
Then suppose that original primary comes back up. If it detects that it has writes that were never replicated, that system will roll back all of those writes. The details are saved in a file just in case.
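Those saved details are plain BSON documents under the member’s dbpath. A hedged sketch for inspecting them with the bson package that ships with pymongo (the path below is a hypothetical placeholder; real rollback filenames include the collection and a timestamp):

import bson

# Placeholder path; rollback files land under <dbpath>/rollback/.
path = "~/data/cs61rs1/rollback/rstest.test/removed.<timestamp>.bson"

with open(path, "rb") as f:
    for doc in bson.decode_all(f.read()):
        print(doc)   # each rolled-back document, as it stood on the old primary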
Other settings for w and j
If your db updates are really important, you may want to use different settings for w and/or j.
In our example with three replicas, setting w=3 tells the primary to block write operations until they have replicated to w-1 replicas.
# Code from MongoDB for Developers (Python) class
#
import pymongo

c = pymongo.MongoClient(host="mongodb://localhost:27017",
                        replicaSet="rs1",
                        w=3, wTimeoutMS=10000, journal=True,
                        readPreference="secondary")   # read from secondaries
db = c.m101
people = db.people

print("inserting")
people.insert_one({"name": "Andrew Erlichson", "favorite_color": "blue"})
print("inserting")
people.insert_one({"name": "Richard Krueter", "favorite_color": "red"})
print("inserting")
people.insert_one({"name": "Dwight Merriman", "favorite_color": "green"})
Orderly replica set shutdown
- On each secondary, issue use admin and then db.runCommand({ replSetFreeze: 120 }) to make it flush, clean up, and wait. This keeps them from stepping up as a primary.
- On the primary, issue use admin and then rs.stepDown() to have it demote itself to secondary.
- Once every member of the replica set is a secondary, run use admin and then db.shutdownServer() on each of them.
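To close, a sketch (untested; it assumes the three-member localhost set from this demo) that drives the same sequence from Python with admin commands:

import pymongo

ports = [27017, 27018, 27019]
clients = {p: pymongo.MongoClient("mongodb://localhost:%d" % p,
                                  directConnection=True)
           for p in ports}

# Freeze the secondaries so none of them steps up during the shutdown.
primary = None
for p, c in clients.items():
    if c.admin.command("hello").get("isWritablePrimary"):
        primary = p
    else:
        c.admin.command("replSetFreeze", 120)

# Demote the primary; the connection may drop as it steps down.
try:
    clients[primary].admin.command("replSetStepDown", 60)
except pymongo.errors.AutoReconnect:
    pass

# Shut each member down; the closing connection raises, which we ignore.
for c in clients.values():
    try:
        c.admin.command("shutdown")
    except pymongo.errors.ConnectionFailure:
        pass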