[^ref]Key references used for this NoSQL lecture: Fowler et al., “NoSQL Distilled”; Robinson et al., “Graph Databases (Early Release)”; McCreary & Kelly, “Making Sense of NoSQL”. Other references include texts by Coronel, Widom, Ullman, Jukic, and Silberschatz.

NoSQL Motivation

In NoSQL, “SQL” refers to traditional RDBMS’s. The term arose from the recognition that for some systems a non-relational data store might be a better choice. Thus, “Not Only SQL”.

The term entered common use in 2009, thanks to Eric Evans, then at Rackspace.

Users include YouTube, Google, Flickr, Twitter, Facebook, eBay, and Wikipedia, among others.

RDBMS values

“Efficient, reliable, convenient, and safe multi-user storage of and access to massive amounts of persistent data” – Widom

  • Persistent data access

    • We want our data when we want it!
    • Archives, warehousing, or even this morning’s data.
    • Analytics and BI
    • Main memory is fast, but finite
  • Concurrency

    • Never let the customer wait
    • internal as well as external users
  • Integration with business processes

    • large # of processes and applications need to share data.
    • resulting in a shared database environment

Some of these capabilities may be more than we need.

Some of these capabilities may prevent us from doing what we want!

Impedance Mismatch

Developers think in data structures that reside in memory. These in-memory data structures are typically more flexible and more representative of the real world.

Example: The data making up a product order appears as a single entity on the UI, but it really represents several entries in various RDBMS tables, such as “customers”, “orders”, “order items”, “credit-card-info”, etc., with all the foreign keys needed to tie them together.

In memory it’s likely a single or a few data structures.

The issue is the requisite simplicity of RDBMS entries.

  • the values in an RDBMS tuple must be simple - no vectors, structures, or nested data structures.
  • So, if it makes sense to use a richer in-memory data structure, you sometimes have to struggle to fit it into an RDBMS.
  • Thus, the term impedance mismatch: two different representations requiring translation back and forth (a small sketch follows).
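
A minimal sketch of the mismatch (Python, hypothetical names and table layout): the order below is one nested in-memory structure, but persisting it relationally means flattening it into simple rows spread across several tables.

# One in-memory aggregate: an order with nested customer, line items, and payment info.
order = {
    "order_id": 1001,
    "customer": {"id": 42, "name": "Ebenezer Coot"},
    "items": [
        {"upc": 293012429, "price": 79.95},
        {"upc": 829381427, "price": 59.95},
    ],
    "payment": {"ccard": "1234-5678-9876-5432", "exp": "0115"},
}

# Relational storage forces the same data into flat rows in separate tables,
# tied back together with foreign keys (order_id, customer_id).
orders_row      = (order["order_id"], order["customer"]["id"])
order_item_rows = [(order["order_id"], i["upc"], i["price"]) for i in order["items"]]
payment_row     = (order["order_id"], order["payment"]["ccard"], order["payment"]["exp"])

print(orders_row, order_item_rows, payment_row, sep="\n")

Reassembling the aggregate later requires joins across those tables; that round trip is the translation the term refers to.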

The rise of Web Services

There are other mismatches

  • in the RDBMS world, applications sharing a database must agree on a common schema.
    • If the customer’s table columns change then queries and applications may fail.
    • Indexes needed by one application might get in the way of another

One response is to place a single application in front of the RDBMS and have it channel, convert, and map requests to the database. Then only this single application has to know all about the DB structure, rather than every user application.

This moves interoperability requirements to the interfaces, and thus to the network.

The deluge

Next came the problem of the four V’s, leading to scalability challenges.

Volume of data

Large data sets become difficult to use when stored in an RDBMS.

Queries begin to take too long and joins get HUGE (“join pain”), due to the underlying relational model, which tends to build the set of all possible answers to a query before filtering it down to the right one.

Velocity of data

the rate at which data changes over time

rarely static - bursty

the data store must be able to handle regular high-speed writes as well as bursts

the values of specific properties can change, sometimes quite rapidly

even the structure of the data can vary over time, sometimes quite rapidly

Causes?

  • business needs can change quickly
  • the data being collected can also change quickly
  • environmental or geopolitical changes

Variety of data

data may sometimes be well structured and other times unstructured; it may be dense, sparse, connected, or not

Veracity

It has always been wise to verify your data (“Trust, but verify”). It’s the same now, but at an enormously higher data rate.

Emergence of NoSQL

All this led to the NoSQL (Not Only SQL) revolution: NoSQL here meaning a non-relational data store. The term NoSQL was first used in 1998, but NoSQL databases really took off in 2009 with the rise of Web Services and Web 2.0 (user-generated content such as blogs and social media, easy programming, and everything able to communicate with everything).

Companies could keep buying bigger and bigger centralized computers, but at the time that only addressed some of the challenges; NoSQL’s more lenient consistency models (discussed below) are part of what made the alternative possible.

Concerns about resilience arose, and the very real drive toward widely distributed, always-on, applications encouraged clusters of commodity hardware instead.

The RDBMS’s of the day were NOT designed to run on clusters …

Some systems could handle writing to shared disk subsystems, but that wasn’t always possible in widely distributed systems AND it didn’t help resilience.

The web has led to applications that must easily store and process data which is bigger in volume, changes more rapidly, is often sparse across the domains, and is more structurally varied than traditional RDBMS can handle.

As Ivano Malavolta (University of L’Aquila) described the situation: “Do you have a large set of uncontrolled, unstructured, data that you are trying to fit into a RDBMS?”

  • Storage of large amounts of non-transactional data, such as log analysis, web statistics, etc.
  • Caching results from slower databases (Twitter, Snapchat, Instagram)
  • Data denormalization to avoid expensive join queries
  • Managing data that is not easily analyzed in an RDBMS, such as time- or location-based data
  • Real-time systems, such as games, financial data, social media, etc.

CAP Theorem (Brewer 2000)

This distributed database idea had a problem though.

Slide 1

Proven by Gilbert and Lynch in 2002

It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

Consistency The same definition we’ve been using: every read returns the most recent data, or an error.

Availability Every request receives a response (no error) – but without a guarantee that it contains the most recent data

Partition-tolerance The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes

Note: It seems unlikely that we would ever elect to forego Partition-tolerance in favor of the other two, since doing so would effectively prevent their success.

Example

Slide 2

Alice in London, Bob in Mumbai, both want to reserve a room at Hotel Zed in Paris

Reservation system has two nodes, one in London and the other in Mumbai, with a comm-link

There’s only one room left

Problems can arise

If there was only one server

  • It’s always consistent and available (if it’s up)
  • It isn’t subject to partitioning

This is the typical DB today

The two nodes (systems) have to agree on serializing their requests

  • This fails if the link goes down

  • One node can be Primary, the other a Secondary
  • Requests from the Secondary must go thru the Primary

  • If the link goes down, we have a failure of Availability in Brewer’s terms since Alice can talk to London but can’t make a reservation

  • Both nodes can be allowed to continue to take reservations, but overbooking can result

Broken consistency

  • This might be acceptable since there are usually “no-shows” to remedy any overbooking (see the toy simulation below)
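
A toy Python sketch of the trade-off in the hotel example (all names made up): during a partition each node must either refuse the request, giving up availability, or accept it locally, giving up consistency and risking overbooking.

class ReservationNode:
    def __init__(self, name, rooms_left):
        self.name = name
        self.rooms_left = rooms_left          # this node's local view of inventory

    def reserve(self, guest, partitioned, favor_consistency):
        if partitioned and favor_consistency:
            # CP choice: refuse rather than risk a stale answer.
            return f"{guest}: error, cannot reach peer node"
        if self.rooms_left > 0:
            self.rooms_left -= 1              # AP choice: accept locally, peer not told
            return f"{guest}: confirmed at {self.name}"
        return f"{guest}: sold out"

london = ReservationNode("London", rooms_left=1)
mumbai = ReservationNode("Mumbai", rooms_left=1)

# The link between the nodes is down; both favor availability, so both say yes.
print(london.reserve("Alice", partitioned=True, favor_consistency=False))
print(mumbai.reserve("Bob", partitioned=True, favor_consistency=False))
print("Rooms promised:", (1 - london.rooms_left) + (1 - mumbai.rooms_left), "for 1 real room")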

(OPTIONAL) Other implications of relaxing ACID

Relaxing durability

  • In-memory DB would have great performance, but all is lost after a shutdown

  • Capturing telemetry at high speed might be more important than the few data points you miss if the server crashes

  • Replicated data might be lost if the Secondary sends it up to the Primary but the Primary crashes before replicating it back out to the others.

ACID is nice, but it comes with a price

ACID vs. BASE (from Brewer)

BASE is diametrically opposed to ACID. Where ACID is pessimistic and forces consistency at the end of every operation, BASE is optimistic and accepts that the database consistency will be in a state of flux. Although this sounds impossible to cope with, in reality it is quite manageable and leads to levels of scalability that cannot be obtained with ACID. – Dan Pritchett, eBay

  • Basic Availability
    • works most of the time. Trade-off is really latency rather than availability.
    • if a node fails, some portion of the data will not be available, but the rest of the nodes stay operational
  • Soft-state:
    • Data stores don’t have to be write-consistent, nor do different replicas have to be mutually consistent at all times.
    • The state of the system may change over time, even without input
  • Eventually consistent:
    • Data stores exhibit consistency at some later point
    • Amazon example of ALWAYS being able to write to your shopping cart, even if you end up with several.
    • On checkout Amazon just collects them all and lets the customer sort them out (a merge sketch follows).
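
A minimal sketch (Python, made-up field names) of what “eventually consistent” pushes onto the application: the store keeps every divergent cart replica, and the application merges them at checkout.

# Two replicas accepted writes independently during a partition, so the cart diverged.
cart_replica_a = {"items": [{"upc": 293012429, "qty": 1}]}
cart_replica_b = {"items": [{"upc": 293012429, "qty": 1}, {"upc": 829381427, "qty": 2}]}

def merge_carts(replicas):
    # Application-level reconciliation: union the items, keeping the largest quantity seen.
    merged = {}
    for replica in replicas:
        for item in replica["items"]:
            merged[item["upc"]] = max(merged.get(item["upc"], 0), item["qty"])
    return {"items": [{"upc": u, "qty": q} for u, q in sorted(merged.items())]}

print(merge_carts([cart_replica_a, cart_replica_b]))

The customer still reviews the merged cart and removes anything they did not mean to keep.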

Of course, this more lenient data store simply transfers to the program the responsibility of figuring out any data problems.

For example, what about durability? Surely this is inviolate?

  • Consider in-memory DB’s that support extremely high-throughput apps. Perhaps lazy replication of the data to a disk store is sufficient, and if there’s a crash only a few transactions are lost.
  • Strict-durability systems depend on transaction logs. Logging can be slow without hardware assist, but relying on that hardware opens the possibility that buffered log entries are lost if power fails.

What kinds of applications are like/tolerate this?

  • data acquisition systems - can afford to miss a data point
  • most social media

NoSQL emergence led to another advancement: Polyglot Persistence

Instead of always using an RDBMS, we choose a data storage mechanism based on the nature of the data being stored and how we want to use it.

As a result, an organization may employ a variety of persistent data stores, using whichever one is best for the data and the circumstances.

Primary reasons to consider NoSQL

  1. ability to handle data access with sizes and performance that necessitate a cluster of machines.
  2. improvement of productivity of app development by using a more convenient or intuitive data interaction style.
  3. the 3 V’s of data (volume, velocity, variety), including sparse data
  4. desire for highly distributed systems running on commodity hardware

Shift to aggregates

RDBMS tables cannot hold nested data. The table entries must be simple types and consistent (no variable types).

Nice and consistent, and it enables efficient RDBMS operation, but limiting.

Aggregate orientation assumes that you sometimes want to operate on data with more complex structures than simple scalars of base types.

Aggregates also more closely match the in-memory data structures used by developers.

On-board ERD example: think of the Amazon order system. An RDBMS would use several tables:

  • customer,
  • billing address,
  • shipping address (variable, so must select from a table with a ShippingAddressID),
  • Cart containing Items (each of which is a product with attributes),
  • Each item has shipping preferences (speed, giftwrapping), and payment info
  • payment info is ccard/giftcard/etc.

Show with multiplicities; explain how JOINs will be needed

  • e.g., address will be a separate table FK’ed back to a person relation.

An aggregate system might group them this way (JSON):

SLIDE 3

// customer info collection
{
    "id":42,
    "fName":"ebenezer",
    "mi":"j",
    "lName":"coot",
    "addresses": [
      { "addrType":"billingAddress",
        "mailingAddr":"PO Box 99",
        "city":"Santa Fe",
        "St":"NM"
      },
      { "addrType": "shippingAddress",
        "streetAddr":"42 Wrinkled Way",
        "city":"Taos",
        "St":"NM"
      }]
}
// carts
{
    "id":74829312,
    "customer":42,
    "itemList":[
            {
                "UPC":293012429,
                "price":79.95,
                "name":"apple 85w power adapter",
                "shippingSpeed":"2nd day"
            },
            {
                "UPC":829381427,
                "price":59.95,
                "name":"apple touchpad mouse"
            }
        ],
    "paymentInfo":[
        {
            "ccard":"1234-5678-9876-5432",
            "exp":"0115",
            "ccxact":"111111"
        }
    ]
}


Survey of NoSQL Databases

“You can’t always get what you want.”


Key-Value DB

Slide 4,5

The aggregate is opaque to the database - a blob

Store whatever you want - no structure. The key maps to a value.

  • the “value” can have a type, such as string, set of strings, string then number, etc.

Here the focus is on SCALABILITY, designed to handle massive load

Only access the blob via a key.

You can only search via a key - you cannot search within the blob; it’s “opaque”.

No additional indices to maintain!

Very different from RDBMS, where ad hoc queries are allowed.

The application can store its data in a schema-less manner. It can be stored as a datatype of a programming language, such as an object. There is no fixed data model.

Fields may be added at will and may be nonuniform or even nested.

However, once written, fields may not be updated, only deleted and added again.

They act like large, distributed hash maps where these usually opaque values are stored and retrieved by a key.

Key   Value
1     42
2     "kumquat"
3     ['a','e','i','o','u']
4     "xyz","qr"
5     "ABC",1232,"aString",[9,8,42]
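
A minimal sketch (Python) of the access model: the store is just a big map, retrieval is only ever by key, and the value is an opaque blob the database never looks inside.

import json

store = {}   # key -> opaque bytes; no schema, no secondary indexes

def put(key, value):
    store[key] = json.dumps(value).encode()   # serialize whatever the application hands us

def get(key):
    blob = store.get(key)
    return None if blob is None else json.loads(blob)

put("session:alice", {"theme": "dark", "cart": [293012429]})
put("zip:87501", "Santa Fe, NM")

print(get("session:alice"))   # the whole aggregate comes back
# There is no way to ask the store "which sessions use the dark theme?" -- you
# would have to fetch each key and look inside the blob yourself.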


The Key space of the hash map is spread across multiple buckets on systems in the network

Keys like Usernames, Email addresses, Lat-Long, SSN’s, zip codes, etc., are all typical

For fault tolerance you replicate the buckets on multiple systems.

The various machines are not full replicas of each other, to aid in load-balancing

Some systems do consistent hashing

  • If the target system is down for a WRITE, the hash is temporarily remapped and the data stored elsewhere. When the system comes back up, the data is moved to where it should be (see the sketch below).
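
A minimal consistent-hashing sketch (Python, hypothetical node names): keys and node names hash onto the same ring, each key is stored on the first node clockwise from its hash, and adding or losing a node only remaps the keys in that node’s arc.

import bisect
import hashlib

def ring_hash(s):
    # Hash keys and node names onto the same circular space.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)
        self.points = [p for p, _ in self.ring]

    def node_for(self, key):
        # First node clockwise from the key's position on the ring.
        i = bisect.bisect(self.points, ring_hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-A", "node-B", "node-C"])
print(ring.node_for("session:alice"))

# If that node is down for a WRITE, the client walks to the next node on the ring,
# stores the data there temporarily, and hands it back when the node returns.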

If you don’t have the key, most of these DB’s won’t provide you with a listing

  • So you need to ensure you can get or generate the key (e.g., base it on a timestamp or other external info), or that users will always remember it ;-)

Typical uses

  • Storing session information
  • User preferences
  • Shopping Cart Data

Column-Family Stores

SLIDE 16-7 From Google’s “BigTable” 2006 paper: Bigtable: A Distributed Storage System for Structured Data

Essentially a two-level aggregate store.

A key is used to select a “row” which holds a map of links to the various aggregates

These second-level aggregates are referred to as “Columns”

Queries can pick out specific columns, such as GET('1234', 'name')

The Columns are gathered into “column families” with other columns that are often referenced together.
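
A conceptual sketch (Python, made-up data; real stores have their own APIs): the two-level aggregate is a row key mapping to column families, each holding columns that are often read together. The GET above is shown here with the family named explicitly.

# row key -> column family -> column name -> value
rows = {
    "1234": {
        "profile": {"name": "martin", "email": "martin@example.com"},
        "posts":   {"2013-01-02": "NoSQL Distilled notes", "2013-02-14": "Aggregates"},
    },
    "5678": {
        "profile": {"name": "pramod"},     # rows need not have the same columns
        "posts":   {},
    },
}

def get_column(row_key, family, column):
    # Pick out one column, e.g. get_column('1234', 'profile', 'name')
    return rows.get(row_key, {}).get(family, {}).get(column)

print(get_column("1234", "profile", "name"))   # -> 'martin'
print(rows["1234"]["posts"])                   # or read a whole column family at once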

SLIDE 6,7

Typical uses

Blogs

Store blog entries with tags, categories, links, etc., in separate columns.

Comments can be in the same row, or placed in a separate keyspace.

Usage/license expiration

A column can be set to “expire” in some of these systems, and when that happens it is removed from the DB

HOWEVER, inserts can be expensive, and careful planning of queries is necessary.

Shards

Since these NoSQL data stores are focused on performance, it can be useful to distribute different parts of the data on different servers – sharding.

Something in the key identifies which node will have the data

Each of those servers handles the reads and writes for that particular set of shards

Simple sharding can bring trouble if the server for a set of keys goes down
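
One kind of trouble, sketched below (Python, toy data): with naive hash-mod-N placement, “something in the key” picks the server, but losing one of the N servers remaps most keys, not just the ones that lived on the failed server, which is why ring-based schemes like the consistent hashing shown earlier are preferred.

import hashlib

def shard_for(key, servers):
    # Naive placement: hash the key and take it modulo the number of servers.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return servers[h % len(servers)]

keys = [f"customer:{n}" for n in range(1000)]
before = {k: shard_for(k, ["s0", "s1", "s2", "s3"]) for k in keys}
after  = {k: shard_for(k, ["s0", "s1", "s2"]) for k in keys}      # s3 has gone down

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of 1000 keys now live on a different server")     # roughly three quarters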

The aggregate model offers a convenient way to chop up the data

  • e.g., if certain aggregates are more likely to be accessed from Boston, place them in shards nearby
  • e.g., to balance the load on aggregates equally likely to be accessed, distribute them evenly across the servers.
  • Replication may also be employed for performance and resilience

Example: Lotus Notes

The Secure Distributed Storage idea arises here too.

Graph Databases

Slide 8,9 While the others were motivated by the need to run on clusters, Graph DB’s were motivated by a need for smaller records with complex and dynamic interconnections (relations).

While you can do similar things with RDBMS’s, as the relationships get increasingly complex, the joins required to find what you’re looking for lead to poor performance.

When we store a graph-like structure in a RDBMS, a single relationship isn’t too hard: “Manages”

As we need to add more relationships, it begins to get complicated.

In the RDBMS, we essentially have to think of the traversals we will want, and then to design the RDBMS accordingly

If you want to add/remove relationships dynamically, well you can’t really.

Graph DB’s shift the bulk of the work of navigating the graph to INSERTs, leaving QUERYs as fast as possible.

QUERY is really just a very fast traversal of the graph from some starting point.

The relationship between nodes isn’t calculated during a QUERY, the relationship is always already there.

There is no limit to the number and kind of relationships a node can have.

Relationships can be dynamically created / deleted

Some graph DB systems like Neo4j allow you to specify Java objects as a relationship, while others have simpler relationship capabilities.

Example queries

Slide 10

“FIND AN EMPLOYEE OF BigCO WITH A FRIEND WHO LIKES THE SAME BOOK IN THE DATABASE CATEGORY THAT WAS WRITTEN BY TWO FRIENDS.”

“FIND THE BOOKS IN THE DATABASE CATEGORY THAT ARE WRITTEN BY SOMEONE WHOM A FRIEND OF MINE LIKES” → no match

Example vs. RDBMS

See this simple SQL script (demo in MySQL Workbench)

SLIDE 11,12

Query for BOB’s friends:

SELECT PersonFriend.friend
FROM Person
JOIN PersonFriend ON Person.name = PersonFriend.name
WHERE Person.name = 'Bob';

Query for the inverse: who are friends with Bob? (directional)

SELECT Person.name
FROM Person
JOIN PersonFriend ON Person.name = PersonFriend.name
WHERE PersonFriend.friend = 'Bob';

NOW we ask “Who are Alice’s friends of friends?”

-- One self-join is enough in this schema: pf2.friend is already the friend-of-friend's name.
SELECT pf1.name   AS PERSON,
       pf2.friend AS FRIEND_OF_FRIEND
FROM PersonFriend pf1
  JOIN PersonFriend pf2 ON pf1.friend = pf2.name
WHERE pf1.name = 'Alice'
  AND pf2.friend <> 'Alice';
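
For contrast, a toy sketch (Python, an adjacency map standing in for a graph store): the FRIEND relationships are stored on each node, so answering the friends-of-friends question is a short traversal outward from Alice rather than a self-join over tables.

# Directed FRIEND relationships stored per node, as a graph database would keep them.
friends = {
    "Alice": ["Bob", "Zach"],
    "Bob":   ["Alice"],
    "Zach":  ["Carol", "Dave"],
    "Carol": ["Alice"],
    "Dave":  [],
}

def friends_of_friends(person):
    # Walk outgoing FRIEND edges twice; no tables, no joins.
    return {fof
            for friend in friends.get(person, [])
            for fof in friends.get(friend, [])
            if fof != person}

print(friends_of_friends("Alice"))   # -> {'Carol', 'Dave'}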

Typical uses are apps requiring lots of relationships and very fast QUERYs

  • Connected data, such as social networks (like Facebook) or tracking down members of criminal groups

Routing, dispatch, and location-based services

  • Relationships between nodes can have a “distance” property
  • Using distance and location properties in a graph of points of interest can enable an application to make recommendations of nearby services

Recommendation engines

  • which products are usually bought together
  • which products when bought together should raise an alarm
    • fertilizer bomb components
    • sugar+yeast+fruit juice = wine
  • searching for patterns in relationships that might indicate fraudulent transactions

Document Store DB

First example is a row in a regular RDBMS

Second won’t fit into the same RDBMS since it has different attribute names

This isn’t a problem in a Document Store DB

  • the database can see the structure in the aggregate.

SLIDE 13,14

A document is essentially a list of keys and values. Each field can have 0, 1, or many values

This limits what you can put in it (the content must be in a parseable format), but you gain flexibility in accessing it

queries can be based on the fields within the document aggregate

there is no strict schema for the document content - though you can impose one if you need to

You can search on any of the fields, and retrieval can return parts of a document instead of the whole aggregate

Indices may be built on those fields
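
A minimal sketch (Python, made-up documents and helper names): documents in the same collection need not share attribute names, yet queries and partial retrieval can still target individual fields.

# Two log documents with different attribute names, side by side in one collection.
events = [
    {"_id": 1, "source": "webserver", "path": "/index.html", "status": 200},
    {"_id": 2, "source": "auth",      "user": "alice",       "outcome": "login-failed"},
]

def find(collection, **criteria):
    # Query on any field; documents missing a field simply don't match.
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

def project(docs, fields):
    # Return just parts of each document instead of the whole aggregate.
    return [{k: doc[k] for k in fields if k in doc} for doc in docs]

print(find(events, source="auth"))
print(project(find(events, status=200), ["path"]))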

Typical Uses

Slide 15

Multi-system event logging

  • lots of different formats, depending on what subsystem is making a log entry

Blogging platforms

Web analytics

  • For example, it’s easy to update certain fields of a document as the information being monitored changes
  • page hit counts

Network traffic analyses

Slide 16

Examples of the various kinds of NoSQL Data Stores

For a list of the NoSQL databases of each type see nosql.mypopescu.com—nosql

Activity

Slide 17