Concurrency and inconsistency

Once we have these databases all set up, we hope people will want to use them. We begin to run into problems when many of them want to use the database at the same time.

And it might not be people at all. Multiple continuously running applications may be simultaneously maintaining connections with the database.

As the value of the information in the database rises, so does the number of users and their dependence on (or expectation of) its availability. Thus, in addition to conflicts among simultaneous accesses by clients and applications, you also encounter system needs such as replication, backup/restore, and maintenance.

As we saw earlier in the term, database consistency is very important in most applications.

Recall the example of the ATM and two bank customers trying to withdraw from the same account at nearly the same time. That was an example of what Widom calls attribute-level inconsistency.

/* Client 1 - Deposit */
    UPDATE CheckingAccount
    SET Balance = Balance+100 WHERE AccountNumber=123456;
--------
/* Client 2 - Deposit */
    UPDATE CheckingAccount
    SET Balance = Balance+150 WHERE AccountNumber=123456;

Since these two clients initiate their transactions independently, the steps of the two transactions can occur in different sequences.

  • Client 1 might complete first, and then Client 2 would complete, and everything’s as expected.
  • Alternatively, the steps might interleave:

    1. Client 1 might run first by reading the Balance,
    2. Client 2 runs and reads the Balance.
    3. Then Client 1 runs again, does the +100, and stores Balance.
    4. Client 2 runs again, does the +150, and stores Balance.

Thus there are three different possible values for the Balance after these steps.
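
For example, suppose the Balance started at $1,000 (a made-up figure): run one after the other, the two deposits leave $1,250; with the interleaving above, Client 2's write overwrites Client 1's and leaves $1,150; and if Client 1 happened to write last instead, the Balance ends at $1,100.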

There are other levels of inconsistency: Relation-level inconsistency

/* Client 1 - Give High Cost Area increases to top performers */
UPDATE Employees
SET Hours = Hours * 1.25
WHERE Rating>90;

/* Client 2 - Tri-State hourly rate Increases for full-time workers */
UPDATE Paychecks
SET Rate = Rate * 1.05
WHERE (State='NY' OR State='CT' OR State='NJ') AND Empid IN
    (SELECT Empid FROM Employees WHERE Hours>=40);

Here some employees would get both increases when they only deserve one.
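
For instance (a made-up case): take an employee in NY with Rating 95 and 38 Hours. If Client 1 runs first, her Hours become 47.5, so Client 2's update then raises her Rate too; if Client 2 runs first, she gets only the Hours increase. Which outcome you get depends entirely on which client happened to run first.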

Finally, there is Multiple-Statement inconsistency

/* Client 1 - promote students based on hours earned */
INSERT INTO Seniors (SELECT * FROM Juniors WHERE Hours > 90);
DELETE FROM Juniors WHERE Hours > 90;
...

/* Client 2 - Calculate class sizes */
SELECT COUNT(*) FROM Seniors;
SELECT COUNT(*) FROM Juniors;

It’s unclear which students will be counted where, or even whether they will be counted once!

As we just demonstrated, half-completed or non-sequential transactions can wreak havoc with a database’s consistency and integrity.

So is the answer to force all transactions to be serialized (isolated)?

  • That defeats the purpose of a large database serving many users.
  • We want concurrency so we can have the highest possible performance.

What about multi-threaded or multi-core solutions which could potentially run against the same database? We don’t want to serialize these!

What about system failures?

Stuff breaks … networks, computers, disks, … it happens. Sometimes someone breaks stuff: disgruntled employee, hacker, terrorist, …

We also want the database to have as good a chance as possible of surviving such failures and attacks.

  • So many stock changes happening so fast after XYZ corporation announced their plans to acquire WVU Industries that the database server crashed.
  • The power failed just as the process of archiving the day’s retail sales transactions began.
  • The system might crash in the middle of our update to promote Juniors to Seniors, missing the DELETE to remove the promoted students from Juniors.
  • The disgruntled employee logged into the database server as root and ran the command rm -rf / .

Transactions to the rescue

A Transaction is a Logical unit of work that must be entirely completed or completely aborted.

Our goal is the ability to execute Transactions (SELECT, UPDATE, INSERT, DELETE, etc.) while maintaining our carefully designed and managed DB in its consistent state with all data-integrity constraints satisfied.

A Transaction typically begins automatically as the first SQL statement is run. A COMMIT ends the Transaction.1

Think of it as a timeline:

... COMMIT | SELECT | UPDATE | DELETE | COMMIT ...

A transaction works because it tells the DB that it needs some peace and quiet - some uninterrupted access. This is vital to the concept. Just make sure that you never start a transaction and then wait for some unpredictable amount of time (synchronization with some other process, external state input from a sensor like a thermometer, user input, etc.). It would tie up the DB!!!

So this example from Widom would be bad:

BEGIN TRANSACTION;
-- get input from something or someone
Do some SQL commands using that input;
-- Confirm the results with something or someone
IF (OK?) THEN Commit; ELSE Undo;
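
A safer shape for that logic (just a sketch; the point is where the waiting happens) gathers the unpredictable input before the transaction starts, so the DB is tied up only while the SQL runs:

-- get input from something or someone FIRST, outside any transaction
BEGIN TRANSACTION;
Do some SQL commands using that input;
IF (OK?) THEN Commit; ELSE Rollback;
-- confirm/report the results with something or someone afterwards

If a human really must approve the results before they become permanent, compute as much as possible up front and keep the transaction itself as short as you can.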

MySQL runs with autocommit on by default, which means each statement is treated as its own transaction and its changes are committed to disk as soon as the statement completes. You can turn this off: SET AUTOCOMMIT = 0;

-- extended example from MySQL Docs section 15.7.2.2
DROP TABLE IF EXISTS xacttest_1;
CREATE TABLE xacttest_1 (a INT, b CHAR (20), INDEX (a));

-- Do a transaction with autocommit turned on
START TRANSACTION;
INSERT INTO xacttest_1 VALUES (10, 'Heikki'); -- inside an explicit transaction, so not committed until the COMMIT below
COMMIT;

-- Do two transactions with autocommit turned off.
SET autocommit=0;

START TRANSACTION;                              -- would implicitly commit any pending transaction
DROP TABLE IF EXISTS xacttest_2;                -- DDL: causes an implicit commit
CREATE TABLE xacttest_2 (c INT, d CHAR (20));   -- DDL: causes an implicit commit
INSERT INTO xacttest_2 VALUES (12, 'Samson');

START TRANSACTION;                              -- implicitly commits the work above (the Samson row stays)
INSERT INTO xacttest_1 VALUES (15, 'John');
INSERT INTO xacttest_1 VALUES (20, 'Paul');
DELETE FROM xacttest_1 WHERE b = 'Heikki';

-- Now we undo those last 2 inserts and the delete.
ROLLBACK;
SELECT * FROM xacttest_1;
SELECT * FROM xacttest_2;

MySQL 5.6+ has START TRANSACTION READ ONLY which promises that the actions taken by this script do not modify the database - they are read only.

  • helpful for concurrency

What do transactions need?

Atomicity

All operations of a transaction must be completed. If not, the transaction is aborted. If things don’t go well with the transaction, we want it to abort and clean up everything it changed.

In support of Atomicity, the system maintains a transaction log.

  • Keeps track of all transactions that update the database
  • DBMS uses the information stored in a log for:
    • Recovery requirement triggered by a ROLLBACK statement
    • A program’s abnormal termination
    • A system failure

In SQL, we use ROLLBACK to clean up almost everything that changed. We say almost because ROLLBACK doesn’t clean up:

  • SQL variables that might have been changed,
  • out-of-band actions such as
    • dropping the soda can into the delivery area
    • creation/update of a file or a different database.
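
A minimal MySQL sketch of the first point, reusing the xacttest_1 table from above (user variables like @how_many are not transactional, so ROLLBACK leaves them alone):

SET @how_many = 0;                          -- an SQL (user) variable
START TRANSACTION;
INSERT INTO xacttest_1 VALUES (30, 'George');
SET @how_many = ROW_COUNT();                -- remembers 1, the rows the INSERT touched
ROLLBACK;                                   -- the INSERT is undone ...
SELECT @how_many;                           -- ... but @how_many is still 1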

Consistency

Permanence of database’s consistent state.

DB designers spend a lot of time setting up integrity constraints to ensure the DB is always in a valid state.

The Consistency goal means that any transaction can assume that the DB is in a consistent (valid) state when it starts, and the system will ensure that the DB is in a consistent state when the transaction finishes.
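
As a small illustration (assuming MySQL 8.0.16+, where CHECK constraints are actually enforced; the constraint name is my own), a constraint like this lets every transaction rely on balances never going negative - a transaction whose changes would violate it is rejected, leaving the DB in its prior consistent state:

ALTER TABLE CheckingAccount
  ADD CONSTRAINT chk_balance_nonnegative CHECK (Balance >= 0);

-- This withdrawal now fails (and is rolled back) if it would
-- drive the balance below zero:
UPDATE CheckingAccount
SET Balance = Balance - 500 WHERE AccountNumber = 123456;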

Isolation

Data used during a transaction cannot be used by a second transaction until the first is completed.

We want each client/application to think it’s running fully isolated from all others, but without really doing it (i.e., running them serially)

We strive for isolation by analyzing the potentially concurrent actions to determine an interleaving that has the same results as if the actions were run serially.

Durability

Ensures that once transactions are committed, they cannot be undone or lost.

They will persist across transaction execution, application executions, and even system crashes.

Serializability

Ensures that the schedule for the concurrent execution of several transactions yields consistent results.

We can see that the examples we did earlier could be remedied via serializability.

Serializability means that the actions of transactions may be interleaved, but the end result must be the same as if the transactions were run in some sequential order.

This can bring DB overhead and can reduce concurrency.

A common approach to implementing serialization is via locks within the database that prevent two transactions (or two functions within two transactions) from accessing the same data elements at the same time.

For example, when buying tickets online for a soccer game, once you choose the seats you want there is a little countdown window that appears, showing how long you have to complete the transaction. This is a means of serialization on the Seats resource. Note that it is a lock only on the seats you have begun to buy, not all the seats.

Real Incident: in the middle of a transaction, after selecting seats, my WiFi router pooped out.

I quickly restarted the router and got to the seat-selection page, and the seats I wanted weren’t available. The LOCK had persisted beyond my transaction. NOT GOOD! I had to wait for about 8 minutes for it to expire, then restarted and saw the seats available again.

This example shows how seat selection was allowed to continue, due to the support for transactions and serialization, resulting in the appearance of concurrency.


SLIDE 11-1

ACID+S

The A, C, and D principles apply to all databases, while the I and the S only apply to multi-user databases (ok, or really busy single-user DBs).

What is serializability?

  • Basic Assumption – Each transaction preserves database consistency.
  • Thus, serial execution of a set of transactions preserves database consistency.
  • A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule.
  • Different forms of schedule equivalence give rise to the notions of:
    1. conflict serializability
    2. view serializability

SLIDE 10-6

Take a look at the Transaction log and see how it really works.
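
As a rough, hypothetical sketch (the column names are my own, patterned after the recovery example at the end of these notes; real DBMSs differ), a transaction log can be pictured as a table in which each record holds enough to redo or undo one step of one transaction:

CREATE TABLE TransactionLog (
    TRL_ID        INT PRIMARY KEY,   -- log record ID (the "TRL ID" used in the recovery example)
    TRX_NUM       INT,               -- which transaction this record belongs to
    PREV_PTR      INT,               -- back-link to the transaction's previous log record
    NEXT_PTR      INT,               -- forward link to its next log record
    OPERATION     VARCHAR(10),       -- START, INSERT, UPDATE, DELETE, COMMIT, ROLLBACK
    TABLE_NAME    VARCHAR(30),
    ROW_ID        INT,
    ATTRIBUTE     VARCHAR(30),
    BEFORE_VALUE  VARCHAR(100),
    AFTER_VALUE   VARCHAR(100)
);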

Concurrency control is all about making each user/client feel like their transaction(s) are running alone on the system.

Transactions can get in each other’s way, often without either knowing it!

Review the ways this can happen.

SLIDE 10-7..10-15

It’s quite a challenge to come up with a Scheduler that can determine the best way to interleave transactions without allowing these interferences.

The Scheduler determines the order in which the operations of the transactions are executed, aiming to maintain serializability while supporting concurrency.

The scheduler also tries to optimize the DB’s use of the host system services.

SHOW PROCESSLIST;   -- lists the client threads/statements currently running in the MySQL server

Concurrency using Locks

A lock guarantees exclusive use of the things it locks up. Things like:

  • a section of memory, a network adapter, the CPU cache, etc.

The size of the thing a lock can protect determines the granularity of that lock.

Locking can take place at the database, table, page, row, or even field level, depending upon the DB being used.

SLIDES 10-16..10-19

Field-level locks are the lowest you can go … concurrent transactions can access the same row IFF they don’t access the same attribute (column). If you think of the DB table as a spreadsheet, this is individual cell protection.

Smaller lock granularity improves the concurrency of the database by reducing contention to lock database objects. However, smaller lock granularity also means that more locks must be maintained and managed by the DBMS, instead of by the operating system (e.g., page locks), requiring more processing overhead and system resources for lock management.

Concurrency demands and system resource usage must be balanced to ensure the best overall transaction performance. In some circumstances, page-level locks, which require fewer system resources, may produce better overall performance than field-level locks, which require more system resources.

Most DB implementations stop at row-level access.
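
To make granularity concrete, here is a hedged MySQL sketch (assuming the InnoDB engine and the Employees table from earlier). The first block takes a table-level lock; the second relies on row-level locks, so other clients can still work on non-matching rows:

-- Coarse granularity: table-level lock; other sessions cannot touch Employees at all.
LOCK TABLES Employees WRITE;
UPDATE Employees SET Hours = Hours * 1.25 WHERE Rating > 90;
UNLOCK TABLES;

-- Finer granularity: InnoDB row-level locks on (roughly) just the rows examined.
START TRANSACTION;
SELECT * FROM Employees WHERE Rating > 90 FOR UPDATE;
UPDATE Employees SET Hours = Hours * 1.25 WHERE Rating > 90;
COMMIT;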

Kinds of locks

You need to be able to TEST the lock, SET the lock, and OPEN the lock.

SLIDE 10-21

Binary Lock Example

  • Exclusive locks are just that - exclusive.

    These exist when access to a data item is specifically reserved for the transaction that locked the object. The exclusive lock must be used when a potential for conflict exists, e.g., when one or more transactions must update (WRITE) a data item. Therefore, an exclusive lock is issued only when a transaction must WRITE (update) a data item and no locks are currently held on that data item by any other transaction.

  • Shared locks are used when concurrent transactions are each allowed READ access and no exclusive locks exist on the resource.

Locks can operate in two modes:

  • exclusive (X) mode. The protected item can be read and written. lock_x
  • shared (S) mode. The protected items can only be read. lock_s

Either way, locks are requested from and managed by the DBMS.

When requesting a lock, the DBMS either grants the request or not based on what other locks already exist on the item. If it cannot grant the request, the requesting client is blocked on the lock. This is an important point.
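
A hedged MySQL/InnoDB sketch of that blocking, using the CheckingAccount table from earlier (LOCK IN SHARE MODE is spelled FOR SHARE in MySQL 8.0):

-- Client A: shared (read) lock on one row
START TRANSACTION;
SELECT Balance FROM CheckingAccount
WHERE AccountNumber = 123456 LOCK IN SHARE MODE;

-- Client B, in a second session: asks for an exclusive lock on the same row
START TRANSACTION;
SELECT Balance FROM CheckingAccount
WHERE AccountNumber = 123456 FOR UPDATE;
-- B blocks here until A commits or rolls back (or B's lock wait times out).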

Silberschatz summarizes lock compatibility in a table:

              lock_s requested   lock_x requested
lock_s held   OK                 NOT OK
lock_x held   NOT OK             NOT OK

A given item may be unlocked, or it may have one or more locks on it, held by one or more clients.

A lock request may be granted if the request is compatible with other locks already on the item.

A simple example (also from Silberschatz):

T1 action
  lock_s(A)
  Read (A)
  unlock(A)
  lock_s(B)
  Read(B)
  unlock(B)
  print(A+B)

This seems good, but does it enforce serializability? No: another transaction could update A or B between our unlock of A and our read of B, so the printed A+B might not correspond to any consistent state of the database.

Two-Phase locking (2PL)

This is an orderly way of acquiring and freeing locks.

SLIDE 10-22

Two phases

  • growing phase
    • a transaction acquires all the required locks without unlocking any data.
    • once all locks have been acquired, the transaction is in the locked state.
  • shrinking phase
    • a transaction releases its locks and cannot acquire any new ones.

In addition the 2PL protocol has three rules:

  1. Two transactions cannot have conflicting locks.
  2. No unlock operation can precede a lock operation in the same transaction
  3. No data are affected until all locks are obtained - until the transaction is in the locked state.
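
Applied to the earlier Silberschatz example, a 2PL-compliant T1 would acquire both locks before releasing either one, so no other transaction can slip an update in between the two reads:

T1 action
  lock_s(A)
  Read(A)
  lock_s(B)
  Read(B)
  print(A+B)
  unlock(A)
  unlock(B)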

Deadlocks

Now, just because you use locks doesn’t mean the resulting schedule is serializable … and locks can sometimes create DEADLOCKS or a deadly embrace.

A related challenge is starvation:

  • as in T1 waiting for a lock_x(A) while many lock_s(A) requests are granted to a long line of other transactions.

And here is a schedule that ends in a deadlock:

T1           T2
lock_x(B)
read(B)
B = B - 1
write(B)
             lock_s(A)
             read(A)
             lock_x(B)   -- T2 waits: T1 holds lock_x(B)
lock_x(A)    -- T1 waits: T2 holds lock_s(A)

A deadlock occurs when two transactions are waiting for each other to unlock data before proceeding.

SLIDE 10-23

We have various techniques for dealing with deadlocks.

  • avoidance schemes like 2PL work well.
  • detection schemes that periodically check the DB for deadlocks, abort any deadlocked transactions, and reschedule them.
  • prevention schemes wherein a transaction requesting a new lock is aborted if there is a chance that a deadlock would occur. All its previously obtained locks are freed and the transaction is rescheduled.
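
For reference, MySQL’s InnoDB engine uses a detection approach: when it detects a deadlock it picks one transaction as the victim and rolls it back, returning error 1213 (“Deadlock found when trying to get lock; try restarting transaction”) to that client. You can inspect the most recent occurrence:

-- The "LATEST DETECTED DEADLOCK" section of this report describes the
-- two transactions involved and which one was rolled back.
SHOW ENGINE INNODB STATUS;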

Time Stamping methods of concurrency control

A time stamping approach uses system wide, guaranteed unique time stamps to order transactions. These timestamps have two required properties:

  • unique
  • monotonically increasing

All the DB operations of a transaction must have the same timestamp. The DB executes transactions in strict timestamp order, ensuring serializability.

The disadvantage is the added overhead in both storage and computation, since each data item must carry two timestamps:

  • last time read.
  • last time written

wait/die and wound/wait schemes

SLIDE 10-24

In the wait/die scheme, the older transaction (lower timestamp) waits for the younger one to complete and release its locks.

  1. If the T requesting the lock is older, it will wait for the younger T to complete and release its locks.
  2. If the T requesting the L is younger, it will ROLLBACK, giving up its locks, and be rescheduled.

In the wound/wait scheme, the older transaction rolls back the younger transaction and reschedules it.

  1. If the T requesting the lock is the older of the two, it will preempt (wound) the younger transaction by rolling it back. The younger T is rescheduled using the same time stamp.
  2. If the T requesting the lock is the younger of the two, it will wait until the other T is completed and its locks released.
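
A quick worked case (made-up timestamps): let T_old have timestamp 10 and T_young have timestamp 20, and suppose T_old requests a lock that T_young holds. Under wait/die, T_old simply waits. Under wound/wait, T_old wounds T_young: T_young is rolled back (keeping timestamp 20) and T_old gets the lock. If instead T_young requests a lock held by T_old, then under wait/die T_young dies (rolls back and is rescheduled), while under wound/wait T_young waits.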

Other factors affecting recovery

  • Deferred-write technique or deferred update.
    • Ensures that transaction logs are always written before the data are updated.
  • Redundant transaction logs.
    • Ensure that a physical disk failure will not impair the DBMS’s ability to recover data.
  • Buffers.
    • Temporary storage areas in primary memory.
  • Checkpoints.
    • Allow the DBMS to write all its updated buffers in memory to disk.

Isolation levels

If a transaction has written some data to the DB but not yet committed it, that data is Dirty Data.

If then another transaction reads that data, that is called a Dirty Read.

You can imagine the issues here.

Since SERIALIZABLE has costs, SQL lets us specify weaker isolation levels that modify how READs are done. These settings are per transaction, so they only affect the transaction they’re part of.

  1. READ UNCOMMITTED - Dirty Reads are OK. In general, SQL assumes transactions are READ WRITE, except when you allow Dirty Reads; since that’s so risky, SQL then assumes the transaction is READ ONLY unless you specifically override it.
  2. READ COMMITTED - Forbids reading Dirty Data
  3. REPEATABLE READ - Implies that repeated reads of the same tuple will have identical results. Reasonable, except that the query might return phantom tuples due to some other updates to the database. This is the default setting for MySQL.
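
In MySQL, you can choose a level for just the next transaction or for the whole session; a small sketch:

-- Applies only to the next transaction started in this session:
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
START TRANSACTION;
SELECT COUNT(*) FROM Seniors;
COMMIT;

-- Applies to every later transaction in this session:
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;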

REPEATABLE READ aims to ensure that reads within the same transaction read the snapshot established by the first read – read consistency. This means that if you issue several plain (nonlocking) SELECT statements within the same transaction, these SELECT statements are consistent also with respect to each other.

Here’s a REPEATABLE READ use case:

Suppose you have a relation R

$R(\underline{id}\ \text{INTEGER},\ type\ \text{VARCHAR(8)})$

and a transaction $T_1$ with REPEATABLE READ enabled (which is the default for MySQL):

-- Transaction log
START TRANSACTION;
SELECT ... ;
-- Begin some complex calculation that uses the following result
SELECT COUNT(*) FROM ENROLLMENT WHERE ClassDept = 'CompSci';
-- do some other stuff, then get that same result again to
-- finish the calculation, and this count had better be the
-- same as before!
SELECT COUNT(*) FROM ENROLLMENT WHERE ClassDept = 'CompSci';
-- more stuff
COMMIT; -- This ends the transaction


Then, while this transaction is running, a second transaction $T_2$ is begun which may add new CompSci enrollments. The $T_2$ transaction runs to completion and commits before $T_1$ finishes. With the $T_1$ transaction set to REPEATABLE READ, its calculations will be unaffected by these concurrent inserts to the ENROLLMENT table.

It would work the same if there were DELETEs instead of INSERTs.

Perhaps the ticket purchase app I was using uses READ COMMITTED:

  1. transaction with READ COMMITTED begins
  2. customer chooses seats
  3. transaction commits the selection with the PENDING value set TRUE.
  4. a Trigger fires because of the COMMIT and PENDING=TRUE that sets a timer to run while updating the little “Time left” window.
  5. other transactions cannot see those seats because of the COMMIT.
  6. if the timer completes, another TRIGGER is initiated which reverses the transaction and COMMITS.

SLIDE 10-25

So in summary (From Garcia-Molina):

Isolation Level     Dirty Reads   Nonrepeatable Reads   Phantoms
READ UNCOMMITTED    OK            OK                    OK
READ COMMITTED      NOT OK        OK                    OK
REPEATABLE READ     NOT OK        NOT OK                OK
SERIALIZABLE        NOT OK        NOT OK                NOT OK

Checkpointing

Periodically a DB will stop processing transactions and write everything that’s pending to the disk. This is checkpointing, and it’s noted in the log.

Thus, after a failure the DB recovers by:

  1. find the previous checkpoint in the log.
  2. assume everything before that was written to disk.
  3. Go through and redo everything after that checkpoint.

Recovery examples from the text

SLIDE 10-26

  1. Find the last checkpoint (TRL ID 423)
  2. Since the second to last checkpoint, review the transactions
  3. T101 started & finished before checkpoint at 423, so it’s fine
  4. For each transaction committed after the last checkpoint, the DB will use the transaction log to write changes to storage, using the “after” values. Example for T106.
    a. find the associated COMMIT (TRL ID 457).
    b. Use the back-link to find the beginning of the transaction (TRL ID 397).
    c. Work through the steps to recreate the transaction’s effects.
    d. Do this for the remaining transactions committed after the last checkpoint (T155).
  5. Any other transactions that ended with Rollback or were still active when the crash occurred are ignored.

  1. Several other statements like BEGIN ..., CREATE ..., DROP ..., and others can also end a transaction. See the MySQL Documentation for a full list.