SQL

For this next part we will use the database northwind.sqlite3. To get started, download the file and upload it to c9.io. Then enter sqlite3 northwind.sqlite3 from the same directory where the file is found at the c9 prompt. The command .tables will provide you a listing of all of the tables and the command .schema will provide you the structure of each table.

Exercise: Display all of the rows of the table Customers and write an sql command that counts the total number of rows.

Distinct

A column of a table may contain duplicate values. In our Customer table, for example, there are multiple customers from the same city. We can see this by using the command SELECT City from Customers; If, however, we only wanted a list of distinct cities, the results of this command would return duplicates. We can use the qualifier DISTINCT to return distinct rows: SELECT DISTINCT City from Customers; will return one row per city.

Order by

When selecting rows from a table, there is – somewhat surprisingly – no guarantee in what order your data will be returned. You can, however, force the rows to be sorted based on one or more columns. The command SELECT * FROM Customers ORDER BY Country;, for example, will select all of the rows of the Customers table and return them sorted by country name, starting with “Argentina” and ending with “Venezuela”. The default sorting order is ascending (0…9 or A…Z). The sort order can be specified explicitly using the ASC or DESC option:

SELECT * FROM Customers ORDER BY Country ASC;

SELECT * FROM Customers ORDER BY Country DESC;

Although ascending is the default, I prefer to always specify the sort order so that my code does not require someone reading it to know the default behavior. You can also sort on more than one column by specifying two or more, comma delimited, column names. This command, for example, sorts first by “Country” and then by “City”.

SELECT * FROM Customers ORDER BY Country ASC, City ASC;

Update

Recall that we can add a row to a table using the INSERT command.

Exercise: Add a row to the Customers table with your personal information. This table has 11 columns: CustomerID, CompanyName, ContactName, ContactTitle, Address, City, Region, PostalCode, Country, Phone, Fax. You can use a random 5-letter code for the CustomerID. To verify that you inserted this row correctly, select the row of Customers with your specified CustomerID (or any other unique column).

In addition to inserting an entirely new row into a table, we can also update specific columns of a previously existing row. For example, I can update my phone number from the Customers table using the Update command:

UPDATE Customers SET Phone='603-646-1000' WHERE ContactName='Hany Farid';

The SET portion of this command specifies a column name (Phone, in our example) and the new value of this column. The WHERE portion of this command specifies which row(s) are to be updated. Be very careful with this command. If the WHERE portion of the command specifies more than one row, then all rows satisfying this match will be udpated. And, if no WHERE is specified then all rows will be updated!

You can also update more than one column’s data by specifying two or more, comma delimited, column names. Ths command, for example, updates both the Phone and Fax.

UPDATE Customers SET Phone='603-646-1000', Fax='603-646-2000' WHERE ContactName='Hany Farid';

Delete

Deleting rows from a table is effectively the same as searching a database. We start by specifying a table and search criteria and then delete (instead of print) the resulting row(s). For example, this command will delete France from the world table.

DELETE FROM Customers WHERE ContactName='Hany Farid';

You can verify that this row has been deleted by evaluating this search which will not return anything.

SELECT * FROM Customers where ContactName='Hany Farid';

If your search returns multiple rows, then all rows will be deleted. For example, this command deletes all rows whose country name starts with the letter “S”.

DELETE FROM Customers WHERE Country LIKE "S%";

In addition to removing rows from a table, you can delete an entire table using the DROP command with the following command DROP TABLE <table>.

Be careful when deleting – there is no undo button!

Date

Storing, searching on, and comparing dates is fairly common to most database applications. In SQLite, there is no explict date data type (dates are stored as “text”) and there are some quirky aspects to how dates are stored that can impact your ability to compare dates.

SQLite supports three variables CURRENT_DATE, CURRENT_TIME, and CURRENT_TIMESTEP. The variable CURRENT_DATE contains the current date in the format YYYY-MM-DD, the variable CURRENT_TIME contains the current time in the format HH:MM:SS, and the variable CURRENT_TIMESTEP stores the current date and time in the format YYYY-MM-DD HH:MM:SS.

Because all dates are stored with the data type “text”, it is possible to mix and match different date formats. For example, let’s build a table with three columns, “id”, name“, and”date“, where”id" is a unique auto-incremented identifier:

CREATE TABLE People (id integer PRIMARY KEY AUTOINCREMENT, name text, date text);

Let’s now insert three rows with three different time/date formats.

INSERT INTO People VALUES (NULL, 'Alice', CURRENT_DATE);

INSERT INTO People VALUES (NULL, 'Bob', CURRENT_TIME);

INSERT INTO People VALUES (NULL, 'Charlie', CURRENT_TIMESTAMP);

The command SELECT * FROM People; yields the following:

1|Alice|2017-02-26

2|Bob|17:17:15

3|Charlie|2017-02-26 17:17:15

The difference between the date and time format are obvious. The difference between the date and timestamp may be less obvious. In particular, if we want to find all entries with a specific date, we might use the following command:

SELECT * FROM People WHERE date='2017-02-26';

This command, however, will only return one row:

1|Alice|2017-02-26

This is because the comparison of a date requires an exact match, and a timestamp contains both the date and time and so it does not match against the date string ‘2017-02-26’. If you only need the date, then it is best to use CURRENT_DATE. If you need both the date and time, then it is best to create two separate columns and store the CURRENT_DATE AND CURRENT_TIME separately. This will give you maximum flexibility to search for specific dates or date ranges.

Exercise solutions

Print all rows of the table Customers.

SELECT * from Customers;

Count the number of rows in the table Customers (there are 91).

SELECT COUNT(*) from Customers;

Add a row to the Customers table with your personal information. To verify that you inserted this row correctly, select the row of Customers with your specified CustomerID.

INSERT INTO Customers VALUES ('ABCDE', 'Dartmouth', 'Hany Farid', 'Professor', '6211 Sudikoff Lab', 'Hanover', 'NH', '03755', 'USA', '603-646-2761', '603-646-1672');

SELECT * FROM Customers WHERE CustomerID='ABCDE';

JOIN

Most databases will contain multiple tables with related data (hence the term relational database). In the database northwind, for example, there is a Customers table and an Orders table. We can see the column names/dataypes of each table using the .schema command. At the sqlite prompt, type .schema Orders and .schema Customers.

Note that the Orders table and the Customers table each share a common column called CustomerID. In the Customers table, this is the primary key while it is simply another column in the Orders table. This common field allows us to combine information from both tables. For example, the Customers table contains the CompanyName which is not included in the Orders table. The JOIN command is used alongside the familiar SELECT command to combine information from two tables.

The basic structure of a JOIN command is as follows:

SELECT table1.column1, table2.column2 FROM table1 INNER JOIN table2 ON table1.common_field = table2.common_field;

This command compares each row of table1 with each row of table2 to find all pairs of rows which satisfy the condition that the command_field in table1 is equal to the common_field in table2. When this condition is satisfied, column1 from table1 and column2 from table2 are combined into a single row of the results.

This command returns the CompanyName from the Customers table and the OrderDate from the Orders table by matching the CustomerID across these two tables.

SELECT Customers.CompanyName, Orders.OrderDate FROM Customers JOIN Orders ON Customers.CustomerID=Orders.CustomerID;

SQL vs. NoSQL

There are two main types of databases: SQL and NoSQL, or relational and non-relational databses. The primary difference between these databases is how they are built, the type of information they store, and how they store this information. Relational databases are highly structured – data are stored in a table with pre-specified columns and data types. A non-relational database does not incorporate a table model but instead stores data in a single unstructured document file.

An SQL (relational) database is the more rigid, structured way of storing data. A relational database consists of tables with columns and rows. Each row represents an entry, and each column stores a specific type of information, like a name, address, or phone number. The data stored in a relational database has to be structured in a specific and organized way. This structure has the advantage that it guarantees consistency of your data and it easy to retrieve related data stored across multiple tables. This rigidity, while advantageous in several ways, can also cause problems. A column designed to store U.S. phone numbers, for example, might require 10 digits because that’s the standard for phone numbers in the U.S. This has the advantage of ensuring that all of your data are in exactly the same format. If, however, you need to change the format to include an international phone with more than 10 digits, then the entire database needs to be updated.

A NoSQL (non-relational) database, on the other hand, is an unstructured way of storing data. A non-relational database stores data in a single document that can be easily found, but isn’t necessarily categorized into specific fields (columns) like a relational database. For example, a relational database may store a phone book in a table with three columns corresponding to name, address, and phone number. A non-relational database may store a phone book as a series of folders within each may be stored a name, address, phone number and optionally a photo, contact person, driving directions, map, etc. That is, each entry in a non-relational database need not have a perfectly consistent data format.

There are no hard and fast rules on which type of database to use (and databases are emerging that are hybrid versions of SQL and noSQL databases). As a general guideline, you would choose a SQL database if your data is structured, unchanging, and no expanding rapidly. You should choose a NoSQL database if you are storing large volumes of data with little or no structure, if you need to take advantage of cloud computing and storage (which can be tricky with a SQL database), and if the nature of your data is changing rapidly.

Database search

One of the most basic problems in computer science is finding the index of an item in a database. Let’s assume the list is sorted. Maybe it’s a list of country names, sorted alphabetically.

countries = [“Afghanistan”, “Albania”, “Algeria”, “Andorra”, “Angola”, “Anguila”, “Argentina”, “Armenia”, “Aruba”, …, “Zimbabwe”]

Let’s say you’d like to find the index of a country in the list. (Perhaps “Yemen” or “Afghanistan”.) I can think of at least three approaches.

Random search. Pick a random index. Check whether that location contains the item you were looking for. If not, repeat.
Linear search. Loop over the items in the list, starting at the index at index 0, then the item at index 1, index 2, and so on until you find the country name you are looking for, or you have exhausted the entire list without finding it.
Binary search. (Possible only because the list is sorted.) Maintain indices into a sublist under consideration. If the sublist is ever empty, then the country name is not found. Otherwise, divide the current sublist in half. Check whether the item at the midpoint is what we’re looking for. If so, return that index. Otherwise, check to see whether the midpoint item is greater than what we’re looking for or less than what we’re looking for. Based on the results of the comparison, discard either the first or second half the sublist, updating the indices demarcating the sublist appropriately, and repeat with the new sublist.

Let’s think through the efficiency (and correctness) of each of these approaches. A good way to think of efficiency is to consider how much time is required to run the algorithm in the “best case” and in the “worst case”.

Random search. In the best case, we get lucky, and the item is the first one we look at. The time taken would be however long it takes to examine a single item in the list. In the worst case, however, random numbers are picked forever, and the item is never found. Moreover, if the item is not in the list at all, we’ll never find it by picking random indices and therefore will not know when to stop. In this, admittedly contrived case, the worst case is very bad and wrong!
Linear search. In the best case, the item we are looking for happens to be the first item of the list, just like the best case for random search. In the worst case, the item we want is at the end of the list or not present in the list at all. If the list contains n items, we will have to examine n items.
Binary search. In the best case, the item we are looking for is at the midpoint of the entire list, and the item is found in the first place we look, just like the best cases of random search and linear search. In the worst case, the algorithm has to divide the list in half over and over again until the remaining sublist is empty. How many times can a list of length n be divided in half before the sublist is empty? The answer turns out to be one more than log2(n), the “base-2” logarithm of n. This number is much less than n, for large n.

It is worth taking a moment to consider the difference between linear search (worst case we have to perform n operations) and binary search (worst case we have to perform log_2(n) operations). Consider the following table that shows in column 1 the size of the list (i.e., database) that we are searching, the maximum number of comparisons for linear search, and the maximum number of comparisons for binary search.

n	linear	binary
4	4	2
8	8	3
16	16	4
32	32	5
64	64	6

Notice that the cost of running linear search directly matches the size of the database (this is why we call it linear search). On the other hand, each time we double the size of the database, the cost to binary search is only 1 extra comparison. That is amazing and, of course, highly desirable.

Notice, however, that unlike linear search, binary search requires that the data be sorted. It is natural, therefore, to ask what is the cost of sorting a list of strings (or numbers). There are many approaches to sorting data (here is a beautiful “15 sorting algorithms in 6 minutes” https://www.youtube.com/watch?v=kPRA0W1kECg). Without going into detail on how they work, I will just say that a list of length n can be sorted in worst case time of nlog2(n) — that is n times the base-2 logarithm of n. This is a bit more expensive of just n (since n is being multiplied by a value greater than 1). But, once sorted, an element can be added to a sorted list in log_2(n) time and so it is well worth the effort to sort the list in order to allow searching to be done so efficiently.

Can we do better then log2(n)? Dictionaries (aka hash tables) can search in constant time — that means that no matter how large the database, searching a dictionary takes a fixed number of steps that is completely independent of the size of the information that we are searching. How is that possible? Here is the basic idea of how dictionaries work. When you search a dictionary for a string (I’ll call that the “key”), the first thing that happens is that some numeric value is computed for the key; no matter what type the key is—a number, a string, an object reference, you name it—some numeric value is computed based on the key. That numeric value is used to index into the dictionary. Since both the computation of the value and the indexing into the dictionary take constant time, we can (usually) locate the value in constant time.

The first step of indexing into the dictionary by key is converting the key to some numeric value, called the hash code. To convert the key to a hash code, we use a hash function that takes the key as input, and somehow generates a number. Let’s look at a javascript hash function that computes a number based on a key that is a string.

This function uses the builtin Javascript function charCodeAt takes a string of length 1 and returns an integer value. For example, charCodeAt("a") is 97, and charCodeAt("b") is 98, etc. We add up all these numeric values of the individual characters in the string to get some number. Some hash functions are better than others. This one is not very good. There are two properties we would like a hash function to have: (1) It should distribute keys evenly, or uniformly throughout the list (also called the hash table); (2) It should minimize the collisions between keys: keys that have the same hash code. The function we just wrote could work as a hash function, but it is bad from both perspectives. First, does the code distribute keys evenly throughout the table? No. Most strings will have multiple letters, and I would expect small indices in the table to be used infrequently. Most words are limited in length, however, and so very large indices in the table will not be used at all. Second, does the function minimize collisions? No. Two words that are anagrams (you can rearrange the letters in one to get the other) will have the same hash code. For example, the following anagrams will have the same hash code: “opts”, “post”, “pots”, “spot”, “stop”, “tops”. And there are many other words that we expect to have the same hash value using this function, such as “pat” and “let”. There are much better hash functions, but we won’t go into more detail here.

PHP meets SQL

Although we have only scratched the surface of SQLite’s functionality, we have covered enough material to show how we can tie together a fully functioning website that includes HTML/Javascript (client), PHP (server), and a server-side database (SQLite) for storing user information.

Let’s build a simple website that allows a user to enter their name and email that is then sent to your server where the data is stored in a SQL database. Recall from the previous lecture that we can collect and send user data using a form and a submit button.

This bit of HTML code should be familiar. The form contains two text input fields that allows a user to enter their name and email. The button is of type “submit” which means that when it is selected the server-side action script “submit.php” is sent the data in the form.

Before we start working on the server-side PHP script, let’s think about the database that will store our user data. At a minimum, we will need two columns to store the name and email. These will each be of type text. Because it may be useful later on to know when a user signed-up, let’s also store the data. There are several different formats for storing the data, but for simplicity, we will just store it as text in the form [Month Day, Year, Time]. We also have to decide if we want a unique primary key. Since the combination of a person’s name, email, and date of submission should be unique, we might consider not having a unique identifier. However, even if it is not necessary, I like to add the unique identifier because it adds very little overhead to the size of the table and gives us an absolute guarantee that there will be no conflict in the database with identical entries.

To start let’s create a database called myusers by entering the following command at the cloud 9 prompt:

sqlite3 myusers

And now let’s create a table of the same name with four columns. At the sqlite> prompt enter:

CREATE TABLE myusers (id integer PRIMARY KEY AUTOINCREMENT, name text, email text, date text);

You can now quit SQLite by entering .exit at the prompt.

We now need to turn our attention to the server-side PHP script that will receive the form from the client, extract the user name and date, and add this information to the SQL table that we just created on our server.

We’ll start by extracting the user name from the post-form sent by the client (the strings “name” and “email” correspond to the name of the input fields specified in the client-side HTML code. We extract these using the $_POST[...] construction and store the results in the variables name and email (recall that in PHP, variables all begin with a $).

<?php

$name  = $_POST["name"];
$email = $_POST["email"];

?>

We now need to figure out the date and time. For simplicity we will specify time in the server’s timezone (it is possible to determine the timezone of the client). PHP has a built-in function date that returns a string containing the date and time. The parameters for this function specify the formatting for the date and time: “F” returns the month spelled out (January,…,December); “j” returns the day of the month without leading zeros (1,…,31); “Y” returns a 4-digit year (1999, 2016, etc.); “g” returns a 12-hour format of the hour without leading zeros (1,…,12); “i” returns the minutes with leading zeros (00,…,59); and “a” returns “am” or “pm”. We store the result of this function call in a new variable date (not to be confused with the function date).

<?php

$name  = $_POST["name"];
$email = $_POST["email"];
$date  = date("F j, Y, g:i a");

?>

At this point, we have all of the data that we need. Let’s now see how to insert this database into the table that we created earlier. We first have to specify that we want to access the “myusers” database. PHP has built-in functionality for accessing SQL and SQLite databases. We can open the database using the line of code $db = new SQLite3('myusers'); The variable db provides us access to all of the tables stored in this database. As we will see next, with this variable, we can perform all of SQLite functions described above.

The last line in the script below inserts a row into the table consisting of the user name, email, and date. The form of any SQLite command is $db->query(<command>);, where command is any of the SQLite commands described above. Unlike the commands above, we need to combined the command name (INSERT INTO…) with the variables containing the data. To do so, the parameters to the SQLite insert function are constructed by concatenating (.) the basic structure of the command (INSERT INTO …) with the three variable holding the data.

<?php

$name  = $_POST["name"];
$email = $_POST["email"];
$date  = date("F j, Y, g:i a");

$db    = new SQLite3('myusers');
$db->query( 'INSERT INTO myusers VALUES( NULL,"' . $name . '", "' . $email . '", "' . $date . '")' );

?>

You may have noticed the extra quotation marks (“) in the construction of the parameter list. This is because we want to put double quotes around each parameter so that if they contain commas (as in the date) this won’t get confused with the comma delimited parameter list. The final command will look something like:

INSERT INTO myusers VALUES( NULL,"Hany Farid", "farid@cs.dartmouth.edu", "March 3, 2016, 7:51 pm")

Recall that we need to send some content back to the server, so let’s send back a thank you message.

<?php

$name  = $_POST["name"];
$email = $_POST["email"];
$date  = date("F j, Y, g:i a");

$db    = new SQLite3('myusers');
$db->query( 'INSERT INTO myusers VALUES( NULL,"' . $name . '", "' . $email . '", "' . $date . '")' );

print $name . ", thanks for signing up on " . $date;

?>

At this point we could be done, but it is always a good idea to do some error checking to make sure that the insertion into the database was successful. We can do this because the SQLite insert command returns a boolean value specifying if the insertion was succesful. We can then print different messages depending on whether the insert was successful or not. In the PHP code below, we store the boolean value in the variable q and then use a conditional (if-else) to determine what to return to the user.

<?php

$name  = $_POST["name"];
$email = $_POST["email"];
$date  = date("F j, Y, g:i a");

$db    = new SQLite3('myusers');
$q     = $db->query( 'INSERT INTO myusers VALUES( NULL,"' . $name . '", "' . $email . '", "' . $date . '")' );

if( $q ) {
    print $name . ", thanks for signing up on " . $date;
} else {
    print "oops, something went wrong, our crack IT staff is working on the problem.";
}

?>