Everyone in the trade press is required to write something about the Web and the Internet in their column. It should deal with commercial pornography, child molesting, international terrorism, and all the stuff that sells magazines.
Well, I don't have anything on that right now. I just haven't been out terrorizing child molesters with pornography with my stolen credit card numbers for a while.
In fact, I am in the mood to write one of my boring technical columns about SQL programming tricks. How can I fool my editor into thinking that I am on top of the right trends? The Internet is a network, and networks are modeled with graphs, so I am doing a column on representing graphs in SQL. Yeah, that's the ticket!
The edges can be directed or undirected. A directed edge shows a relationship between the two nodes that runs in one direction only, such as the path of an escalator between two floors. An undirected edge shows a symmetric relationship between the two nodes it connects, such as a staircase that runs both ways between two floors. Edges can be named themselves (for example, "Highway I-20"), but in most data processing examples an edge is identified by its initial and terminal nodes in what is called an adjacency table. Let's build two tables, one for the nodes, which we will make cities, and one for the graph itself, which will be Internet connections between them:
CREATE TABLE Nodes (node_nbr INTEGER NOT NULL PRIMARY KEY, city_name CHAR(20) NOT NULL); CREATE TABLE Graph (init INTEGER NOT NULL REFERENCES Nodes(node_nbr) ON DELETE CASCADE, term INTEGER NOT NULL REFERENCES Nodes(node_nbr) ON DELETE CASCADE);
To keep information on the edges, I would need another table:
Remember that I am storing only the actual edges, so the table will probably be much smaller. But even if it is not, the rows are only two integers wide, so the physical size will be fairly small. To see how this works, let's start off with a small table of unconnected nodes. First let's decide on the cities in the network:
INSERT INTO Nodes (1, 'Atlanta');
INSERT INTO Nodes (2, 'Boston');
INSERT INTO Nodes (3, 'Chicago');
INSERT INTO Nodes (4, 'Detroit');
INSERT INTO Nodes (5, 'Evanston');
Then we need to make sure that they can at least talk to themselves:
INSERT INTO Graph VALUES (1, 1);
INSERT INTO Graph VALUES (2, 2);
INSERT INTO Graph VALUES (3, 3);
INSERT INTO Graph VALUES (4, 4);
INSERT INTO Graph VALUES (5, 5);
The general code for this operation should be put into a stored procedure. Skipping over the error handling and other fine points, the procedure looks like this:
CREATE PROCEDURE NewNode (IN new_node INTEGER, IN city_name CHAR(20)) BEGIN ATOMIC INSERT INTO Nodes VALUES (new_node, city_name); INSERT INTO Graph VALUES (new_node, new_node); END;To connect two nodes, I must build this table with node_nbr pairs to avoid NULLs and to follow the first constraint. (See Figure 1.) Again skipping over the error handling and other fine points, the procedure looks like this:
CREATE PROCEDURE NewEdge (IN new_node_1 INTEGER, IN new_node_2 INTEGER) BEGIN ATOMIC INSERT INTO Graph (init, term) VALUES (new_node_1, new_node_2); INSERT INTO Graph (init, term) VALUES (new_node_2, new_node_1); END;Right now the graph has no edges, so I'll make some:
CALL New Edge(1,2);
CALL New Edge(3,4);
The graph and the Graph table now looks like this:
| Graph | |
| init | term |
| ============ | |
| 1 | 1 |
| 1 | 2 |
| 2 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
A condition that is not quite as strong as completeness is the transitive closure of a graph. This is the state of our graph right now. The definition can be stated in terms of our Graph table as: If row (a,b) and row (b,c) are in the Graph table, then so is row (a,c). Such a graph is called closed.
Transitive closure partitions the graph into subgraphs, each of which is complete. This concept can be applied to HTML and other files in an Internet or Intranet Web site. Closure lets you quickly find those subgraphs for any member (n) with a query such as this:
Inserting a new edge and adding the other rows of its transitive closure takes more work, however. You will need something like this in a procedure body or script, which will add the following rows to the Graph table:
CREATE PROCEDURE TransClose
(j INTEGER NOT NULL, k INTEGER NOT NULL)
AS
BEGIN ATOMIC
ý create new edge first
INSERT INTO Graph (init, term) VALUES (j, k);
INSERT INTO Graph (init, term) VALUES (k, j);
COMMIT;
INSERT INTO Graph (init, term)
SELECT DISTINCT G2.init, G3.term
FROM Graph AS G1, Graph AS G2, Graph AS G3
WHERE -- define G1 as the new edge
(G1.init IN (j,k) AND G1.term IN (j,k))
AND (G2.term = G1.init AND G1.term = G3.init)
ý link G2 and G3 together thru G1
AND G2.term IN (j,k)
AND G3.init IN (j,k)
ý do not recreate any existing edges
AND NOT EXISTS (SELECT *
FROM Graph AS G4
WHERE G4.init = G2.init
AND G4.term = G3.term);
COMMIT;
END;
I'll dissect this with the diagram shown in Figure 2.
Subgraph G1 is the new edge I added. Subgraph G2 is all of the old edges that end at G1. Subgraph G3 is all of the old edges that start at G1. I want to make new edges with the nodes of G2 and G3. This query will result in a lot of redundant pairs in spite of using the select distinct option. Because I also want to avoid repeating any edges that already exist in the table, the where clause finishes with a test to see that the edge I created is indeed a new one.
| Graph | |
| init | term |
| ========== | |
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 2 | 1 |
| 2 | 2 |
| 2 | 3 |
| 2 | 4 |
| 3 | 1 |
| 3 | 2 |
| 3 | 3 |
| 3 | 4 |
| 4 | 1 |
| 4 | 2 |
| 4 | 3 |
| 4 | 4 |
| 5 | 5 |
If the size of the table is a problem, you can define it with this Data Definition Language:
CREATE TABLE HalfGraph
(init INTEGER NOT NULL
REFERENCES Nodes(node_nbr)
ON DELETE CASCADE,
term INTEGER NOT NULL
REFERENCES Nodes(node_nbr)
ON DELETE CASCADE,
CHECK (init < term),
PRIMARY KEY (init, term));
and then reconstruct the full graph with this view.
CREATE VIEW Graph (init, term)
AS SELECT init, term
FROM HalfGraph
UNION
SELECT term, init
FROM HalfGraph
UNION
SELECT init, init
FROM HalfGraph
UNION
SELECT term, term
FROM HalfGraph;
The problem is that this view is not updatable, so you will have some tricky predicates in your stored procedures.
For example, the graph in Figure 3, which is defined by this table:
CREATE TABLE Graph1
(init INTEGER NOT NULL,
term INTEGER NOT NULL);
INSERT INTO Graph1 VALUES (1, 1);
INSERT INTO Graph1 VALUES (2, 2);
INSERT INTO Graph1 VALUES (3, 3);
INSERT INTO Graph1 VALUES (4, 4);
INSERT INTO Graph1 VALUES (1, 2);
INSERT INTO Graph1 VALUES (2, 1);
INSERT INTO Graph1 VALUES (2, 3);
INSERT INTO Graph1 VALUES (3, 2);
INSERT INTO Graph1 VALUES (1, 3);
INSERT INTO Graph1 VALUES (3, 1);
INSERT INTO Graph1 VALUES (3, 4);
INSERT INTO Graph1 VALUES (4, 3);
has a walk of length three from node 1 back to node 4 by visiting nodes 1, 2, 3, and 4 in that order. There is a walk of length four from node 1 to node 2 by visiting nodes 1, 3, 4, 3, and 2 in that order. However, you will notice that there is no walk of length three between node 3 and node 4.
What I would like you to come up with is a query that will produce a table with the number of walks of length two for every pair of nodes in the graph. You do not have to tell me what the walks actually are -- I just want to know how many there are. The query should generalize to a walk of any length.



CREATE VIEW Walk1 (i, j, tally) AS SELECT init, term, IF init = term THEN 0 ELSE 1 ENDIF FROM Graph1;We now do a selfjoin on the view to get the square of the matrix. I will put this query in another view, so that it can be used for computing longer and longer walks.
CREATE VIEW Walk2 (i, j, tally) AS SELECT W1.i, W2.j, SUM(W1.tally * W2.tally) FROM Walk1 AS W1, Walk1 AS W2 WHERE W1.j = W2.i GROUP BY W1.i, W2.j;
| Walk2 | ||
| i | j | tally |
| ============== | ||
| 1 | 1 | 2 |
| 1 | 2 | 1 |
| 1 | 3 | 1 |
| 1 | 4 | 1 |
| 2 | 1 | 1 |
| 2 | 2 | 2 |
| 2 | 3 | 1 |
| 2 | 4 | 1 |
| 3 | 1 | 1 |
| 3 | 2 | 1 |
| 3 | 3 | 3 |
| 3 | 4 | 0 |
| 4 | 1 | 1 |
| 4 | 2 | 1 |
| 4 | 3 | 0 |
| 4 | 4 | 1 |