DBMS, April 1997
DBMS Online: SQL for Smarties By Joe Celko

Graphs in SQL

Joe explains graphs in SQL and provides a graph puzzle.

Everyone in the trade press is required to write something about the Web and the Internet in their column. It should deal with commercial pornography, child molesting, international terrorism, and all the stuff that sells magazines.

Well, I don't have anything on that right now. I just haven't been out terrorizing child molesters with pornography with my stolen credit card numbers for a while.

In fact, I am in the mood to write one of my boring technical columns about SQL programming tricks. How can I fool my editor into thinking that I am on top of the right trends? The Internet is a network, and networks are modeled with graphs, so I am doing a column on representing graphs in SQL. Yeah, that's the ticket!

Graphs in SQL

Graphs are mathematical structures made up of nodes connected by edges (also called vertices and arcs in the literature) that are used to represent networks, roads, railroad lines, or just about anything else you can think of. Sometimes the edges carry information, such as road names; sometimes the nodes carry information, such as city names; and sometimes they both have meanings.

The edges can be directed or undirected. A directed edge shows a relationship between the two nodes that runs in one direction only, such as the path of an escalator between two floors. An undirected edge shows a symmetric relationship between the two nodes it connects, such as a staircase that runs both ways between two floors. Edges can be named themselves (for example, "Highway I-20"), but in most data processing examples an edge is identified by its initial and terminal nodes in what is called an adjacency table. Let's build two tables, one for the nodes, which we will make cities, and one for the graph itself, which will be Internet connections between them:

CREATE TABLE Nodes
(node_nbr INTEGER NOT NULL PRIMARY KEY, 
city_name CHAR(20) NOT NULL);
CREATE TABLE Graph 
(init INTEGER NOT NULL 
REFERENCES Nodes(node_nbr)
ON DELETE CASCADE, 
term INTEGER NOT NULL 
REFERENCES Nodes(node_nbr)
ON DELETE CASCADE);

To keep information on the edges, I would need another table:

CREATE TABLE Edges
(line_nbr INTEGER NOT NULL PRIMARY KEY,
init INTEGER NOT NULL, -- where the line starts
term INTEGER NOT NULL, -- where the line terminates
line_name CHAR(20) NOT NULL,
UNIQUE (init, term),
CHECK (init < term));
For this discussion, I will use only the Node and Graph tables and model undirected graphs. I now need some conventions and constraints: SELECT (COUNT(*) * COUNT(*)) AS total_rows
FROM nodes;

Remember that I am storing only the actual edges, so the table will probably be much smaller. But even if it is not, the rows are only two integers wide, so the physical size will be fairly small. To see how this works, let's start off with a small table of unconnected nodes. First let's decide on the cities in the network:

INSERT INTO Nodes (1, 'Atlanta');
INSERT INTO Nodes (2, 'Boston');
INSERT INTO Nodes (3, 'Chicago');
INSERT INTO Nodes (4, 'Detroit');
INSERT INTO Nodes (5, 'Evanston');

Then we need to make sure that they can at least talk to themselves:

INSERT INTO Graph VALUES (1, 1);
INSERT INTO Graph VALUES (2, 2);
INSERT INTO Graph VALUES (3, 3);
INSERT INTO Graph VALUES (4, 4);
INSERT INTO Graph VALUES (5, 5);

The general code for this operation should be put into a stored procedure. Skipping over the error handling and other fine points, the procedure looks like this:

CREATE PROCEDURE NewNode
(IN new_node INTEGER,
IN city_name CHAR(20))
BEGIN ATOMIC
INSERT INTO Nodes VALUES (new_node, city_name); 
INSERT INTO Graph VALUES (new_node, new_node); 
END;
To connect two nodes, I must build this table with node_nbr pairs to avoid NULLs and to follow the first constraint. (See Figure 1.) Again skipping over the error handling and other fine points, the procedure looks like this:
CREATE PROCEDURE NewEdge
(IN new_node_1 INTEGER,
IN new_node_2 INTEGER)
BEGIN ATOMIC
INSERT INTO Graph (init, term) 
VALUES (new_node_1, new_node_2);
INSERT INTO Graph (init, term) 
VALUES (new_node_2, new_node_1);
END;
Right now the graph has no edges, so I'll make some:

CALL New Edge(1,2);
CALL New Edge(3,4);

The graph and the Graph table now looks like this:

Graph
init term
============
1 1
1 2
2 1
2 2
3 3
4 4
5 5
If I were to connect node 2 (Boston) to node 3 (Chicago), there would be a path from node 1 (Atlanta) to node 3 (Chicago) by traveling through node 2 (Boston), and a path from node 1 (Atlanta) to node 4 (Detroit) by traveling through node 2 (Boston) then node 3 (Chicago). As I add more edges, the graph will have more paths to other nodes. A graph with a single edge path from any node to all other nodes is a complete graph.

A condition that is not quite as strong as completeness is the transitive closure of a graph. This is the state of our graph right now. The definition can be stated in terms of our Graph table as: If row (a,b) and row (b,c) are in the Graph table, then so is row (a,c). Such a graph is called closed.

Transitive closure partitions the graph into subgraphs, each of which is complete. This concept can be applied to HTML and other files in an Internet or Intranet Web site. Closure lets you quickly find those subgraphs for any member (n) with a query such as this:

SELECT :n, ' connects to ', term
FROM Graph
WHERE init = :n;
Deleting a node is as easy as it was to add it. When you remove a row from the Nodes table, the action will cascade over and remove all rows wherever it appeared as either a init or term value in the Graph table. Referential actions are wonderful things!

Inserting a new edge and adding the other rows of its transitive closure takes more work, however. You will need something like this in a procedure body or script, which will add the following rows to the Graph table:

  1. The new (j, k) row and its reversal (k, j).
  2. All of the pairs made from j and all of the term values that start at k, and their reversals.
  3. All of the pairs made from k and all of the term values that start at j, and their reversals.
CREATE PROCEDURE TransClose 
(j INTEGER NOT NULL, k INTEGER NOT NULL)
AS 
BEGIN ATOMIC
ý create new edge first
INSERT INTO Graph (init, term) VALUES (j, k); 
INSERT INTO Graph (init, term) VALUES (k, j); 
COMMIT;
INSERT INTO Graph (init, term)
SELECT DISTINCT G2.init, G3.term 
  FROM Graph AS G1, Graph AS G2, Graph AS G3 
WHERE -- define G1 as the new edge
  (G1.init IN (j,k) AND G1.term IN (j,k))
     AND (G2.term = G1.init AND G1.term = G3.init) 
ý link G2 and G3 together thru G1
     AND G2.term IN (j,k) 
     AND G3.init IN (j,k) 
ý do not recreate any existing edges
     AND NOT EXISTS (SELECT * 
         FROM Graph AS G4 
         WHERE G4.init = G2.init 
         AND G4.term = G3.term); 
COMMIT;
END;
I'll dissect this with the diagram shown in Figure 2.

Subgraph G1 is the new edge I added. Subgraph G2 is all of the old edges that end at G1. Subgraph G3 is all of the old edges that start at G1. I want to make new edges with the nodes of G2 and G3. This query will result in a lot of redundant pairs in spite of using the select distinct option. Because I also want to avoid repeating any edges that already exist in the table, the where clause finishes with a test to see that the edge I created is indeed a new one.

Graph
initterm
==========
1 1
1 2
1 3
1 4
2 1
2 2
2 3
2 4
3 1
3 2
3 3
3 4
4 1
4 2
4 3
4 4
5 5

If the size of the table is a problem, you can define it with this Data Definition Language:

CREATE TABLE HalfGraph 
  (init INTEGER NOT NULL 
     REFERENCES Nodes(node_nbr)
     ON DELETE CASCADE, 
term INTEGER NOT NULL 
     REFERENCES Nodes(node_nbr)
     ON DELETE CASCADE,
CHECK (init < term),
PRIMARY KEY (init, term));
     
and then reconstruct the full graph with this view.
CREATE VIEW Graph (init, term)
AS SELECT init, term 
     FROM HalfGraph
   UNION 
   SELECT term, init
     FROM HalfGraph
   UNION 
   SELECT init, init
     FROM HalfGraph
   UNION 
   SELECT term, term 
     FROM HalfGraph;
     
The problem is that this view is not updatable, so you will have some tricky predicates in your stored procedures.

Puzzle

This puzzle has to do with graphs, and it will probably make you do some research, too. (You can find this month's puzzle answer below.) A path is a set of connected edges in a graph that starts at one node and ends at another without any backtracking. A walk is also a set of connected edges in a graph that starts at one node and ends at another, but it permits you to backtrack. You measure the length of a walk by the number of edges you travel to get from the start to the finish.

For example, the graph in Figure 3, which is defined by this table:

CREATE TABLE Graph1
(init INTEGER NOT NULL,
term INTEGER NOT NULL);

INSERT INTO Graph1 VALUES (1, 1);
INSERT INTO Graph1 VALUES (2, 2);
INSERT INTO Graph1 VALUES (3, 3);
INSERT INTO Graph1 VALUES (4, 4);
INSERT INTO Graph1 VALUES (1, 2);
INSERT INTO Graph1 VALUES (2, 1);
INSERT INTO Graph1 VALUES (2, 3);
INSERT INTO Graph1 VALUES (3, 2);
INSERT INTO Graph1 VALUES (1, 3);
INSERT INTO Graph1 VALUES (3, 1);
INSERT INTO Graph1 VALUES (3, 4);
INSERT INTO Graph1 VALUES (4, 3);

has a walk of length three from node 1 back to node 4 by visiting nodes 1, 2, 3, and 4 in that order. There is a walk of length four from node 1 to node 2 by visiting nodes 1, 3, 4, 3, and 2 in that order. However, you will notice that there is no walk of length three between node 3 and node 4.

What I would like you to come up with is a query that will produce a table with the number of walks of length two for every pair of nodes in the graph. You do not have to tell me what the walks actually are -- I just want to know how many there are. The query should generalize to a walk of any length.


Joe Celko is an Atlanta-based guru with Northern Lights Software Ltd. and a member of the ANSI X3H2 Database Standards Committee. He is also the author of two books on SQL: SQL For Smarties (Morgan-Kaufmann, 1995) and Instant SQL Programming (Wrox Press, 1995). You can contact Joe via CompuServe at 71062,1056 or via email at 71062.1056@compuserve.com.


Figure 1.


--This figure shows the nodes in the graph. Some of the nodes are connected, adding edges to the graph.


Figure 2.


--G1, G2, and G3 are the subgraphs of the graph.


Figure 3.


--This graph shows walks used to go from node to node.


Answer:

You probably had to research this one to discover that an adjacency matrix of the graph raised to the nth power gives you the number of walks of length n. An adjacency matrix of a graph is a square matrix in which the rows are starting nodes and the columns are the finish nodes; the main diagonal is always zeros. Our adjacency list model can be used like an adjacency matrix in a query that performs matrix multiplication. The adjacency list is already a table of walks of length one. We've already solved the puzzle for one case! However, we need to add column to the table for the tally of walks. Let's do that with a view:
CREATE VIEW Walk1 (i, j, tally)
AS SELECT init, term, IF init = term THEN 0 ELSE 1 ENDIF
   FROM Graph1;
We now do a selfjoin on the view to get the square of the matrix. I will put this query in another view, so that it can be used for computing longer and longer walks.
CREATE VIEW Walk2 (i, j, tally)
AS SELECT W1.i, W2.j, SUM(W1.tally * W2.tally)
   FROM Walk1 AS W1, Walk1 AS W2
   WHERE W1.j = W2.i
   GROUP BY  W1.i, W2.j;
  
Walk2
i j tally
==============
1 1 2
1 2 1
1 3 1
1 4 1
2 1 1
2 2 2
231
2 4 1
3 1 1
3 2 1
3 3 3
3 4 0
4 1 1
4 2 1
4 3 0
4 4 1



Subscribe to DBMS and Internet Systems -- It's free for qualified readers in the United States
April 1997 Table of Contents | Other Contents | Article Index | Search | Site Index | Home

DBMS and Internet Systems (http://www.dbmsmag.com)
Copyright © 1997 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated Tuesday, March 18, 1997