Mittwoch, 30. September 2015

Directed Graphs in PostgreSQL

In this post we want to make use of CTE in PostgreSQL to compute paths in directed graphs. We consider only the edges of the graphs, this means, that if some graph has a vertex without any edge in or out of this vertex, it is not shown here.
Given the edges of some graph, we want to construct the startpoints and endpoints only of paths, which are not contained in other paths.
First we want to construct some different graphs and we want also to note the desired output of the query when applied to the specific graph, to clarify what our solution should do.
Lets start with a graph wich consinsts of three components:

drop table if exists p;
create table p(
  startPoint integer,
  endPoint integer);

-- example 1:
delete from p;
insert into p(startPoint, endPoint) values (4,3),(2,5),(5,20),(1,10),(10,15),(15,16),(16,18);

-- desired output of example 1:
-- startpoint | endpoint | distance |     path
--------------+----------+----------+---------------
--          1 |       18 |        4 | 1,10,15,16,18
--          2 |       20 |        2 | 2,5,20
--          4 |        3 |        1 | 4,3
In the above example, we do not want to see for example startpoint=10, endpoint=16, distance=2 and path=10,15,16 because this path is already contained in the first path of the above table, which starts at 1 and ends at 18.
Now we want to take a look at a more complex example. In the following graph, we have two components and also paths of interest which are of length two, namely 1,4.
-- example 2:
delete from p;
insert into p(startPoint,endPoint) values (3,9),(9,2),(9,10),(10,7),(3,17),(1,4),(6,5),(5,9);

-- desired output of example 2:
-- startpoint | endpoint | distance |    path
--------------+----------+----------+------------
--          1 |        4 |        1 | 1,4
--          3 |        2 |        2 | 3,9,2
--          3 |        7 |        3 | 3,9,10,7
--          3 |       17 |        1 | 3,17
--          6 |        2 |        3 | 6,5,9,2
--          6 |        7 |        4 | 6,5,9,10,7
In this example we have several starting points occuring more than once, namely 3 and 6 which is ok, since they correspond to different paths.
What about cycles in a graph?
Should the query go in an infinte loop?
Should it terminate, and if it terminates, what should the output be?
Here is the most simple cyclic graph with more than one vertex:
It makes sense, that the query terminates, and if some component of the graph contains a loop, than this component should not be shown. In the above graph, the result would be an empty table:
-- example 3: (graph contains cycles)
delete from p;
insert into p(startPoint, endPoint) values (1,2),(2,1);

-- desired output of example 3:
-- startpoint | endpoint | distance | path
--------------+----------+----------+------

-- notice that in this example the desired output is an empty table, since it makes no sense to define the starting and the end poin of a cycle;
What happens if a graph contains multiple routes from the same starting and endpoints? Here a small graph :
In this case all paths should be shown. (One might consider an alternative solution for example, to take only the shortest or the longest path.):
-- example 4: (graph contains multiple routes)
delete from p;
insert into p(startPoint, endPoint) values (1,2),(2,3),(1,3);

-- desired output of example 3:
-- startpoint | endpoint | distance | path
--------------+----------+----------+-------
--          1 |        3 |        1 | 1,3
--          1 |        3 |        2 | 1,2,3
In the last example, we want to consider a graph which has more than one component one of which contains a cycle and the other not:
Here we want to show only the paths of the component which has no cycle:
-- example 5: (graph contains two components, one of which contains a cycle, the other not)
delete from p;
insert into p(startPoint, endPoint) values (1,2),(2,1),(3,4),(4,5);

-- notice that in this example, it makes sense to ignore the component of the graph which contains a cycle and to just show the paths of the other components which do not have a cycle

-- desired output of example 4:
-- startpoint | endpoint | distance | path
--------------+----------+----------+-------
--          3 |        5 |        2 | 3,4,5
All this can be done with CTE in PostgreSQL, which at this time is not available in MySQL. The basic idea to recursively define a table is the same as the definition of a recursive function. In the recursive function you need some starting value. This starting value is implemented through a "starting table", or "starting query". This query is then "UNION [ALL]" to the recursive part.
One solution is:
with recursive t(intermediatePoint,startPoint,endPoint,distance,path) as (
    select
        ( select greatest(max(startPoint),max(endPoint)) from p)+1 as  intermediatePoint,
        startPoint, endPoint, 1 as distance, cast(startPoint as text) || ',' || cast(endPoint as text) from p
      union all
    select t2.startPoint as intermediatePoint, t1.startPoint, t2.endPoint,1+t2.distance, cast(t1.startPoint as text) || ',' || t2.path from p t1, t t2
     where t1.endPoint = t2.startPoint
      and 1+t2.distance <= (select count(*) from p)
) select startpoint, endpoint,distance,path from t
     where
       startPoint not in (select intermediatePoint from t)
       and endPoint not in (select intermediatePoint from t)
     order by startpoint, endpoint;
To keep track if some path is contained in some other path, we keep track of intermediate points, when joining together two paths. In this way we can later recognize if some edge is an intermediate point or not. If the starting or ending points have been considered to be intermediate points, than the corresponding entry is not shown, because this means, that this path is contained in some other path.
Since paths of length two (edges) do not have intermediate points, but we have to assign them also a value, because of "UNION [ALL]", on idea would be to assign them the value NULL. Unfortunotely this did not work. I guess this is becaus in the definiton of the recursive table t(intermediatPoint,startPoint,endPoint,distance,path) we have nothing said about the data-types of the columns involved, for example of the column intermediatePoint. So I guess, to determine the data-type of this column, the data-type of the corresponding column before the "UNION-ALL" is taken, but this guess might be wrong. So in order to have some value of which is distinct from all edges, I decided to take the maximum over all edges and add one to it.
To avoid the query to go into an infinte loop when it tries to join paths of a cycle, we make use of the following observation:
The start and endpoints with corresponding paths which are of interest to us can have at most a length which is equal to the number of edges in the graph. This would correspond to, a graph of consecutive integers where there is an edge from every integer n to n+1 for example 1,2,3,..,100,101.

Keine Kommentare:

Kommentar veröffentlichen