Search in Graph Theory


Generally speaking, search is one of the most fundamental techniques for designing algorithms. It means visiting a set of interconnected objects in some specific manner that will hopefully solve the problem, most commonly in a depth-first or breadth-first manner, until we find the object or state that is the answer to the problem.

For example, in the problem of finding a solution to a Sudoku puzzle, one may apply depth-first search by trying to fill possible digits from 1 to 9 into vacant slots, and backtracking when a filling does not lead to a feasible solution. In the problem of finding the shortest path in a 2-dimensional maze, most commonly described by a 2-dimensional array of 0s and 1s, one might apply breadth-first search instead, trying one move at a time and ensuring the resulting path is the shortest.

The term "search in graph theory" means to search a graph in its algorithmic representation as input. However, given that a "search in general" is to visit interconnected objects, if the search algorithm is guaranteed to halt, we are able to create a finite set of objects as the vertex set, with their interconnections as the edge set, and build a graph (usually called the implicit graph of the problem, while the one in its algorithmic representation is said to be explicit) such that searching in it yields the answer equivalently, although we won't do so due to efficiency and space complexity issues.

In fact, our point of interest in search in graph theory differs quite vastly from the one in general. When referring to search in general, we mostly refer to searching for a specific object or state; when referring to search in graph theory, we mostly won't care about a specific vertex, but about the structural information reflected during the search. You will see what this means after reading this text, I think.

Search Forests

The principle of search, given that it's a fundamental technique, is quite simple. When it comes to graph theory, there's not much difference. What we care most about are still breadth-first search (BFS) and depth-first search (DFS), mostly in the following forms:

Algorithm (Framework of Breadth-First Search in Graph)

The input is a graph or a digraph $G = (V, E)$; we'll have performed a breadth-first search when the algorithm halts:

  1. Initialize empty queue $Q$, and empty set $S$ to track enqueued vertices.
  2. Loop while $S \neq V$:
    1. Pick up any $r \in V \setminus S$, enqueue it into $Q$, and update $S \leftarrow S \cup \{r\}$.
    2. Loop while $Q$ is not empty:
      1. Dequeue $u$ from $Q$, and visit it.
      2. Iterate for every edge $uv \in E$:
        1. If $v \notin S$, enqueue $v$ into $Q$ and update $S \leftarrow S \cup \{v\}$.

For convenience, we say each iteration at step 2 to be one pass of BFS at $r$.
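To make the framework concrete, here is a minimal Python sketch of it; the dict-of-lists adjacency representation and the `visit` callback are assumptions of this sketch, not notation from this text:

```python
from collections import deque

def bfs(adj, visit=print):
    """Framework of BFS: adj maps each vertex to the list of its
    out-neighbors; visit is called once per vertex."""
    enqueued = set()                      # the set of enqueued vertices
    for r in adj:                         # step 2: pick any vertex not yet enqueued
        if r in enqueued:
            continue
        queue = deque([r])                # one pass of BFS at r
        enqueued.add(r)
        while queue:                      # step 2.2
            u = queue.popleft()           # step 2.2.1: dequeue and visit u
            visit(u)
            for v in adj[u]:              # step 2.2.2: check every edge uv
                if v not in enqueued:     # step 2.2.2.1
                    enqueued.add(v)
                    queue.append(v)

# Two passes are needed here, since vertex 3 is unreachable from the others.
bfs({0: [1, 2], 1: [2], 2: [], 3: [0]})  # prints 0, 1, 2, 3
```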

Convention (Iterating Edge Set Incident to Vertex)

For graph or digraph $G = (V, E)$, thanks to its algorithmic representation, we may iterate the edges incident to a vertex by retrieving the next edge of the currently visited edge.

That is, for every vertex $u$ we define an iteration that begins by returning the first edge incident to $u$, and on each subsequent request returns the next edge of the current one, until it signals the end after the last edge has been visited.
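In code, this convention is nothing more than an iterator per vertex; a tiny sketch under the same assumed dict-of-lists representation:

```python
def edges_of(adj, u):
    """Begin the iteration at the first edge of u; each next() call
    returns the next edge uv, and StopIteration signals the end."""
    for v in adj[u]:
        yield (u, v)

it = edges_of({0: [1, 2]}, 0)
print(next(it))   # (0, 1) -- the first edge of vertex 0
print(next(it))   # (0, 2) -- the next (and last) edge
```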

Algorithm (Framework of Depth-First Search in Graph)

The input is a graph or a digraph $G = (V, E)$. We'll have performed a depth-first search when the algorithm halts:

  1. Initialize empty stack $T$, and empty set $S$ to track visited vertices.
  2. Loop while $S \neq V$:
    1. Pick up any $r \in V \setminus S$, push $(r, \varnothing)$ onto $T$.
    2. Loop while $T$ is not empty:
      1. Pop $(u, e)$ from $T$; visit $u$ and update $S \leftarrow S \cup \{u\}$ if $u \notin S$.
      2. Get $e'$, the next edge of $u$ after $e$ (or the first edge of $u$ if $e = \varnothing$), and if $e' \neq \varnothing$ (let $e' = uv$):
        1. Push $(u, e')$ onto $T$.
        2. If $v \notin S$, push $(v, \varnothing)$ onto $T$.

For convenience, we say each iteration at step 2 to be one pass of DFS at $r$.
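A Python sketch of this stack-based framework follows; representing the second component of each stack item as an index into the adjacency list is an assumption of this sketch, so that popping $(u, i)$ means the next edge of $u$ to check is the $i$-th one:

```python
def dfs(adj, visit=print):
    """Framework of DFS: stack items are pairs (u, i), where i indexes
    the next edge of u to check in its adjacency list."""
    visited = set()                        # the set of visited vertices
    for r in adj:                          # step 2: pick any unvisited vertex
        if r in visited:
            continue
        stack = [(r, 0)]                   # one pass of DFS at r
        while stack:
            u, i = stack.pop()             # step 2.2.1
            if u not in visited:
                visited.add(u)
                visit(u)
            if i < len(adj[u]):            # step 2.2.2: a next edge exists
                stack.append((u, i + 1))   # push u back, advanced by one edge
                v = adj[u][i]
                if v not in visited:
                    stack.append((v, 0))   # the new endpoint goes on top

dfs({0: [1, 2], 1: [2], 2: [], 3: [0]})   # prints 0, 1, 2, 3
```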

And it's quite easy to see their time and space complexities:

Theorem
Both frameworks of BFS and DFS in a graph run in time complexity $O(|V| + |E|)$ and space complexity $O(|V|)$.

Proof. It won't be hard to see that every vertex is visited exactly once, and every edge incident to the vertex is checked exactly once in a digraph or at most twice in a graph. To book-keep the set of enqueued or visited vertices, a bitset of $|V|$ bits suffices, in which each set operation can be done in $O(1)$. To find the next vertex for each pass of BFS / DFS, it suffices to linearly scan the bits in the bitset of $S$ from where the previous scan stopped, which is $O(|V|)$ amortized over all passes. Thus the time complexity is eventually $O(|V| + |E|)$.

In BFS, every vertex will be enqueued only once; in DFS, every stack item is associated with a vertex, and each vertex has at most one item on the stack at any moment. So there're at most $|V|$ items in the queue or on the stack, and thus the space complexity is $O(|V|)$.

Due to the manner of visiting all vertices and edges in a graph or a digraph, a search in graph theory is sometimes called a graph traversal.

In practice, every search-based graph-theoretic algorithm can be seen as a variant of the frameworks of BFS and DFS above. They might halt prematurely when they've found their target; but given that the input graph can be arranged deliberately, so that the target to find appears as the last vertex to visit or edge to check, we must expect every search-based algorithm to traverse the whole graph in general. So $O(|V| + |E|)$ should serve as a conservative estimate of the worst-case time complexity of every search-based algorithm, provided the accumulated time complexity of the extra operations introduced by the very algorithm does not exceed it.

The principle of search in graph theory is just as simple as search in general, with nothing special. But if you think that's everything about search in graph theory, then you will miss the whole forest. And this time, it's indeed a forest, called the search forest.

Example

As an example for demonstration, let a digraph with its adjacency list be shown as below:

Consider performing BFS and DFS on this digraph. We scan the adjacency list from top to bottom for unvisited vertices. For each vertex, we visit its incident edges by iterating the linked list chained into its corresponding slot of the adjacency list.

We want to represent how the vertices are being visited during BFS or DFS in some way. One way is to arrange them in a search forest, which is defined as:

Definition (Search Forests)

For a BFS or DFS on a specific algorithmic representation of a graph or digraph, a search forest is a way to arrange the vertices by their order of being visited in the search. It's a collection of search trees, in which every tree fulfills:

  1. Every vertex selected as the vertex to start a pass of BFS or DFS is the root of a search tree;
  2. If edge $uv$ causes $v$ to be enqueued into the BFS queue, or causes $v$ to be pushed onto the DFS stack, then $v$ is a child of $u$, with edge $uv$ connecting them;
  3. If $v_1, v_2$ are children of $u$ through edges $uv_1$ and $uv_2$, and $uv_1$ is checked earlier than $uv_2$ (thus $v_1$ is dequeued from the BFS queue or popped from the DFS stack earlier than $v_2$), then the subtree of $v_1$ is on the left of the subtree of $v_2$.

Thus, every search tree is a rooted tree whose order of children matters. And in addition to the order of children, the order of the search trees in a search forest also matters: a search tree visited earlier should appear on the left of one visited later.

For convenience, we usually add a modifier to the search forest to indicate what kind of search it is for, e.g. a depth-first search forest / DFS forest, a breadth-first search forest / BFS forest.

According to this definition, the BFS forest and DFS forest of the digraph we present can be shown as below:

The claim that the BFS forest and DFS forest are defined on a specific algorithmic representation of a graph means that just a minor change in the algorithmic representation can lead to another search forest of completely different structure.

For example, consider permuting the edges in the adjacency list of the digraph we've just visited above. For readability, the differences have been marked red:

The permuted adjacency list represents the same digraph semantically. However, when we run BFS and DFS on the digraph of the permuted adjacency list, we will see:

The difference in the BFS forests is relatively minor: just a permutation of subtrees can be seen. However, in the DFS forests, even the parenthood of vertices has changed, which is a major difference, as the resulting forest lands in a completely different isomorphism class.

Semantically, if a search-based graph-theoretic algorithm is to be correct, it must be able to handle whatever algorithmic representation of the input graph. That is, every possible visiting order of vertices and edges must be considered, which is equivalent to ensuring that the algorithm works on every possible search forest of the input graph.

In practice, many search-based algorithms are designed to extract structural information from a graph, which is preserved by graph isomorphisms, e.g. cut vertices / edges of a graph, or SCCs of a digraph (we will introduce them later in this text). And if the algorithm is designed for this purpose, semantically it must work on every graph in the same isomorphism class; that is, it must be able to handle every search forest of every graph in the isomorphism class.

To see how hard it is, let's consider a digraph isomorphic to the previous digraph we demonstrated, obtained by relabeling its vertices:

And the BFS forest and DFS forest of the isomorphic digraph, with respect to the adjacency list above, are shown below:

As we can see, the search forest has nearly nothing in common with the previous ones. Thus, designing a correct search-based algorithm to work on a whole isomorphism class would be an insurmountable task, without regaining some tractability over the search forests.

Categorization of the Edges

One of the most important purposes of defining the search forests is to categorize the edges in graphs and digraphs:

Convention (Drawing Graphs with Respect to Search Forests)

Given a search forest of the graph or digraph $G = (V, E)$, clearly all edges in the search forest are from $E$, and those edges used to construct the search forest are called the tree edges. In general, not all edges in $E$ are tree edges, and those that aren't are said to be non-tree edges.

A path made up purely of tree edges is said to be a tree path; it's to be found in a search tree.

When drawing a search forest, we usually draw the non-tree edges in addition to the tree edges, so that we can study the categorization of edges with respect to this search forest. If we are to do so, we will draw the tree edges with solid lines and the non-tree edges with dashed lines.

Note that if we draw in such a way, all vertices in $V$ and all edges in $E$ of the graph or digraph appear in the drawing, and the interconnections between vertices are unchanged. So we obtain another plot of the graph or digraph with respect to the search forest by drawing in such a way.

Definition (Categorization of the Edges)
Clearly, due to the implementation of BFS and DFS, if $uv$ is to be a non-tree edge, $v$ must have been enqueued / visited when $u$ is being visited and $uv$ is being checked. According to the relationship between $u$ and $v$ in the search forest when $uv$ is checked for the first time, we may categorize non-tree edge $uv$ into one of these three categories:

  1. Back edge: if $v$ is an ancestor of $u$, or $u = v$.
  2. Forward edge: if $v$ is a descendant of $u$.
  3. Cross edge: if none of the cases above is applicable.

According to the definition, we may draw a digraph with respect to one of its BFS forests and one of its DFS forests in the following way:

(I've hidden the labels of the tree edges in case they get too messy.)

In the BFS forest, we can identify the back edge and the cross edges; in the DFS forest, we can identify the back edge, the forward edges, and the cross edge.

But what can we do with categorizing all these edges? Well, one of the simplest usages is to check whether a graph or digraph is acyclic. First, it's easy to see:

Lemma
If there's a back edge in the search forest, then there's a cycle.
Proof. Let $uv$ be the back edge, such that $v$ is an ancestor of $u$ or $v = u$. Clearly there's a $vu$-tree-path $P$ to be found in the search tree, and $P + uv$ is a cycle.

One may start to use this condition as the principle of a cycle detection algorithm. However, the condition is sufficient, not necessary. And if we are not able to prove its necessity, there may be a search forest without any back edge, of a graph with a cycle in it, and an algorithm based on such a condition would not be correct.

In fact, one can construct a BFS forest in which there's no back edge, while the searched digraph clearly contains a cycle. So if one implements a cycle detection algorithm based on BFS, he's doomed.

However, as we will see later in this text, the condition is necessary in a DFS forest, so cycle detection based on DFS may work.

Search Colorings and Timestamps

To study the properties of the search forest, given that it's generated from a specific BFS or DFS, we must stick to the behaviours of BFS and DFS themselves.

First, one can easily see that a vertex is in one of these three states during a BFS or DFS:

  1. Unvisited: the vertex has not been dequeued from the BFS queue, or popped from the DFS stack.
  2. Visiting: the vertex has been dequeued from the BFS queue, or popped from the DFS stack, and its incident edges are being iterated and checked.
  3. Visited: the edges incident to this vertex have all been iterated and checked.

And soon we will see that keeping track of these states is useful for studying the behaviours of BFS and DFS. So we tweak our frameworks of BFS and DFS with search colorings and timestamps (SCT) to fulfill this purpose:

Definition (Search Colorings and Timestamps)

A search coloring is a vertex coloring for indicating vertex states during a BFS or DFS. That is, we color a vertex white when it's in the unvisited state, gray when it's in the visiting state, and black when it's in the visited state. The search coloring also affects our way of drawing the graph and the search tree: we will color the vertices in white, gray or black to indicate their corresponding states.

Clearly, the vertex states are transient during the search, and a monolithic search coloring is insufficient to describe dynamic behaviour. So we introduce the timestamp, which is a counter starting at $0$ that increments by $1$ whenever there's a state transition. Meanwhile, we take multiple snapshots of the search coloring, and index them with respect to the timestamp.

In order to apply the search colorings, first we need to initialize the initial search coloring $C_0$, which marks all vertices as unvisited as expected, and the initial timestamp $t = 0$. Then, whenever there's a state transition of a vertex $v$, we perform an operation of updating the search coloring of $v$, which increments $t$, takes a snapshot $C_t$, and updates the color of $v$ in it, to synchronize the search coloring with the vertex states. And after that, if we want to retrieve the snapshot of the search coloring taken at timestamp $t$, we just need to refer to $C_t$.

Algorithm (Framework of BFS with SCT)
The input is a graph or a digraph $G = (V, E)$; we'll have performed a BFS with SCT when the algorithm halts:

  1. Initialize empty queue $Q$, and empty set $S$ to track enqueued vertices. Initialize the search coloring $C_0$ and timestamp $t = 0$.
  2. Loop while $S \neq V$:
    1. Pick up any $r \in V \setminus S$, enqueue it into $Q$, and update $S \leftarrow S \cup \{r\}$.
    2. Loop while $Q$ is not empty:
      1. Dequeue $u$ from $Q$, and visit it.
      2. Update the search coloring of $u$ to gray.
      3. Iterate for every edge $uv \in E$:
        1. If $v \notin S$, enqueue $v$ into $Q$ and update $S \leftarrow S \cup \{v\}$.
      4. Update the search coloring of $u$ to black.
Algorithm (Framework of DFS with SCT)
The input is a graph or a digraph $G = (V, E)$. We'll have performed a DFS with SCT when the algorithm halts:

  1. Initialize empty stack $T$, and empty set $S$ to track visited vertices. Initialize the search coloring $C_0$ and timestamp $t = 0$.
  2. Loop while $S \neq V$:
    1. Pick up any $r \in V \setminus S$, push $(r, \varnothing)$ onto $T$.
    2. Loop while $T$ is not empty:
      1. Pop $(u, e)$ from $T$; visit $u$ and update $S \leftarrow S \cup \{u\}$, as well as the search coloring of $u$ to gray, if $u \notin S$.
      2. Get $e'$, the next edge of $u$ after $e$ (or the first edge of $u$ if $e = \varnothing$).
      3. If $e' \neq \varnothing$ (let $e' = uv$):
        1. Push $(u, e')$ onto $T$.
        2. If $v \notin S$, push $(v, \varnothing)$ onto $T$.
      4. If $e' = \varnothing$, then update the search coloring of $u$ to black.
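As a concrete (and assumed) rendering of the bookkeeping above, the following Python sketch uses the recursive form of DFS, which is equivalent to the stack framework for inputs that don't exhaust Python's recursion limit; it returns the discover and finish timestamps along with the final coloring:

```python
WHITE, GRAY, BLACK = "white", "gray", "black"

def dfs_sct(adj):
    """DFS with search colorings and timestamps: returns the discover
    timestamps d, the finish timestamps f, and the final coloring."""
    color = {v: WHITE for v in adj}   # the initial coloring marks all white
    d, f = {}, {}
    t = 0                             # the global timestamp counter

    def go(u):
        nonlocal t
        t += 1                        # state transition: u turns gray
        d[u], color[u] = t, GRAY
        for v in adj[u]:
            if color[v] == WHITE:     # uv becomes a tree edge
                go(v)
        t += 1                        # state transition: u turns black
        f[u], color[u] = t, BLACK

    for r in adj:
        if color[r] == WHITE:         # one pass of DFS at r
            go(r)
    return d, f, color

d, f, _ = dfs_sct({0: [1, 2], 1: [2], 2: [], 3: [0]})
print(d, f)  # every v satisfies d[v] < f[v], and the counter ends at 2|V|
```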

So by integrating SCT into BFS and DFS, we are finally able to demonstrate their dynamic behaviours:

It won't be hard to find:

Lemma
The valid range for timestamp $t$ to index snapshot $C_t$ is $0 \le t \le 2|V|$, such that $C_0$ colors every vertex white and $C_{2|V|}$ colors every vertex black.
Proof. Due to the implementation of BFS and DFS, every vertex changes its color exactly twice, once from white to gray and once from gray to black, so it won't be hard to find all vertices colored black at timestamp $2|V|$.
Lemma

For every vertex $v$, there exists a discover timestamp $d(v)$ and a finish timestamp $f(v)$, such that for timestamp $t$:

$$C_t(v) = \begin{cases} \text{white} & 0 \le t < d(v) \\ \text{gray} & d(v) \le t < f(v) \\ \text{black} & f(v) \le t \le 2|V| \end{cases}$$

Proof. This is just another rephrasing of the fact that every vertex changes its color twice during a search, once from white to gray and once from gray to black.

Conventionally, we shall do:

Convention (Modifiers by Search Colorings)

For convenience, if at timestamp , we have for , then is said to be white at timestamp . We will also say to be gray or black at timestamp if is or .

At timestamp , if there's a -path such that every vertex in is white, then is said to be a (-)white-path at timestamp ; similarly if every vertex in is black, then is said to be a (-)black-path.

At timestamp , if every vertex in a search tree is white, then the search tree is said to be white at timestamp ; if every vertex in a search tree is black, then the search tree is said to be black at timestamp .

So we may also rephrase the lemmas above as:

Lemma
The valid range for timestamp $t$ to index snapshot $C_t$ is $0 \le t \le 2|V|$, such that every vertex is white at timestamp $0$ and every vertex is black at timestamp $2|V|$.
Lemma
For every vertex $v$, there exists a discover timestamp $d(v)$ and a finish timestamp $f(v)$, such that $v$ is white at timestamps $0 \le t < d(v)$, gray at timestamps $d(v) \le t < f(v)$, and black at timestamps $f(v) \le t \le 2|V|$.

Generally speaking, the "continuity" and "monotonicity" of the search colorings along the timeline can be utilized to model constraints in search trees. For example, if a vertex is white at timestamp $t_1$ and black at timestamp $t_2$, then it must have been visited somewhere in between $t_1$ and $t_2$. The truly hard part is about the timing of setting up checkpoints, as we will soon see.

Search Trees and Timestamps

For any search tree, regardless of BFS or DFS, it's easy to see:

Lemma

For every search tree $T$ in the search forest (in which we view $T$ as the subgraph of the searched graph or digraph induced by the vertices in the tree), let it be rooted at $r$; there must be a discover timestamp $d(T)$ and a finish timestamp $f(T)$, such that for any vertex $v$:

$$v \in T \iff d(T) \le d(v) < f(v) \le f(T)$$

However, it suffices to check whether $d(T) \le d(v) \le f(T)$, in order to determine whether $v \in T$ is true.

Proof. If $v \in T$, clearly it must be discovered after the tree begins its construction at timestamp $d(T)$, and finish being visited before the tree finishes its construction at timestamp $f(T)$. Conversely, for each pass of BFS or DFS, there's only one tree being constructed, that is, search trees are constructed one at a time. And if $v$ changes its color during the construction of $T$, it must be inside the tree.

For the comment that it suffices to check $d(T) \le d(v) \le f(T)$: from the implementation of BFS and DFS we can clearly see that once $v$ is dequeued from the BFS queue or popped from the DFS stack, the tree construction will not complete until all edges of $v$ have been checked, so a discover timestamp within range suffices.

Now consider a gray vertex whose edges are being checked; it's easy to find:

Lemma
For a vertex $u$ gray at timestamp $t$, if there's $uv \in E$ such that $v$ is white at timestamp $t$, then $u$ and $v$ are in the same search tree.

Proof. At timestamp $t$, let $T$ be the search tree being constructed, which is the one $u$ is in.

Assume it's at timestamp $t' \ge t$ that edge $uv$ is being checked, and whether $v$ will be enqueued into the BFS queue or pushed onto the DFS stack is being determined. Clearly $u$ is still gray and the search tree $T$ is still being constructed. If $v$ is gray or black at timestamp $t'$, then there must be $v \in T$, since $v$ changed its color during the construction of $T$; otherwise, if $v$ is white, either $v$ has already been in the BFS queue or DFS stack, or it will be enqueued or pushed due to $uv$; either case will cause $v$ to be visited before the BFS queue or the DFS stack is emptied, so $v \in T$.

These lemmas will lead to:

Theorem (White-Path Theorem for Search Trees)
Consider a search tree $T$ rooted at $r$; for any $v$, we have $v \in T$ iff there's an $rv$-white-path (in the searched graph or digraph) at timestamp $d(r)$.

Proof. The necessity part is clear: any $rv$-tree-path in $T$ is such an $rv$-white-path.

For the sufficiency part, we assume the contrary: at timestamp $d(r)$ there's an $rv$-white-path $P$, but when the construction of $T$ finishes at timestamp $f(T)$, we find $v$ is still white. Then, at timestamp $f(T)$, there's some edge $xy$ in $P$ such that $x$ is black and $y$ is white, since $r$ is black and $v$ is white; so $y \in T$ cannot be true, while $x \in T$, for $x$ changed its color during the construction of $T$. But note that at the timestamp when $x$ was gray and checking $xy$, $y$ was white, which implies $y \in T$ by the previous lemma, a contradiction to $y$'s being white at timestamp $f(T)$.

By applying this argument to $P$ iteratively, until there's no edge with a black and a white endpoint, we will reach the conclusion that $P$ is an $rv$-black-path at timestamp $f(T)$, and therefore $v \in T$.

This concludes the proof that the whole component of a graph will be visited when we search it in a BFS manner, which was required by the proof of the algorithm generating partite sets for graphs without odd cycles, in our previous text.

Shortest Tree Paths in BFS Trees

Intuitively (for programmers and CS students) speaking, one of the most important motivations to apply BFS is to find a solution with the fewest moves. In the case of search in graph theory, "the fewest moves" means the shortest path to reach a vertex, relative to the root of each search tree. In other words, we would like to show:

Assertion
In a BFS forest, every $rv$-tree-path to be found in the search tree rooted at $r$ is a shortest $rv$-path.

However, since intuition fools human eyes (and minds) indefinitely, it won't be useful unless we elevate the intuition to mathematical proofs.

To study the mathematical properties of a BFS forest, we tweak the BFS a little bit more, by integrating a level counter relative to each vertex:

Algorithm (Framework of BFS with SCT and Level Counter)
The input is a graph or a digraph $G = (V, E)$; we'll have performed a BFS with SCT and level counter when the algorithm halts:

  1. Initialize empty queue $Q$, and empty set $S$ to track enqueued vertices. Initialize the search coloring $C_0$ and timestamp $t = 0$. Initialize empty level counter $\ell$.
  2. Loop while $S \neq V$:
    1. Pick up any $r \in V \setminus S$, enqueue it into $Q$, and update
      $S \leftarrow S \cup \{r\}$, $\ell(r) \leftarrow 0$.
    2. Loop while $Q$ is not empty:
      1. Dequeue $u$ from $Q$, and visit it.
      2. Update the search coloring of $u$ to gray.
      3. Iterate for every edge $uv \in E$:
        1. If $v \notin S$, enqueue $v$ into $Q$ and update
          $S \leftarrow S \cup \{v\}$, $\ell(v) \leftarrow \ell(u) + 1$.
      4. Update the search coloring of $u$ to black.

Again, this modification does not change the general logic of a BFS, except for plugging in a level counter with respect to every vertex, and renaming the variables so that they match our proofs better.
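In code, the level counter is just one extra mapping carried along with the queue; a sketch of a single pass (the function name and representation are assumptions of this sketch):

```python
from collections import deque

def bfs_levels(adj, r):
    """One pass of BFS at r with the level counter: on return, level[v]
    is the length of the rv-tree-path for every v reachable from r."""
    level = {r: 0}
    queue = deque([r])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:            # v not enqueued yet: uv is a tree edge
                level[v] = level[u] + 1   # one level deeper than u
                queue.append(v)
    return level

print(bfs_levels({0: [1, 2], 1: [3], 2: [3], 3: []}, 0))
# {0: 0, 1: 1, 2: 1, 3: 2} -- these indeed equal the shortest distances
```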

And it's very obvious that:

Lemma
In a BFS forest, every $rv$-tree-path $P$ to be found in the search tree rooted at $r$ fulfills $\ell(v) = |P|$.
Proof. Note that the edge $uv$ that causes $v$ to be enqueued into $Q$ at step 2.2.3.1 is a tree edge, and if $\ell(u)$ keeps track of the length of the $ru$-tree-path $P_u$, then $P_u$ will be extended into the $rv$-tree-path $P_u + uv$, with $\ell(v) = \ell(u) + 1 = |P_u + uv|$. It has been true for the first queue head $r$ that $\ell(r) = 0$, and by iteratively reasoning on every vertex dequeued from $Q$, which iterates a permutation of the vertices of the tree, we prove the lemma.
Corollary
In a BFS forest, for the search tree rooted at $r$ and a vertex $v$ in it, $d_G(r, v) \le \ell(v)$.
Proof. The shortest $rv$-path must exist since $v$ is reachable from $r$, and it will not be longer than the $rv$-tree-path to be found in the tree, such that $d_G(r, v) \le \ell(v)$.

And if we are to prove $\ell(v) \le d_G(r, v)$, we will need to characterize the BFS queue $Q$:

Theorem (Characterization of the BFS Queue)

For any moment of the BFS queue $Q$ such that $Q$ is non-empty and rendered as $Q = (v_1, v_2, \ldots, v_k)$ from head to tail, we have:

$$\ell(v_1) \le \ell(v_2) \le \cdots \le \ell(v_k) \le \ell(v_1) + 1$$

Proof. An intuitive interpretation will help in understanding what we are to prove: $\ell(v_1) \le \cdots \le \ell(v_k)$ means the vertices in the queue are monotonic in their level in the search tree; $\ell(v_k) \le \ell(v_1) + 1$ means the vertices in the queue won't cross more than one span of levels.

Clearly we must reason on every pass of the BFS at $r$. At step 2.1, we have $Q = (r)$ with $\ell(r) = 0$, which trivially fulfills the claim. Assume we dequeue $v_1$ at step 2.2.1 and then enqueue vertices $u_1, \ldots, u_m$ at step 2.2.3.1; we have:

$$\ell(u_1) = \cdots = \ell(u_m) = \ell(v_1) + 1$$

And noted that $\ell(v_1) \le \ell(v_2)$ and $\ell(v_k) \le \ell(v_1) + 1$, now we have:

$$\ell(v_2) \le \cdots \le \ell(v_k) \le \ell(u_1) = \cdots = \ell(u_m) = \ell(v_1) + 1 \le \ell(v_2) + 1$$

which holds for the new queue $Q' = (v_2, \ldots, v_k, u_1, \ldots, u_m)$. By reasoning on every possible state when $Q$ is non-empty during a BFS, we prove the theorem.

Corollary
For a BFS tree and vertices $u, v$ in it, if $u$ is dequeued earlier than $v$, then $\ell(u) \le \ell(v)$.

Proof. Consider the very pass of BFS constructing the tree, and let its vertices be dequeued in the order of the list $L = (u_1, u_2, \ldots, u_n)$; since the queue is first-in-first-out, this is also the order in which they are enqueued.

Consider the iteration at step 2.2.2 in which $u_i$ is dequeued. At that moment, $u_{i+1}$ is either already in the queue behind $u_i$, in which case the previous theorem directly enforces $\ell(u_i) \le \ell(u_{i+1})$; or $u_{i+1}$ is enqueued during this very iteration, in which case $\ell(u_{i+1}) = \ell(u_i) + 1$. Either way $\ell(u_i) \le \ell(u_{i+1})$, and the level counters of the vertices in $L$ are ordered linearly: $\ell(u_1) \le \ell(u_2) \le \cdots \le \ell(u_n)$.

Just a tiny remark: given that a BFS in general can be viewed as a BFS on the implicit graph of the problem, one might clone the procedure of characterizing the BFS queue above to prove the correctness of BFS for that problem.

What the theorem and corollary above imply is even clearer when looking into our example of BFS with SCT again, where we can clearly see a BFS can fundamentally be seen as a level-order traversal of each search tree in its search forest, when expressed in the language of tree traversals.

And with the characterization of the BFS queue, we are finally able to show:

Theorem
In a BFS forest, for the search tree rooted at $r$ and a vertex $v$ in it, $\ell(v) = d_G(r, v)$.

Proof. Assume the contrary, that some $rv$-tree-path is not a shortest $rv$-path, i.e. there's an $rv$-path $P$ with $|P| < \ell(v)$.

Without loss of generality, let $v$ be such a vertex with $d_G(r, v)$ minimal, and let $u$ be the vertex preceding $v$ in a shortest $rv$-path, so that $d_G(r, u) = d_G(r, v) - 1$ and, by minimality, $\ell(u) = d_G(r, u)$. Such a minimal choice is sound, since the descending chain of distances is finite and capped at $r$ itself, for which the $rr$-path of length $0$ is a tree path and sufficiently a shortest path.

At the timestamp when the edge $uv$ is checked, either $v \notin S$, so $uv$ becomes the tree edge of $v$ and $\ell(v) = \ell(u) + 1$; or $v$ has been enqueued no later than then, in which case the characterization of the BFS queue (together with the corollary above) enforces $\ell(v) \le \ell(u) + 1$. Either way, $\ell(v) \le \ell(u) + 1 = d_G(r, v) \le \ell(v)$, so $\ell(v) = d_G(r, v)$, a contradiction to $|P| < \ell(v)$ for a path $P$ with $|P| \ge d_G(r, v)$.

Corollary
In a BFS forest, every $rv$-tree-path to be found in the search tree rooted at $r$ is a shortest $rv$-path.
Proof. It's equivalent to the theorem we've just proved.

By now, we have elevated our intuition about BFS finding the shortest paths to relatively rigorous mathematical proofs.

The shortest-tree-path property in a BFS tree also forbids the existence of certain kinds of edges:

Corollary
In a BFS forest, for the search tree rooted at $r$ and vertices $u, v$ in it with $uv \in E$, we have $\ell(v) \le \ell(u) + 1$.

Proof. Let the $ru$-tree-path be $P_u$, and the $rv$-tree-path be $P_v$, thus $|P_u| = \ell(u)$ and $|P_v| = \ell(v)$. Assume the contrary, that $\ell(v) > \ell(u) + 1$.

Due to the existence of the edge $uv$, we can see $P_u + uv$ is an $rv$-walk such that $|P_u| + 1 < |P_v|$, which can be fixed into a potentially even shorter $rv$-path, and is a contradiction to $P_v$'s being a shortest $rv$-path.

Search Subtrees and Timestamps

So in the case of search trees, things have gone quite fluently: by simply setting up checkpoints on the discover and finish timestamps, we've established the white-path theorem for search trees, to determine which search tree a given vertex will be in.

We want to move a step further now: for any given vertex $u$ and the subtree $T_u$ rooted at $u$ in a search tree $T$, can we set up appropriate checkpoints relative to $u$, and check $uv$-white-paths, such that we will know whether $v$ is in the subtree $T_u$, just like what we have done in the white-path theorem for search trees?

Well, the answer turns out to be negative for the case of BFS subtrees. Consider a BFS tree: at timestamp $d(u)$ there may well be a $uv$-white-path, which however does not secure $v$ to lie in the subtree rooted at $u$, since $v$ is possibly enqueued due to the check of another edge performed earlier, which then becomes $v$'s tree edge. On the other hand, the answer turns out to be positive for the case of DFS subtrees, which we will show later.

So what's the chasm between ensuring a vertex reachable by a white-path to lie in a search tree, and in a subtree? For me, I think that in order to ensure the vertex lies in the search tree, it suffices to ensure that the vertex is enqueued into the BFS queue or pushed onto the DFS stack, and the rest can be left to the algorithms' nature of emptying the BFS queue or DFS stack per pass with respect to search trees. However, to ensure the vertex lies in a specific subtree, a tree path must be secured by the white-path.

In fact, while the white-path fails to secure a tree path in a BFS forest, it succeeds in a DFS forest. While the counterexample above suffices to show its failure, we must come up with a mathematical proof to secure its success.

First, let's observe the stack per iteration at step 2.2.2, and relate it to the events happening at that iteration, in the framework of DFS with SCT.

For convenience, we omit the second component of each item in the observed states of the stack $T$. We can see each iteration at step 2.2.2 transforms the stack in exactly one of these three ways (the right hand side of the list is the top of the stack):

  1. $\langle \ldots, u \rangle \to \langle \ldots, u, v \rangle$: $(u, e)$ is popped, the next edge $e' = uv$ is observed, and there's $v \notin S$ in this case. So both $(u, e')$ and $(v, \varnothing)$ are pushed onto the stack. Vertex $v$ is marked gray immediately after, due to the item to be popped at step 2.2.1 in the next iteration. So we can associate the event $d(v)$ with this transition.
  2. $\langle \ldots, u \rangle \to \langle \ldots, u \rangle$: $(u, e)$ is popped, the next edge $e' = uv$ with $v \in S$ is observed. So $(u, e')$ is pushed back onto the stack, waiting for the next edge of $u$ to be checked in a later iteration.
  3. $\langle \ldots, u \rangle \to \langle \ldots \rangle$: $(u, e)$ is popped, and no next edge is observed, so $u$ is marked black at step 2.2.4 at the end of this iteration; we can associate the event $f(u)$ with this transition.
Example

Let's observe the first search tree of the DFS with SCT example during the search, omitting the stack transitions of case 2 ($\langle \ldots, u \rangle \to \langle \ldots, u \rangle$), which neither push a new vertex onto nor pop a vertex from the stack. We can see:

The observation leads to the characterization of the DFS stack:

Theorem (Characterization of the DFS Stack)

In a DFS forest, for any subtree $T_u$ rooted at $u$, of a search tree $T$, we have:

$$d(T_u) = d(u), \quad f(T_u) = f(u)$$

where $d(u), f(u)$ mark the discover and finish timestamps of vertex $u$, and $d(T_u), f(T_u)$ mark the timestamps for $T_u$ to start and finish construction.

For any vertex $v$, we have:

$$v \in T_u \iff d(u) \le d(v) < f(v) \le f(u)$$

while it suffices to check whether $d(u) \le d(v) \le f(u)$ in order to determine whether $v \in T_u$ is true.

Proof. The proof is based on the observation of the stack during DFS above.

Due to the case 1 transition ($\langle \ldots, u \rangle \to \langle \ldots, u, v \rangle$), it won't be hard to find that in each observed state of the stack, there's a tree edge connecting every two consecutive vertices. Viewing the stack as a list where the stack top is on the right hand side, it won't be hard to find there's a tree path between any two vertices on the stack; thus every vertex on the stack is a descendant of those below it.

Assume for vertex $u$ and the subtree $T_u$ rooted at $u$, there's a vertex $v \in T_u$ added when $u$ is not on the stack. Without loss of generality, for the $uv$-tree-path $P$ in $T_u$, let $u$ be the only vertex of $P$ absent from the stack; otherwise, we argue on the absent vertices closer to $u$ in $P$ first. Let the next vertex after $u$ in $P$ be $w$: there was a transition $\langle \ldots, u \rangle \to \langle \ldots, u, w \rangle$ that pushed $w$ onto the stack, and there must be a transition after it that pops $u$. However, this is impossible: since every vertex will only be pushed onto the stack once, and $w$ is still on the stack, $w$ has not been popped from the stack; and due to the last-in-first-out (LIFO) nature of a stack, there's no way to pop $u$ from the stack without popping $w$ first.

In this way, those vertices pushed onto the stack after $d(u)$ and before $f(u)$ are sufficiently descendants of $u$, and in the subtree $T_u$ rooted at $u$. On the other hand, those vertices in $T_u$ are necessarily added with $u$ on the stack, which is after $d(u)$ and before $f(u)$. The event $d(u)$ happens when $T_u$ starts its construction, and $f(u)$ when it finishes, thus $d(T_u) = d(u)$ and $f(T_u) = f(u)$, such that $v \in T_u \iff d(u) \le d(v) < f(v) \le f(u)$.

It suffices to check $d(u) \le d(v) \le f(u)$, since $v$ is above $u$ on the stack when discovered in that range, and the LIFO nature of a stack guarantees $f(v) \le f(u)$ to be true. By combining it with the trivial case of $v = u$, we prove the sufficiency of this predicate.

One will soon notice how similar it is to the lemma we used to determine whether $v$ is in the search tree $T$. So one may expect a theorem similar to the white-path theorem for search trees here, and indeed:

Theorem (White-Path Theorem for DFS)
In a DFS forest, consider a subtree $T_u$ rooted at $u$, of a search tree $T$. For any $v$, we have $v \in T_u$ iff there's a $uv$-white-path (in the searched graph or digraph) at timestamp $d(u)$.

Proof. For necessity, every $uv$-tree-path in subtree $T_u$ is such a $uv$-white-path.

For sufficiency, we use a similar argument to the white-path theorem for search trees. Assume the contrary, that there's a $uv$-white-path $P$ found at timestamp $d(u)$, and at timestamp $f(u)$ there's $xy \in P$ such that $x$ is black and $y$ is white. When $xy$ is checked, since $y$ is white then, there must be $y \notin S$, and therefore $(x, xy)$ and $(y, \varnothing)$ must be pushed onto the stack, whose LIFO nature guarantees:

$$d(u) \le d(x) < d(y) < f(y) \le f(x) \le f(u)$$

And thus $y$ must be black at timestamp $f(u)$, which is a contradiction.

Note also that there's no gray vertex in $P$ at timestamp $f(u)$: a vertex of $P$ gray at $f(u)$ would be discovered within $[d(u), f(u)]$, while the previous theorem then enforces its finish timestamp to be at most $f(u)$. By applying this argument to $P$ iteratively, until there's no edge with a black and a white endpoint, we will reach the conclusion that $P$ is a $uv$-black-path at timestamp $f(u)$, and therefore $v \in T_u$ by the previous theorem.

Corollary
In a DFS forest of a digraph, if $uv$ is a cross edge, then $d(v) < d(u)$.
Proof. Otherwise, at timestamp $d(u)$ there's a $uv$-white-path (the edge $uv$ itself) to be found, and thus $v$ is a descendant of $u$, so $uv$ can only be a tree edge or a forward edge, which is a contradiction to $uv$'s being a cross edge.

Again, looking into our example of DFS with SCT, we can see a DFS can fundamentally be seen as a pre-order traversal of each tree in its search forest, when expressed in the language of tree traversals. This is comparable to the BFS, which may be seen as a level-order traversal of the search trees instead.

Categorizing Edges by Search Colorings

The DFS is discussed much more frequently than the BFS in search algorithms. One major factor might be the absence of a white-path theorem for BFS, and another factor is the difficulty of categorizing edges in a BFS forest.

In a BFS, it won't be hard to find that only the vertex currently being visited is gray, so when inspecting an edge $uv$, either $v$ is black or $v$ is white. When $v$ is black, $uv$ can be a back edge or a cross edge; when $v$ is white, $uv$ can be a tree edge, a forward edge or a cross edge. Which is totally chaotic.

On the other hand, in a DFS forest, when inspecting edge $uv$, $v$ can be either white, black or gray, and it turns out to be feasible to categorize these edges by their colorings. Consider the DFS forest of a digraph; we can easily see:

  1. $v$ is white: $uv$ is a tree edge, since $v$ will be visited in the next iteration.
  2. $v$ is gray: $v$ has been on the stack, and $u$ must be at the stack top, which is above $v$; therefore $v$ is an ancestor of $u$ and $uv$ must be a back edge.
  3. $v$ is black: $v$ has already been visited. If $v$ is in the subtree rooted at $u$, then $uv$ is a forward edge; otherwise $uv$ is a cross edge.
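This three-way test translates directly into code. The sketch below classifies every edge of a digraph during one DFS, using the colors plus the discover timestamps to split the black case into forward versus cross; the helper names are assumptions of this sketch:

```python
WHITE, GRAY, BLACK = 0, 1, 2

def classify_edges(adj):
    """Classify each edge of a digraph as tree / back / forward / cross,
    with respect to the DFS forest the traversal itself builds."""
    color = {v: WHITE for v in adj}
    d, t, kinds = {}, 0, {}

    def go(u):
        nonlocal t
        t += 1
        d[u], color[u] = t, GRAY
        for v in adj[u]:
            if color[v] == WHITE:
                kinds[(u, v)] = "tree"
                go(v)
            elif color[v] == GRAY:
                kinds[(u, v)] = "back"      # v is a gray ancestor of u
            elif d[u] < d[v]:
                kinds[(u, v)] = "forward"   # v is a black descendant of u
            else:
                kinds[(u, v)] = "cross"     # v finished in another subtree
        color[u] = BLACK

    for r in adj:
        if color[r] == WHITE:
            go(r)
    return kinds

print(classify_edges({0: [1, 2], 1: [2, 0], 2: [], 3: [1]}))
# (0,1) and (1,2) are tree edges; (1,0) is back; (0,2) forward; (3,1) cross
```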

The situation is slightly different in the DFS forest of a graph, when $v$ is black:

Theorem
In a DFS forest of a graph, every edge is either a tree edge or a back edge.

Proof. An edge $uv$ cannot be a forward edge: assume the contrary, so $v$ is a descendant of $u$, and $uv$ is checked at $u$ after $v$ has been visited. Note that $uv$ is undirected, so it occurs in both $E(u)$ and $E(v)$, and the occurrence in $E(v)$ is checked earlier, as the first check of $uv$, which will render $uv$ a back edge.

An edge $uv$ cannot be a cross edge: assume the contrary, with $d(v) < d(u)$. Again, since $uv$ is undirected, it must also occur in $E(v)$. So there's a $vu$-white-path (the edge $vu$ itself) to be found at timestamp $d(v)$, which is a contradiction to $u$'s not being a descendant of $v$.

In this way, in a DFS forest of a graph, we have:

  1. $v$ is white: $uv$ is a tree edge.
  2. $v$ is gray: $uv$ is a back edge.
  3. $v$ is black: $uv$ is a back edge, since in this case $v$ is a descendant of $u$, and the earlier check from $v$'s side has already rendered $uv$ a back edge; the case of $uv$'s being a cross edge is impossible.

However, when case 3 happens, it must be the second time that $uv$ is checked, having been marked or reported as a back edge already, so we may simply ignore this signal.

The DFS also guarantees a back edge to be found in a DFS forest of a graph or digraph with a cycle:

Corollary
In a DFS forest of a digraph with a cycle, there's a back edge.
Proof. The case is trivial when the cycle is a loop. Assume it's not a loop; let $C = (v_1, v_2, \ldots, v_k, v_1)$ be a cycle in the digraph, such that $v_1$ is the first discovered vertex of $C$. At timestamp $d(v_1)$, there's a $v_1 v_k$-white-path along $C$, so by the white-path theorem $v_k$ is a descendant of $v_1$, and therefore $v_k v_1$ is a back edge.

The strategy we applied to digraphs is not directly applicable to graphs. In a digraph, the edge $v_k v_1$ occurs only in $E(v_k)$; but in a graph, it occurs in both $E(v_k)$ and $E(v_1)$, and the white-path theorem does not prevent $v_k v_1$ from being a tree edge. So we will need finer case-by-case discussions for a graph:

Corollary
In a DFS forest of a graph with a cycle, there's a back edge.

Proof. The case is trivial when the cycle is a loop. Assume cycle $C = (v_1, v_2, \ldots, v_k, v_1)$ is not a loop, and $v_1$ is the first discovered vertex of $C$.

When $k = 2$, the cycle consists of two parallel edges $e_1, e_2$ between $v_1$ and $v_2$; if $e_1$ is the tree edge, then $e_2$ is a back edge, and vice versa.

When $k \ge 3$, without loss of generality, we may orient $C$ such that $d(v_2) < d(v_k)$, since otherwise the reversed cycle $(v_1, v_k, v_{k-1}, \ldots, v_2, v_1)$ is a cycle of such a form. Now $v_k$ is also a descendant of $v_1$, while not a direct child of $v_1$ through the edge $v_k v_1$; so the edge $v_k v_1$ is guaranteed to be rendered a back edge now.

In this way, the cycle detection algorithm based on DFS will work, since we've guaranteed that the condition it uses to detect cycles is both sufficient and necessary.
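Since a digraph has a cycle iff its DFS forest has a back edge, and a back edge shows itself as an edge whose target is gray, the detector needs nothing but the colorings; a sketch for digraphs (the function name is an assumption of this sketch):

```python
WHITE, GRAY, BLACK = 0, 1, 2

def has_cycle(adj):
    """Return True iff the digraph has a cycle, by looking for a back
    edge (an edge with a gray target) during a DFS."""
    color = {v: WHITE for v in adj}

    def go(u):
        color[u] = GRAY
        for v in adj[u]:
            if color[v] == GRAY:        # back edge found: a cycle exists
                return True
            if color[v] == WHITE and go(v):
                return True
        color[u] = BLACK
        return False

    return any(color[r] == WHITE and go(r) for r in adj)

print(has_cycle({0: [1], 1: [2], 2: [0]}))   # True: 0 -> 1 -> 2 -> 0
print(has_cycle({0: [1], 1: [2], 2: []}))    # False: this digraph is a DAG
```

For a graph, per the discussion above, the same idea works once the tree edge back to the parent is excluded from the gray test, which is exactly the pitfall the warning later in this text addresses.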

The Topological Sort Problem

Let's look into some real-world applications of search in graph theory.

A topological sort is a linear order with a partial order embedded into it: given a partial order $\preceq$ on a set $X$, an order $\le$ on $X$ is a topological sort of it so long as:

  1. The order $\le$ is a linear order, which means any $x, y \in X$ are comparable: $x \le y$ or $y \le x$.
  2. The relation $\le$ is compatible with $\preceq$, that is, $x \preceq y \implies x \le y$.

In short, to find a topological sort means to fill in the gaps, so that incomparable pairs under $\preceq$ are now comparable under $\le$.

In graph theory, we know the connection relation in a DAG defines a partial order on $V$, and the topological sort problem in graph theory generally means to find a topological sort of this partial order.

Meanwhile, given that there're finitely many elements in $V$, it's obvious that we may put them in a list $L = (v_1, v_2, \ldots, v_n)$ such that $v_i \le v_j \iff i \le j$. So finding a topological sort of a DAG amounts to putting the vertices in $V$ into such a list $L$.

There're many ways to find a topological sort in a DAG, we are going to introduce two of them: one is intuitive, and the other is based on DFS.

Reference Counting

One of the most intuitive ways may be to apply reference counting:

Algorithm (Reference Counting for Topological Sort)
The input is a DAG $G = (V, E)$; the output is a list $L$ which represents a topological sort of $G$:

  1. Initialize zero reference counters $c$, such that $c(v) = 0$ for every $v \in V$.
  2. Iterate for every $uv \in E$: update $c(v) \leftarrow c(v) + 1$.
  3. Initialize the empty list $L$, and the empty queue $Q$.
  4. Iterate for $v \in V$:
    1. If $c(v) = 0$, enqueue $v$ into $Q$.
  5. Loop until queue $Q$ is empty:
    1. Dequeue $u$ from $Q$, and append $u$ to the tail of $L$.
    2. Iterate for every $uv \in E$:
      1. Do update $c(v) \leftarrow c(v) - 1$.
      2. Enqueue $v$ into $Q$ if $c(v) = 0$.
  6. Halt with $L$ as the topological sort.
Theorem
The reference counting algorithm yields a topological sort.

Proof. For convenience, let $G - L$ denote the subgraph of $G$ induced by removing the vertices in $L$ from $G$.

The reference counter $c$ actually keeps track of the in-degrees in $G - L$ while $L$ is being updated. When $u$ is appended to $L$, it's guaranteed that $c(u) = 0$, i.e. all in-neighbors of $u$ occur in $L$ before $u$.

Every vertex in $V$ must be enqueued and dequeued exactly once. Otherwise, assume $u$ is never enqueued; we claim that some in-neighbor of $u$ is never enqueued either. Since if $u$ had no in-edges at all, it would be enqueued at step 4.1; and if all of its in-neighbors were enqueued, its reference counter would be reduced to $0$ by some last in-neighbor $w$, and in the iteration of $w$ at step 5, step 5.2.2 would observe $c(u) = 0$ and cause $u$ to be enqueued into $Q$ then.

However, if every never-enqueued vertex has a never-enqueued in-neighbor, the subgraph induced by the never-enqueued vertices has no vertex of in-degree $0$, and is not a DAG by the argument of the maximal path, which is a contradiction to its being a DAG for being a subgraph of the DAG $G$. In this way, every vertex in $V$ ends up in $L$, in an order compatible with the connection relation; thus $L$ is a topological sort.

Theorem
The reference counting algorithm runs in time complexity $O(|V| + |E|)$ and space complexity $O(|V|)$.

Proof. Initializing the in-degrees in $c$ at step 2 requires a full scan of the edges, which is done in $O(|E|)$ (actually $O(|V| + |E|)$ in any algorithmic representation we've known so far, but it won't affect the outcome). Scanning the initial vertices at step 4 and appending the vertices at step 5 are all done in $O(|V|)$ in total, with the queue implemented as a linked list, while the counter updates at step 5.2 accumulate to $O(|E|)$. Thus the eventual time complexity is $O(|V| + |E|)$.

For the space complexity, the counter $c$ can be implemented as an array of length $|V|$. There're $|V|$ elements in the output $L$, and at most $|V|$ elements in the queue $Q$. Thus the eventual space complexity is $O(|V|)$.
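A direct Python transcription of the reference counting algorithm (often known as Kahn's algorithm) might look as follows; the dict-of-lists input is an assumption of this sketch:

```python
from collections import deque

def toposort_refcount(adj):
    """Reference counting: returns a list L that is a topological sort
    of the input DAG."""
    indeg = {v: 0 for v in adj}          # step 1: zero reference counters
    for u in adj:                        # step 2: count incoming edges
        for v in adj[u]:
            indeg[v] += 1
    order = []
    queue = deque(v for v in adj if indeg[v] == 0)   # step 4
    while queue:                         # step 5
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:                 # removing u releases its out-neighbors
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order                         # contains all of V iff G is a DAG

print(toposort_refcount({0: [1, 2], 1: [3], 2: [3], 3: []}))  # [0, 1, 2, 3]
```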

DFS-Based

The topological sort problem can also be solved by performing a DFS on the DAG:

Algorithm (Topological Sort by DFS)
The input is a DAG $G = (V, E)$; the output is a list $L$ which represents a topological sort of $G$:

  1. Initialize the empty list $L$.
  2. Perform the framework of DFS, with some tweaks:
    1. When we finish visiting $u$, append $u$ to $L$.
  3. Reverse the ordering of the list $L$.
  4. Halt with $L$ as the topological sort.

Simple as it is, this one is much more mind-blowing when one thinks about how it works.

Intuitively speaking, one can quickly find that the list $L$ records a post-order traversal of the DFS forest, and we reverse this post-order traversal to represent the topological sort of the DAG. So if the algorithm is correct, there must be some kind of connection between the post-order traversal and the topological sort that can be proved mathematically, and we need to find it out.

First, let's see how an edge in a DAG affects the finish timestamps of its endpoints:

Lemma
In a DFS forest of a DAG, if $uv \in E$, then $f(u) > f(v)$.

Proof. In the search forest of a DAG, any edge can only be a tree edge, a forward edge or a cross edge, since being a back edge would sufficiently render the digraph to have a cycle, which is a contradiction.

When $uv$ is a tree edge or a forward edge, clearly $v$ is in the subtree of $u$, and thus $f(v) < f(u)$.

When $uv$ is a cross edge, we know there's $d(v) < d(u)$, and there can't be $f(u) < f(v)$, since otherwise $d(v) < d(u) < f(u) < f(v)$, so $u$ would be in the subtree of $v$ and $uv$ would be a back edge, which is a contradiction. So in this case it's only $f(v) < f(u)$.

In whatever case, we have $f(u) > f(v)$.

Why should we care about the finish timestamps? Well, note that in the post-order traversal, we append $u$ to the list $L$ at step 2.1 when we finish visiting $u$, which happens at timestamp $f(u)$. So knowing the ordering of the finish timestamps suffices to infer the ordering of the vertices in the list $L$. And indeed, there's:

Theorem
The topological sort algorithm by DFS yields a topological sort.

Proof. By the previous lemma, if $uv \in E$, then $f(u) > f(v)$, so $u$ is appended to $L$ after $v$ at step 2.1. Thus, after reversing $L$ at step 3, $u$ occurs before $v$.

For any $u$ connected to $v$, just take a $uv$-path and categorize the edges in it: applying the reasoning above edge by edge along the path, we will see $u$ occurs before $v$ in $L$. Therefore the output list is compatible with the connection relation, and thus a topological sort.

Theorem
The topological sort algorithm by DFS runs in time complexity $O(|V| + |E|)$ and space complexity $O(|V|)$.

Proof. The DFS requires time complexity $O(|V| + |E|)$, and reversing the list requires time complexity $O(|V|)$, thus the eventual time complexity is $O(|V| + |E|)$.

The DFS requires space complexity $O(|V|)$, and the allocated list requires space complexity $O(|V|)$, thus the eventual space complexity is $O(|V|)$.

So we can see the DFS-based algorithm is not much harder than the reference counting one. And in practice, the DFS-based one is actually simpler to implement, as it requires just a DFS plus reversing a list.
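Indeed, the whole DFS-based algorithm fits in a dozen lines; a sketch:

```python
def toposort_dfs(adj):
    """Topological sort by DFS: append each vertex when it finishes
    (a post-order traversal of the DFS forest), then reverse the list."""
    visited, post = set(), []

    def go(u):
        visited.add(u)
        for v in adj[u]:
            if v not in visited:
                go(v)
        post.append(u)            # finish visiting u: append it to the list

    for r in adj:
        if r not in visited:
            go(r)
    return post[::-1]             # reverse the post-order

print(toposort_dfs({0: [1, 2], 1: [3], 2: [3], 3: []}))  # [0, 2, 1, 3]
```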

Kosaraju's SCC Algorithm

The idea of the topological sort algorithm by DFS can be further extended to finding SCCs in digraphs.

First, consider two distinct SCCs $C_1, C_2$ of digraph $D$, and write $C_1 \rightsquigarrow C_2$ when the vertices of $C_1$ reach those of $C_2$. Clearly we have:

$$C_1 \rightsquigarrow C_2 \implies C_2 \not\rightsquigarrow C_1$$

since otherwise the two would merge into a single SCC. And after performing a DFS on digraph $D$, we can see:

Lemma
In a DFS forest of digraph $D$, if $C_1 \rightsquigarrow C_2$, then $f(C_1) > f(C_2)$, where $f(C) = \max_{v \in C} f(v)$.

Proof. Let $x$ be the first discovered vertex of $C_1 \cup C_2$.

If $x \in C_2$, note that since the condensation of $D$ is a DAG, there's no $xv$-path, let alone an $xv$-white-path, for $v \in C_1$ at timestamp $d(x)$. Thus the subtree rooted at $x$, which contains all of $C_2$, and the subtrees containing the vertices of $C_1$ are disjoint, with the latter constructed entirely after the former finishes, and we have:

$$f(C_2) = f(x) < f(C_1)$$

If $x \in C_1$, note that for any vertex $v \in C_1 \cup C_2$, there's an $xv$-white-path at timestamp $d(x)$, so $v$ is a descendant of $x$. Thus every vertex of $C_2$ lies inside the subtree rooted at $x$, and we have:

$$f(C_2) < f(x) = f(C_1)$$

In whatever case, we are guaranteed to have $f(C_1) > f(C_2)$.

One will soon realize how similar this lemma is to the previous lemma that is crucial to the ordering of vertices in the topological sort algorithm by DFS.

In this way, after performing a DFS on digraph $D$, appending vertex $u$ to a list $L$ when we finish visiting $u$, and reversing $L$ after the DFS, we obtain a list $L$ that contains a topological sort of the SCCs of $D$. That is, if $C_2$ occurs after $C_1$ in the topological sort of the SCCs, then the representative of $C_1$ (its vertex of maximal finish timestamp) occurs in $L$ before that of $C_2$.

Note that the list $L$ is a mixture of the representatives and the common vertices from all the SCCs. A simple idea is to iterate $L$ for the representative of each SCC, and mark the common vertices that are in the same SCC as the representative. This is sufficiently doable when we are able to throttle vertices from other SCCs during the process of marking, which is accomplished by exploiting the transposition $D^T$ of digraph $D$:

Corollary
In a DFS forest of digraph $D$, if $f(C_1) > f(C_2)$, then $C_1 \not\rightsquigarrow C_2$ in the transposition $D^T$.
Proof. The contrapositive of the previous lemma is: if $f(C_1) \le f(C_2)$, then $C_1 \not\rightsquigarrow C_2$. Since it's impossible for $f(C_1) = f(C_2)$, the condition becomes $f(C_1) < f(C_2)$. We know $C_1 \rightsquigarrow C_2$ in $D^T$ iff $C_2 \rightsquigarrow C_1$ in $D$; by swapping the positions of the variables $C_1$ and $C_2$, we get what we are to prove.

Actually, it's quite intuitive if one remembers that $u \rightsquigarrow v$ in $D$ iff $v \rightsquigarrow u$ in $D^T$, which is proved soon after the introduction of the concept of transposition.

In this way, if we iterate the representatives in the order of the topological sort of the SCCs, and for each representative we mark the vertices reachable through a DFS on $D^T$, given that transposition will not change strong connectivity but will throttle vertices from other components, we will be marking exactly one SCC at a time.

And this gives birth to Kosaraju's SCC algorithm:

Algorithm (Kosaraju)
The input is a digraph $D = (V, E)$; the output is a mapping $scc$ from each vertex to the representative of its SCC, such that $u$ and $v$ are strongly connected iff $scc(u) = scc(v)$:

  1. Initialize an empty stack $T$.
  2. Perform the framework of DFS on $D$, with some tweaks:
    1. When we finish visiting $u$, push $u$ onto stack $T$.
  3. Initialize the map $scc$, as $scc(v) = \varnothing$ for every $v \in V$, and evaluate the transposition $D^T$ of the input $D$.
  4. Initialize an empty set $S$ for keeping track of visited vertices in DFS.
  5. Loop until stack $T$ is empty:
    1. Pop $u$ from stack $T$.
    2. If $scc(u) = \varnothing$:
      1. Perform one pass of DFS on $D^T$ at vertex $u$, with the visited vertices stored in $S$ and some tweaks:
        1. For each visited $v$, update $scc(v) \leftarrow u$.
  6. Halt with $scc$ as the result.

Here, we use a stack $T$ instead of a list $L$, since appending the vertices to the list $L$, reversing $L$, and iterating the vertices in $L$ sequentially is obviously equivalent to pushing all the vertices onto stack $T$ and then popping them from $T$, one by one, in a LIFO manner.
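Putting the two phases together, here is a compact sketch of Kosaraju's algorithm under the same assumed representation, with recursion standing in for the DFS framework:

```python
def kosaraju(adj):
    """Kosaraju's SCC algorithm: returns a mapping scc such that
    scc[u] == scc[v] iff u and v are strongly connected."""
    visited, stack = set(), []

    def order(u):                 # phase 1: DFS on D, push on finish
        visited.add(u)
        for v in adj[u]:
            if v not in visited:
                order(v)
        stack.append(u)

    for r in adj:
        if r not in visited:
            order(r)

    radj = {v: [] for v in adj}   # the transposition D^T
    for u in adj:
        for v in adj[u]:
            radj[v].append(u)

    scc = {}

    def mark(u, rep):             # phase 2: one marking pass on D^T
        scc[u] = rep
        for v in radj[u]:
            if v not in scc:
                mark(v, rep)

    while stack:
        u = stack.pop()           # decreasing finish timestamps
        if u not in scc:
            mark(u, u)            # u is the representative of a new SCC
    return scc

print(kosaraju({0: [1], 1: [2], 2: [0, 3], 3: []}))
# vertices 0, 1, 2 share the representative 0; vertex 3 is its own SCC
```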

Theorem
Kosaraju's SCC algorithm runs in time complexity $O(|V| + |E|)$ and space complexity $O(|V| + |E|)$.

Proof. For the time complexity, pushing all vertices in $V$ onto stack $T$ by DFS takes $O(|V| + |E|)$, evaluating the transposition may take $O(|V| + |E|)$, and the eventual popping-and-marking iterations at step 5 accumulate to $O(|V| + |E|)$. Thus the eventual time complexity is $O(|V| + |E|)$.

For the space complexity, the stack $T$ and the mapping $scc$ take $O(|V|)$; however, creating the algorithmic representation of the transposition $D^T$ may take space $O(|V| + |E|)$. Thus the eventual space complexity is $O(|V| + |E|)$.

In this way, assigning vertices to their SCCs in a digraph is a problem that can be accomplished by a DFS-based algorithm, within linear time complexity proportional to the magnitude of vertices and edges.

Tarjan's Cut Vertex / Edge Algorithms

The DFS can also be used to find cut vertices and cut edges in a graph. Without loss of generality, let the graph be connected, or we can study the subgraphs induced by every connected component. Clearly all vertices will then be in the same search tree.

Let's consider the case of the cut vertex first:

Lemma
In a DFS forest of a graph, a vertex $v$ is a cut vertex iff one of the following conditions is true:

  1. Vertex $v$ is the root of a search tree, and has $2$ or more children.
  2. Vertex $v$ is not the root of a search tree, and there's a direct child $w$ of $v$, such that from the subtree rooted at $w$, there's no back edge to an ancestor of $v$ (excluding $v$).

Proof. It won't be hard to see both conditions are sufficient: removing $v$ in case 1 disconnects its children, and in case 2 disconnects the subtree rooted at $w$ from the other part of the search tree. So we just need to prove the necessity.

Assume $v$ is a cut vertex such that $x$ and $y$ are separated after the removal of $v$. Let the children of $v$ be $w_1, \ldots, w_k$; we partition the search tree that $v$ is in into vertex sets $V_0, V_1, \ldots, V_k$, so that $V_0$ is the vertices of the search tree after removing $v$ and the subtrees rooted at its children, and $V_i$ is the vertices of the subtree rooted at $w_i$. Let's perform a case-by-case discussion about where $x$ and $y$ should be.

First, if $x \in V_0$ and $y \in V_i$, then there's no back edge from $V_i$ to $V_0$, which is exactly case 2. And it seems like $x \in V_i$ and $y \in V_j$ (for $i \neq j$) must be an independent case; however, we will show there must be $V_0 = \varnothing$ in this case.

Assume the contrary, that $V_0 \neq \varnothing$. Then if there's no back edge from $V_i$ to $V_0$, it suffices to check the former case, and so it does for $V_j$. If both $V_i$ and $V_j$ have back edges to $V_0$, then $x$ and $y$ are still connected through $V_0$ after the removal of $v$, and will not be separated, which is a contradiction. In this way, there's no way for this case to be independent unless $V_0 = \varnothing$, so $v$ must be the root of the search tree (with $2$ or more children), which is case 1.

Now we conclude the proof of the necessity of the two conditions.

Then we will also consider the case of cut edges:

Lemma
In a DFS forest of a graph, let $u$ be an ancestor of $v$; edge $uv$ is a cut edge iff $uv$ is a tree edge and there's no back edge from the subtree rooted at $v$ to the ancestors of $v$ (including $u$).

Proof. Again, it's easy to see the condition is sufficient, and we just need to show it's also necessary.

First, $uv$ cannot be a back edge, since otherwise it suffices to induce a cycle containing $uv$, and an edge on a cycle is never a cut edge. And if $uv$ is a tree edge, then there must not be a back edge from inside the subtree rooted at $v$ to an ancestor of $v$. Otherwise, for such a back edge $wz$ with $w$ inside the subtree and $z$ an ancestor of $v$, the $zw$-tree-path passes through $uv$, so the tree path plus $wz$ forms a cycle containing $uv$, and $uv$ is not a cut edge then, which is a contradiction.

As we can see, no matter whether we would like to detect cut vertices or cut edges in a graph by DFS, we must be capable of detecting whether there's a back edge from the subtree rooted at a specific vertex $v$, to an ancestor of $v$ (including or excluding $v$'s parent, depending on the problem). Clearly this can't be accomplished without tweaking the framework of DFS; the problem is what kinds of tweaks make sense.

Luckily, the computer scientist Robert Tarjan showed us that it suffices to introduce a properly maintained timestamp:

Definition (Lowest Timestamp in Cut Vertex / Edge Problems)

In a DFS forest of a graph, the lowest timestamp $low(v)$ can be used to infer, for each $v$, whether there's a back edge from inside the subtree rooted at $v$ to an ancestor of $v$. It's book-kept in the following ways:

  1. When entering the subtree rooted at $v$, we initialize $low(v) \leftarrow d(v)$.
  2. For each edge $vw$ visited, we do:
    1. When $vw$ is a back edge, update $low(v) \leftarrow \min(low(v), d(w))$.
    2. When $vw$ is a tree edge, after returning from the subtree rooted at $w$, update $low(v) \leftarrow \min(low(v), low(w))$.
  3. When exiting the subtree rooted at $v$, by checking whether there's $low(v) < d(v)$, we know whether there's a back edge from inside the subtree rooted at $v$, to an ancestor of $v$ (excluding $v$).

Note that at step 2.1, we need to fetch the discover timestamp $d(w)$ of the target endpoint of a back edge; therefore, it's also required to book-keep the discover timestamp of each vertex.

Theorem
The lowest timestamp fulfills its purpose.

Proof. The necessity is obvious: if there's a back edge $wz$ from inside the subtree rooted at $v$ to an ancestor $z$ of $v$ (excluding $v$), we will see $d(z) < d(v)$ being propagated up into $low(v)$ eventually, so that $low(v) < d(v)$ can be seen.

For the sufficiency, when $low(v) < d(v)$ is seen, there must be a vertex $z$ with $d(z) = low(v) < d(v)$, such that $wz$ is a back edge for some $w$ in the subtree rooted at $v$. Consider the tree path from the root of the search tree to $w$: both $z$ and $v$ lie on that tree path, and $z$ occurs earlier than $v$; thus $z$ is an ancestor of $v$.

So we can integrate the lowest timestamp into the framework of DFS, and implement the algorithms for finding cut vertices and cut edges:

Algorithm (Tarjan's Cut Vertex Algorithm)
The input is a graph $G = (V, E)$; the output is a vertex set $A$ of the cut vertices in $G$:

  1. Initialize empty set $A$, and empty maps $d$ and $low$.
  2. Perform the framework of DFS with SCT on $G$, with some tweaks:
    1. For each pass of DFS at $r$, if $r$ has $2$ or more direct children, then update $A \leftarrow A \cup \{r\}$.
    2. When entering the subtree rooted at $v$, update $low(v) \leftarrow d(v)$.
    3. When checking a back edge $vw$, update
      $low(v) \leftarrow \min(low(v), d(w))$.
    4. When returning from a subtree rooted at $w$, a direct child of $v$ (through a tree edge):
      1. Update $low(v) \leftarrow \min(low(v), low(w))$.
      2. If $v$ is not the root of a search tree and $low(w) \ge d(v)$ (a back edge to $v$ itself is allowed), then update $A \leftarrow A \cup \{v\}$.
  3. Halt with $A$ being the set of cut vertices in $G$.
Algorithm (Tarjan's Cut Edge Algorithm)
The input is a graph $G = (V, E)$; the output is an edge set $B$ of the cut edges in $G$:

  1. Initialize empty set $B$, and empty maps $d$ and $low$.
  2. Perform the framework of DFS with SCT on $G$, with some tweaks:
    1. When entering the subtree rooted at $v$, update $low(v) \leftarrow d(v)$.
    2. When checking a back edge $vw$, update
      $low(v) \leftarrow \min(low(v), d(w))$.
    3. When returning from a subtree rooted at $w$, a direct child of $v$, through the tree edge $vw$:
      1. Update $low(v) \leftarrow \min(low(v), low(w))$.
      2. If $low(w) = d(w)$ (mind the vertex being checked), then update $B \leftarrow B \cup \{vw\}$.
  3. Halt with $B$ being the set of cut edges in $G$.
Warning (Handling Tree Edges in DFS of a Graph)

When performing DFS on a graph, one had better be super careful about the fact that an edge $uv$ may occur in both $E(u)$ and $E(v)$.

Failing to handle this situation will lead to a bug: after entering $v$ through $uv$ as the tree edge, when inspecting $E(v)$, if we don't know $uv$ has been used as the tree edge, we will wrongly take $vu$ for a back edge, due to $u$'s being gray.

Empirically speaking, when implementing a DFS algorithm to search a graph, I would recommend one to keep track of additional information, of through which tree edge we entered this vertex, or which vertex is the parent (in the case of no multiple edges), as an extra argument or a local state. Then when checking edges, we may use it to ignore the tree edge, or the edge to the parent.
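Following that advice, here is a sketch implementing both algorithms in one traversal, with the parent-skipping trick built in (suitable for simple graphs; a multigraph would need edge identifiers instead of the parent vertex):

```python
def cut_vertices_and_edges(adj):
    """Tarjan's cut vertex / cut edge algorithms on a simple undirected
    graph, given as a symmetric adjacency-list dict."""
    d, low = {}, {}
    cut_vertices, cut_edges = set(), set()
    t = 0

    def go(u, parent):
        nonlocal t
        t += 1
        d[u] = low[u] = t                  # entering the subtree rooted at u
        children = 0
        for v in adj[u]:
            if v == parent:                # ignore the edge we entered through
                continue
            if v in d:                     # back edge (or its second sighting)
                low[u] = min(low[u], d[v])
            else:                          # tree edge: recurse, then fold low
                children += 1
                go(v, u)
                low[u] = min(low[u], low[v])
                if parent is not None and low[v] >= d[u]:
                    cut_vertices.add(u)    # subtree of v can't escape above u
                if low[v] == d[v]:
                    cut_edges.add((u, v))  # subtree of v can't escape at all
        if parent is None and children >= 2:
            cut_vertices.add(u)            # a root with 2 or more children

    for r in adj:
        if r not in d:
            go(r, None)                    # one pass of DFS at the root r
    return cut_vertices, cut_edges

# A triangle 0-1-2 with a pendant edge 2-3: vertex 2 and edge (2,3) are cuts.
g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(cut_vertices_and_edges(g))           # ({2}, {(2, 3)})
```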

In this way, the problems of finding cut vertices and cut edges can be solved efficiently by DFS-based algorithms, within the time complexity of $O(|V| + |E|)$.

Tarjan's SCC Algorithm

The idea of using the lowest timestamp in DFS can also be carried over to the problem of finding SCCs in a digraph:

  1. By the white-path theorem for DFS, we know every vertex of an SCC $C$ lies in the subtree rooted at $r$, where $r$ is the first discovered vertex of $C$.
  2. Although the vertices of another SCC $C'$ reachable from $C$ may also lie in the subtree rooted at $r$, we just need to know that there's no edge escaping from inside the subtree rooted at $r'$, the first discovered vertex of $C'$, to an unassigned vertex discovered earlier, when exiting $r'$, in order to throttle $C'$ from $C$.
  3. Meanwhile, there's a lemma we've derived while studying Kosaraju's algorithm: for two distinct SCCs with $C_1 \rightsquigarrow C_2$, we know $f(C_1) > f(C_2)$. So the subtrees of the other SCCs nested inside the subtree of $C$ finish earlier, and after picking all the subtrees of the other SCCs off the subtree of $C$, we will eventually be able to identify the vertices that are indeed in $C$; thus we are able to identify all the subtrees of SCCs in a digraph.

We will show how this can be accomplished in this section.

In a DFS forest of a graph, an edge going out of a subtree can only be a back edge; however, in a DFS forest of a digraph, an edge going out of a subtree can be a back edge or a cross edge. So we will need to handle the cross edges properly if the algorithm is to be correct. And indeed, handling the cross edges is the hardest part:

Example

Consider the following search forest of a digraph:

As we can see, there're 3 SCCs and 4 cross edges inside. One cross edge is crucial, since it propagates the presence of a back edge from one subtree into another; a second cross edge is also crucial for the same reason. However, the remaining two are redundant.

So the problem becomes: for every cross edge we encounter, based on what standard can we judge whether it is crucial or not?

There might be many answers to this question, but not all of them are computationally helpful. For example, a naive (but sound) approach is to do one pass of DFS at the target endpoint of the cross edge, and see the deepest element on the stack, i.e. the highest ancestor (closest to the search tree root) of the source endpoint, that it can hit. However, such a naive approach is prone to deliberately constructed DFS forests maximizing the number of edges it must visit. Doing so will result in the algorithm running in a time complexity as bad as $O(|E| \cdot (|V| + |E|))$, which is much slower than Kosaraju's SCC algorithm, and not worth considering.

Meanwhile, from the example above we can see that a candidate standard is not even worth considering if it can't gather sufficient information within time complexity $O(1)$ per cross edge, since otherwise the algorithm can't run within time complexity $O(|V| + |E|)$, and is slower than the Kosaraju's SCC algorithm we've fully studied.

In fact, the sufficiently workable standard in Tarjan's SCC algorithm is organically combined with the other parts of the algorithm, like pieces of a jigsaw puzzle interlocked with each other, so I find it hard to come up with an easily inspectable standard like the one in the cut vertex / edge problems. So I decide to take out these pieces of the jigsaw puzzle one by one, to let you see how they are combined into a workable version:

Definition (Lowest Timestamp in SCC Problem)
In a DFS forest of a digraph, the lowest timestamp $low(v)$ can be used to infer, for each $v$, whether there's an edge from inside the subtree rooted at $v$, to a target endpoint $w$ such that $d(w) < d(v)$ but $w$ has not been assigned to an SCC yet, at the moment of exiting the subtree rooted at $v$. It's book-kept in the following ways:

  1. When entering the subtree rooted at $v$, we initialize $low(v) \leftarrow d(v)$.
  2. For each edge $vw$ visited, we do:
    1. When $vw$ is a back edge, update $low(v) \leftarrow \min(low(v), d(w))$.
    2. When $vw$ is a cross edge such that $w$ has not been assigned to an SCC yet, update $low(v) \leftarrow \min(low(v), d(w))$.
    3. When $vw$ is a tree edge, after returning from the subtree rooted at $w$, update $low(v) \leftarrow \min(low(v), low(w))$.
  3. When exiting the subtree rooted at $v$, by checking whether there's $low(v) < d(v)$, we know whether there's an edge from inside the subtree rooted at $v$, to a vertex $w$ such that $d(w) < d(v)$ but $w$ has not been assigned to an SCC yet.

Please notice when we inspect a back edge to at step 2.1, we don't check whether it has been assigned to an SCC yet, since principly speaking, we can't determine which SCC is in, without knowing which an ancestor of is reachable from the subtree of .

Now we obtain the first jigsaw piece, which provides a bizarre piece of information: the lowest timestamp $\mathrm{low}(v)$, pointing to some vertex $w$ that is discovered prior to $v$ but has not been assigned to an SCC yet. However, we clearly don't know where this piece can be placed until we know what the information "already assigned to an SCC or not" is good for.

Here comes the second jigsaw piece:

Assertion (The SCC Algorithm's Presumption)
We are designing an SCC algorithm that has the following properties:

  1. The algorithm is based on DFS.
  2. When exiting a subtree rooted at $v$, if there's $\mathrm{low}(v) = d(v)$, we assign all vertices under the subtree rooted at $v$ that have not been assigned to an SCC yet, to an SCC with $v$ as the representative (see the sketch below).
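
A deliberately naive sketch of the presumption (my own construction, not the final algorithm): each call returns the vertices of its subtree that are still unassigned, so an exiting vertex $v$ with $\mathrm{low}(v) = d(v)$ can assign them all at once. The list-juggling can degenerate to quadratic time, which is exactly the inefficiency the stack-based refinement at the end of this section removes:

```python
def dfs_presumption(v, adj, disc, low, scc, timer):
    """DFS obeying the presumption; scc[w] == -1 means 'not assigned yet'."""
    disc[v] = low[v] = timer[0]
    timer[0] += 1
    pending = [v]                          # unassigned vertices in v's subtree
    for w in adj[v]:
        if disc[w] == -1:                  # tree edge
            pending += dfs_presumption(w, adj, disc, low, scc, timer)
            low[v] = min(low[v], low[w])
        elif scc[w] == -1:                 # back/cross edge, target unassigned
            low[v] = min(low[v], disc[w])
    if low[v] == disc[v]:                  # the presumption's trigger
        for w in pending:
            scc[w] = v                     # v becomes the representative
        pending = []
    return pending                         # leftovers bubble up to ancestors
```
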
Lemma
Any algorithm designed under the SCC algorithm's presumption assigns every vertex to an SCC when it halts.

Proof. Assume the contrary: there's a vertex $u$ in the search tree rooted at $r$ that is not assigned to an SCC after the tree is traversed. If so, upon exiting this search tree, we can see there's $\mathrm{low}(r) < d(r)$, witnessed by some $w$ outside this search tree, which must be reached via a cross edge with $w$ as the target endpoint.

Given that there're finitely many vertices in a graph, we may recursively trace such search trees, rooted at some $r'$ with $d(r') < d(r)$ and reached via a cross edge, until there's no more cross edge to follow. Eventually we reach a search tree rooted at some $r^*$, such that there's no cross edge leaving it for another search tree. In this way, upon exiting $r^*$, we will see $\mathrm{low}(r^*) = d(r^*)$, and thus every vertex under $r^*$ must be assigned to an SCC, which contradicts the unassigned vertex we traced into this tree.

Unwinding the logic above, we see that every vertex must be assigned to an SCC before we exit the search tree it is in.

The second piece is just what we demanded at the beginning of this section, with a statement of how the lowest timestamp is to be used attached to it.

Here comes the third jigsaw piece:

Lemma
Under the SCC algorithm's presumption, if $u$ is in the SCC represented by $r$ (that is, $r$ is the first-discovered vertex of the SCC) such that $u \ne r$, then $u$ is not assigned to an SCC before exiting the subtree rooted at $r$.

Proof. Clearly $u$ is in the subtree rooted at $r$, and $u$ and $r$ are strongly connected. Let a $u$-$r$ path be $P = (u = v_k, v_{k-1}, \ldots, v_1, v_0 = r)$. Noted that by the white-path theorem for DFS, every vertex in $P$ must be a descendant of $r$.

Clearly the last edge of the path must be a back edge of the form $(v_1, r)$, and given that $d(r)$ will be seen in $\mathrm{low}(v_1)$ when exiting the subtree rooted at $v_1$, and propagated to all of its ancestors up to $r$ (so every proper ancestor $a$ of $v_1$ below $r$ exits with $\mathrm{low}(a) \le d(r) < d(a)$), clearly $v_1$ will not be assigned to an SCC before exiting the subtree rooted at $r$.

And clearly $P$ can be constructed by prepending edges $(v_{i+1}, v_i)$ one at a time, in which we may inductively assume that $v_i$ will not be assigned to an SCC before exiting the subtree rooted at $r$. If so, we categorize $(v_{i+1}, v_i)$ and see what happens in each case.

If $(v_{i+1}, v_i)$ is a tree edge or a forward edge, clearly $v_i$ is a descendant of $v_{i+1}$, and if $v_{i+1}$ is assigned to an SCC before exiting the subtree rooted at $r$, then $v_i$ will also be assigned to that SCC (or have been assigned even earlier), which is a contradiction.

If $(v_{i+1}, v_i)$ is a back edge or a cross edge, assume there's a proper ancestor $a$ of $v_{i+1}$ such that $\mathrm{low}(a) = d(a)$ upon exiting the subtree rooted at $a$, doing so before the subtree rooted at $r$ is exited (so that $a$ is the one assigning $v_{i+1}$ to an SCC); we have:

  1. $d(v_i) < d(v_{i+1})$, by the property of a back edge or a cross edge in a DFS forest.
  2. $\mathrm{low}(v_{i+1}) \le d(v_i)$ after $(v_{i+1}, v_i)$ is visited, by the way we book-keep the lowest timestamp (noted that $v_i$ is not assigned to an SCC before exiting the subtree rooted at $r$, by assumption).
  3. $\mathrm{low}(a) \le d(v_i)$, after $\mathrm{low}(v_{i+1})$ is propagated to $a$, again by the way we book-keep the lowest timestamp.
  4. $d(a) < d(v_{i+1}) < f(a)$, by $v_{i+1}$ being a descendant of $a$.

And by putting all the pieces together, we will eventually see $d(a) = \mathrm{low}(a) \le d(v_i) < d(v_{i+1}) < f(a)$, i.e. $v_i$ is discovered within the active interval of $a$.

So $v_i$ is a descendant of $a$ now, and assigning the SCC to the subtree rooted at $a$ will inevitably touch $v_i$, which is a contradiction.

By constructing the path through prepending edges, we will eventually reach $u = v_k$ itself, concluding that $u$ will not be assigned to an SCC before exiting the subtree rooted at $r$.

The third piece shows what we will see when we run the algorithm over an SCC, which serves as a necessary condition and gives us some intuition.

And finally, here comes the fourth and last jigsaw piece:

Algorithm (Backtracking Path Algorithm)
The input is a vertex $u$, alongside the runtime information of an algorithm designed under the SCC algorithm's presumption; the output is the backtracking path of $u$, which is a $u$-$w$ path $P$ such that $\mathrm{low}(w) = d(w)$:

  1. Initialize an empty walk $P$ and set the current vertex $c \leftarrow u$.
  2. Loop until $\mathrm{low}(c) = d(c)$:
    1. Let $w$ be the vertex that assigns $\mathrm{low}(c)$ to $c$ upon exiting the subtree rooted at $c$, for edge $(c, w)$.
    2. Append to $P$ the edge $(c, w)$, and update $c \leftarrow w$.
  3. Halt with $P$ as the backtracking path of $u$.

Proof. We map the vertices in $V$ to a partially ordered set by the mapping $\varphi$ defined by $\varphi(v) = (\mathrm{low}(v), f(v))$, where $f$ is the finish timestamp, and we compare by the lexical order: $(a_1, b_1) < (a_2, b_2)$ iff $a_1 < a_2$, or $a_1 = a_2$ and $b_1 < b_2$.

Consider every iteration of step 2: there must be $\varphi(w) < \varphi(c)$. When $(c, w)$ is a tree edge, $\mathrm{low}(w)$ is assigned to $\mathrm{low}(c)$; there must be $f(w) < f(c)$ and thus $\varphi(w) < \varphi(c)$. When $(c, w)$ is a back edge or a cross edge, $d(w)$ is assigned to $\mathrm{low}(c)$; noted that either $\mathrm{low}(w) = d(w)$ and the current iteration is the last iteration of step 2, or $\mathrm{low}(w) < d(w) = \mathrm{low}(c)$, in which case there's $\varphi(w) < \varphi(c)$.

Thus consider the vertices in the $u$-$w$ walk except for the last endpoint: over each edge among them there must be a strict decrease of $\varphi$, and a strictly descending chain in a finite ordered set can't be infinite, thus the algorithm will eventually halt. Meanwhile, since a repeated element can't appear in a strictly descending chain, and the last endpoint can't coincide with an earlier vertex (its $\mathrm{low} = d$ would have ended the loop there), the $u$-$w$ walk must be a path.
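
The backtracking path is a proof device rather than something the final algorithm runs, but it's cheap to materialize, which may help in seeing what it does. A hedged sketch under one assumption of mine: suppose `dfs_low` above also records `low_src[v]`, the endpoint $w$ of the edge $(v, w)$ whose value last strictly lowered $\mathrm{low}(v)$ (a child's $\mathrm{low}$ for a tree edge, or $d(w)$ for a back/cross edge). The walk is then:

```python
def backtracking_path(u, disc, low, low_src):
    path = [u]
    c = u
    while low[c] != disc[c]:   # the loop of step 2: stop once low(c) = d(c)
        c = low_src[c]         # the vertex that assigned low(c) to c
        path.append(c)
    return path                # a u-w path whose endpoint w has low(w) = d(w)
```

Note that `low_src[c]` is defined whenever `low[c] != disc[c]`, since `low[c]` must then have been lowered at least once.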

The idea of the backtracking algorithm is simple: iteratively backtrack to the vertex that propagated its lowest or discovery timestamp to the current vertex, whenever possible. And when the terminal of the backtracking is hit, we will find:

Lemma
The backtracking path algorithm yields a $u$-$w$ path $P$, such that $w$ is an ancestor of $u$, and thus strongly connected to $u$.

Proof. Assume the contrary, that $u$ is not a descendant of $w$. By the partially ordered set above we can see there's $\varphi(w) < \varphi(u)$, so $d(w) = \mathrm{low}(w) \le \mathrm{low}(u) \le d(u)$ and $w$ can't be a descendant of $u$ either. And the only possibility left is that $u$ and $w$ are in different subtrees.

Noted that $P$ can't contain a proper ancestor $a$ of $w$ (in particular, a common ancestor of $u$ and $w$), since such an $a$ implies $d(w) = \mathrm{low}(w) \le \mathrm{low}(a) \le d(a)$ (induced by $\mathrm{low}$ never increasing along $P$), however $d(a) < d(w)$ is expected of a proper ancestor.

In this way, the backtracking path must contain a cross edge $(x, y)$, such that $x$ is outside the subtree rooted at $w$ and $y$ is inside the subtree rooted at $w$. Such an edge must exist, since $u$ is outside the subtree rooted at $w$ while $w$ is inside; a tree or forward edge doing the crossing would make $x$ a proper ancestor of $w$, just excluded, while a back edge would put $x$, a descendant of $y$, inside the subtree rooted at $w$.

When exiting the subtree rooted at $w$, $y$ will be assigned to the SCC with $w$ as the representative, if it hasn't been assigned even earlier. But consider when $(x, y)$ is being checked, which happens only after the subtree rooted at $w$ has been exited (since $d(x) > d(y) \ge d(w)$ while $x$ lies outside that subtree): in order to assign $d(y)$ to $\mathrm{low}(x)$, it's required for $y$ not to be assigned to an SCC, which is a contradiction to $P$ being a valid backtracking path.

Lemma
Under the SCC algorithm's presumption, when exiting the subtree rooted at $v$ such that $\mathrm{low}(v) = d(v)$, if $u$ is in the subtree but not assigned to an SCC yet, then $u$ and $v$ are strongly connected.

Proof. Run the backtracking path algorithm on $u$ to obtain a $u$-$w$ path $P$ with $\mathrm{low}(w) = d(w)$. Noted that there's also $\mathrm{low}(v) = d(v)$, and both $v$ and $w$ are ancestors of $u$; we just need to compare their ancestorhood relation.

If $w$ is a proper descendant of $v$, then upon exiting the subtree rooted at $w$, which happens before exiting the subtree rooted at $v$, $u$ will be assigned to the SCC with $w$ as the representative, which is a contradiction to $u$ not having been assigned to an SCC when exiting the subtree rooted at $v$.

If $v$ is a proper descendant of $w$, take any edge $(c, c')$ in $P$ such that $c$ is a descendant of $v$ (to begin with, $u$ itself is one); there're two possibilities:

  1. When $(c, c')$ is a tree edge, $c'$ is a descendant of $c$, and thus a descendant of $v$.
  2. When $(c, c')$ is a back edge or a cross edge, $\mathrm{low}(c) = d(c')$. Since $\mathrm{low}(c)$ will be propagated to $v$, we can see $\mathrm{low}(v) \le d(c')$.
    Thus $d(v) = \mathrm{low}(v) \le d(c')$, and since $c'$ is discovered while $v$ is still active, $c'$ is a descendant of $v$.

Therefore, no matter which case, we always have $c'$ as a descendant of $v$. By iteratively reasoning on every edge in $P$, we will end up with $w$ being a descendant of $v$, which is a contradiction.

So the only possibility is $w = v$, and thus $u$ and $v$ must be strongly connected.

Now comes the time to solve the jigsaw puzzle. By putting all the pieces above together, we eventually have:

Theorem
Any algorithm designed under the SCC algorithm's presumption assigns every vertex to its SCC correctly when it halts.

Proof. In a DFS forest of a digraph, consider any SCC $C$ represented by $r$, i.e. with $r$ being the first-discovered vertex of $C$; we argue that a vertex $u$ will be assigned to the SCC represented by $r$ iff $u \in C$.

If $u$ is assigned to the SCC represented by $r$, then by the last lemma we know $u$ and $r$ are strongly connected, and thus $u \in C$.

If $u \in C$, it's under the subtree rooted at $r$, and will not be assigned to any SCC before exiting the subtree rooted at $r$, by the third piece. If $u$ is not assigned to the SCC represented by $r$, then it will be assigned to the SCC of some proper ancestor $a$ of $r$ (given that the algorithm must assign every vertex to an SCC); noted that $u$ is also a descendant of $a$, thus $a$ and $u$ are strongly connected and $a \in C$, but then $d(a) < d(r)$ contradicts $r$ being the first-discovered vertex of $C$. Thus any $u \in C$ must be assigned to the SCC represented by $r$, upon exiting the subtree rooted at $r$.

In this way, the correctness of the algorithm is guaranteed whenever it fulfills the SCC algorithm's presumption. Now we just need to implement it.

The SCC algorithm's presumption says, "assign all vertices that have not been assigned to an SCC, under the subtree rooted at $v$, to the SCC represented by $v$, when exiting the subtree rooted at $v$ such that $\mathrm{low}(v) = d(v)$". Although we could clearly do another pass of DFS at $v$ for this purpose, doing so would require allocating extra space for search coloring and timestamps dedicated to this pass of DFS, and would check as many edges as those incident to the unassigned vertices (cutting the search off whenever a vertex already assigned to an SCC is hit); it's actually quite uneconomical and unwise.

A wiser way is to introduce a stack $S$: whenever a vertex is visited, push it onto $S$; noted that for the subtree rooted at $v$, the elements of the subtree will sit on top of $v$ in $S$, ordered by their discovery time. When exiting a vertex $v$ such that $\mathrm{low}(v) = d(v)$, pop vertices from $S$ and assign them to the SCC represented by $v$, until $v$ itself is found. The SCC algorithm's presumption guarantees that when exiting the subtree rooted at $v$ such that $\mathrm{low}(v) = d(v)$, all vertices in the subtree belonging to an SCC other than the one represented by $v$ will have been assigned to their own SCCs, so they must have been popped from the stack before the procedure of popping and assigning relative to $v$ commences. In this way, introducing the stack will not undermine our requirement of assigning vertices to their own SCCs correctly.

The stack will occupy only up to $O(|V|)$ space, but eliminates the need to allocate extra book-keeping space for additional passes of DFS, and the time wasted on re-checking edges incident to unassigned vertices. So it's favourable in our implementation.

In the end, we can assemble all of the ideas above into Tarjan's SCC algorithm:

Algorithm (Tarjan's SCC Algorithm)
The input is a digraph $D = (V, E)$; the output is a mapping $\mathrm{scc}$ from each vertex to the representative of its SCC, such that $\mathrm{scc}(u) = \mathrm{scc}(v)$ iff $u$ and $v$ are strongly connected:

  1. Initialize an empty stack $S$, the map $\mathrm{scc}$ with every vertex unassigned, and an empty map $\mathrm{low}$.
  2. Perform the framework of DFS with SCT on $D$, with some tweaks:
    1. When entering a vertex $v$, push $v$ onto $S$, update $\mathrm{low}(v) \leftarrow d(v)$.
    2. When checking a back edge $(v, w)$, update $\mathrm{low}(v) \leftarrow \min(\mathrm{low}(v), d(w))$.
    3. When checking a cross edge $(v, w)$ such that $w$ is unassigned in $\mathrm{scc}$, update $\mathrm{low}(v) \leftarrow \min(\mathrm{low}(v), d(w))$.
    4. When returning from the subtree rooted at a direct child $w$ of $v$ (through a tree edge), update $\mathrm{low}(v) \leftarrow \min(\mathrm{low}(v), \mathrm{low}(w))$.
    5. When exiting a subtree rooted at $v$ such that $\mathrm{low}(v) = d(v)$:
      1. Pop $w$ from $S$ and assign $\mathrm{scc}(w) \leftarrow v$, until $w = v$ (it's a do-while structure: assign first, and then check).
  3. Halt with $\mathrm{scc}$ as the result.
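
Assembling all the tweaks into a single runnable sketch (recursive form; the variable names are mine, and the `scc[w] == -1` test stands in for "$w$ has not been assigned to an SCC"):

```python
def tarjan_scc(adj):
    n = len(adj)
    disc = [-1] * n      # discovery timestamps d(v); -1 = undiscovered
    low = [0] * n        # lowest timestamps low(v)
    scc = [-1] * n       # scc[v] = representative of v's SCC; -1 = unassigned
    stack = []           # the stack S: current search's vertices, by discovery time
    timer = [0]

    def dfs(v):
        disc[v] = low[v] = timer[0]              # step 2.1
        timer[0] += 1
        stack.append(v)                          # step 2.1: push v onto S
        for w in adj[v]:
            if disc[w] == -1:                    # tree edge
                dfs(w)
                low[v] = min(low[v], low[w])     # step 2.4
            elif scc[w] == -1:                   # back edge, or cross edge to an
                low[v] = min(low[v], disc[w])    # unassigned vertex (steps 2.2/2.3)
        if low[v] == disc[v]:                    # step 2.5: v represents an SCC
            while True:                          # do-while: assign, then check
                w = stack.pop()
                scc[w] = v
                if w == v:
                    break

    for s in range(n):                           # one pass per search tree
        if disc[s] == -1:
            dfs(s)
    return scc

# Example: {0, 1, 2} form one SCC, {3, 4} another, {5} a trivial one.
adj = [[1], [2], [0, 3], [4], [3, 5], []]
print(tarjan_scc(adj))   # prints [0, 0, 0, 3, 3, 5]
```

For large inputs, Python's recursion limit makes an explicit-stack (iterative) DFS preferable, but the recursive form mirrors the numbered steps most directly.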
Theorem
Tarjan's SCC algorithm runs in time complexity $O(|V| + |E|)$ and space complexity $O(|V|)$.

Proof. For the time complexity, performing a DFS takes $O(|V| + |E|)$; pushing vertices onto the stack accumulates to $O(|V|)$; the popping-and-assigning iterations at step 2.5.1 accumulate to $O(|V|)$; book-keeping the lowest timestamps takes $O(1)$ per edge and thus accumulates to $O(|E|)$. Thus the eventual time complexity is $O(|V| + |E|)$.

For the space complexity, the stack, the mapping and the timestamps take $O(|V|)$, and performing a DFS takes $O(|V|)$, thus the eventual space complexity is $O(|V|)$.

Interestingly, both Kosaraju's and Tarjan's SCC algorithms use a stack and run in linear time, proportional to the numbers of vertices and edges.

However, Tarjan's SCC algorithm eliminates the need to evaluate the transposition of the input digraph. Meanwhile, in Kosaraju's SCC algorithm, assigning vertices to their own SCCs requires a pass of DFS at each representative in the transposition, while in Tarjan's algorithm it suffices to pop vertices from a stack and assign them, until the representative of the SCC being assigned is hit.

In this way, Tarjan's SCC algorithm is arguably faster and more space-efficient than Kosaraju's SCC algorithm, although its principle is far harder to elaborate, as seen from above. In my opinion, both of them are worth studying, to see how different minds may clash in solving the SCC problem.

October 6, 2024