Kruskal’s Minimum Spanning Tree Algorithm( Detection of cycles using node, subtree, element count logic)
About the problem
The problem statement is for any given undirected graph finding a minimum cost spanning tree of n-1 edges(obvious for an n-vertice graph) where there is a price tag on each edge. So there can be possibly many sets of edges that you can choose. This means that if there are m edges given for an n-vertice graph, there are (m C n-1) sets of edges that can claim to be a graph and they all will be related to a cost which is the sum of its contributing edges.
But according to the problem statement, we must find a spanning tree. A spanning tree is a set of n-1 edges such that, between any two vertices in the graph, there must exist a path constituted from the edges.
Now we have all these sets of spanning trees, but what we are interested in is the one with the minimum cost. There is another algorithm, i.e. Prim’s algorithm which is easy to code w.r.t. Kruskal as the algorithm has some complications while handling the graph data structures.
About the algorithm
Below is the Kruskal Minimum Spanning Tree Algorithm:
- Sort the edges in increasing order of their cost
- Create a new graph of no edges
- Select the minimum cost edge and add it to the new graph if the graph doesn’t constitute a cycle
- Repeat from step 3 until you find you find n-1 edges
Note that if edges exhaust in the middle of the algorithm, there must be some error with the structure of the initial graph itself. The graph must provide initial connectivity to all the vertices to find some solution to the problem. Because in the problem we are not only generating the minimum spanning tree, we are selecting our solution from all the other possible solutions available in the graph.
In theoretical computer science, it is classified as a greedy algorithm. As it selects the minimum edge possible. Although being a greedy algorithm, it not only claims to provide the optimum solution but actually provides it. For a proper argument of the proof, refer to the book CLRS.
Implementation
So let us talk about the implementation of the algorithm. Here, I will explain the code in C++ and explain more about the data and processes.
Actual Implementation
Although we have talked about the fancy insights of the process, but not talked about step 3. So how do we detect there is a cycle in a graph. A common intuition will come from a naive’s approach which is to run a BFS or DFS on the graph and check for appropriate conditions. For short graphs, this process may be not time-consuming but for large networks, it can be very time-consuming. But time can be improved if we try to understand the problem carefully.
Restating the cycle detection problem, if we are able to tell somehow that vertice A and B are present within the same component, we can say that connecting A and B will definitely form a cycle. For this purpose, we will try to assign some kind of unique id to each of the components.
So, the algorithm requires us to provide two utilities, the first one is find(). A function that will return a unique ID of the component in which vertex A is present. The other one is the set_union(), this function merges two separate trees to form a single tree i.e., to create a unique ID for each vertice in the finalized component.
So as we have mentioned if we follow such a design structure we will achieve required data systems. But for union_find() will take too long for recolouring some large component of the graph. Although applying this type of system is fine and will eventually achieve the purpose, but we will do it in a simpler and illustrative way.
Programming Infrastructure
So, let me introduce you to our node which will serve the purpose of label distinction in our process.
typedef struct node { bool root_status = true; int elcount = 1; struct node* next = NULL; int uniqid = 0;}node;
So we maintain an array of n nodes of the above type. Initially, we can assume that every single vertice represents a single tree in itself. For any tree, the root will provide us with a unique ID for a component. If any node is a root node we will turn the root_status variable on, elsewise turnoff. For those nodes elements, who are roots there is another variable elcount that simply keeps tracks of the total number of nodes in the process. See there is a pointer to another node next that represents a link to the element on its top. For the sake of find function, this link is very important to reach up to the root node across a component. In the end, just make sure you provide some unique ID for all the nodes in the starting, it can be random numbers or simply the indices of the nodes itself, whatever just make sure it is unique.
Now let us define the find function as below:
node* find(int a) { node* x = &node_array[a]; while (x->next) { x = x->next; } return x;}
This function returns a pointer to the root node, such that other set_union operations can be implemented if required.
The next part is the set_union function which helps in querying us over all the edges in ascending order.
void union_ab(node* a, node* b) { if (a->elcount < b->elcount) { swap(a, b); } // assume a.elcount > b.elcount, if not read above line b->root_status = false; b->next = a; a->elcount += b->elcount;}
The function changes the root_status of the tree with less number of nodes to false and updates the elcount of the larger node. Also connecting the root of the smaller tree to the root of the larger tree.
I am attaching an awesome problem related to finding the minimum cost spanning tree.
Try the implementation, read the formal proof argument from the textbooks and don’t forget to code it;
Following is the github repo for my C++ kruskal implementation.
Thank you for reading this article and I hope this helps otherwise Google and Youtube are there for you always.
Happy Learning (: — —— :)
/* Ping me up for any queries.*/