Intro
So you want to make a web crawler from scratch. First, we will go through the different components of a web crawler and then look at the logistics of writing the code. Let's get started!
The algorithm
Web crawling can be seen as a traversal of all the web pages on the internet. We know about two graph traversal methods, BFS & DFS. If you choose DFS, you can keep going deeper and deeper down one chain of links and never finish your search, so it's probably not what we want here. What we will implement is BFS, i.e., breadth-first search.
In breadth-first search, here is a list of steps that you will perform:
Pop the next link from the main queue and visit that website
Extract all the links present on that page
Push the links you have not visited yet into the main queue
Repeat until the queue is empty (a minimal sketch of this loop follows below).
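Here is a minimal sketch of that loop in C++. The download_page and extract_links names are placeholders for the downloading and link-extraction steps covered in the sections below; the stubs here only exist to make the sketch compile.

```cpp
#include <queue>
#include <string>
#include <unordered_set>
#include <vector>

// Placeholder stubs: the real versions are what the later sections are about.
std::string download_page(const std::string& url) { return ""; }
std::vector<std::string> extract_links(const std::string& html) { return {}; }

// Breadth-first crawl starting from a single seed link.
void crawl(const std::string& seed) {
    std::queue<std::string> frontier;        // the main queue of links to visit
    std::unordered_set<std::string> visited; // so we never crawl the same page twice

    frontier.push(seed);
    visited.insert(seed);

    while (!frontier.empty()) {
        std::string url = frontier.front();
        frontier.pop();

        std::string html = download_page(url);
        for (const std::string& link : extract_links(html)) {
            if (visited.insert(link).second) { // .second is true only for new links
                frontier.push(link);
            }
        }
    }
}

int main() { crawl("https://example.com"); }
```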
Purpose of web crawling
So we are downloading a lot of websites, but why are we doing this? Well, there are several uses:
Data mining
Web indexing
Web analysis
Web scraping
Web ranking
Downloading a website
Once you have a link in your crawler, the next thing you want to do is download the page it points to. You can either write your own downloader or use an existing library in your language of choice. For example, if you are on Linux, you can use wget. Problems that you might face in this step:
The server may not respond (a time issue)
The website may take too long to download (a time issue)
You might end up downloading a PDF or a video (a size issue)
If you are using a library to download websites, make sure you can customize the wait time and the maximum file size. If you are implementing the crawler in C++, you can use libcURL, as in the sketch below.
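Here is a minimal sketch of such a downloader using libcURL. The 10-second timeout and 1 MB size limit are arbitrary values you would tune for your own crawler, and write_chunk is just a helper name chosen for this example.

```cpp
#include <curl/curl.h>
#include <iostream>
#include <string>

// libcurl calls this for each chunk of the response body; we append it to a std::string.
static size_t write_chunk(char* data, size_t size, size_t nmemb, void* userdata) {
    auto* out = static_cast<std::string*>(userdata);
    out->append(data, size * nmemb);
    return size * nmemb;
}

// Returns the page body, or an empty string on failure.
std::string download_page(const std::string& url) {
    std::string body;
    CURL* curl = curl_easy_init();
    if (!curl) return body;

    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);         // follow redirects
    curl_easy_setopt(curl, CURLOPT_TIMEOUT, 10L);               // give up after 10 seconds
    curl_easy_setopt(curl, CURLOPT_MAXFILESIZE, 1024L * 1024L); // skip files over 1 MB
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_chunk);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);

    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK) {
        std::cerr << "Failed to download " << url << ": " << curl_easy_strerror(res) << "\n";
        body.clear();
    }
    curl_easy_cleanup(curl);
    return body;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    std::cout << download_page("https://example.com").size() << " bytes downloaded\n";
    curl_global_cleanup();
}
```

Note that CURLOPT_MAXFILESIZE only helps when the server reports the content length up front; for a stricter limit you can also abort inside the write callback once too much data has arrived.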
Processing data
Once you can download HTML files, you want to process that data. One thing you definitely need is to extract the links from each HTML file. You might also want to extract text, images, or some other special kind of data for other purposes. You can do that with web scraping techniques: parse the downloaded HTML file and process whatever data you need.
Extracting the links
To extract the links from the HTML file, you can use a regex library or tool, as in the sketch below. Keep in mind that not all links in the HTML file will be usable: you have to ignore links that only navigate within the page (fragments starting with #), and there are other bad links, such as mailto: or javascript: links, that you also have to ignore. This step is crucial, so you want to spend a good amount of time here.
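As a starting point, here is a sketch that pulls href attributes out of raw HTML with std::regex and drops the obvious junk. A regex will not handle every corner of real-world HTML, and relative links (like /about) would still need to be resolved against the page's base URL, but it is enough to get a first version running.

```cpp
#include <iostream>
#include <regex>
#include <string>
#include <vector>

// Extract href="..." values from raw HTML and drop in-page or unwanted links.
std::vector<std::string> extract_links(const std::string& html) {
    static const std::regex href_re(R"rgx(href\s*=\s*"([^"]*)")rgx", std::regex::icase);
    std::vector<std::string> links;

    for (auto it = std::sregex_iterator(html.begin(), html.end(), href_re);
         it != std::sregex_iterator(); ++it) {
        std::string link = (*it)[1].str();

        if (link.empty() || link[0] == '#') continue;    // in-page navigation
        if (link.rfind("mailto:", 0) == 0) continue;     // not a web page
        if (link.rfind("javascript:", 0) == 0) continue; // not a real link

        links.push_back(link);
    }
    return links;
}

int main() {
    std::string html = R"(<a href="/about">About</a> <a href="#top">Top</a>)";
    for (const auto& link : extract_links(html)) std::cout << link << "\n"; // prints: /about
}
```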
Restarting the process
Once you have finished extracting the links, you repeat this process with the next link in the queue. You can print the progress of your crawler to the terminal or a UI for your own convenience. Also, make sure you delete the temporary data you are using, as in the sketch below.
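For example, a small end-of-iteration helper along these lines can report progress and clean up. The function name and the idea of keeping each downloaded page in a temporary file are assumptions for this sketch, not something your crawler has to do.

```cpp
#include <cstdio>
#include <filesystem>
#include <string>
#include <system_error>

// Hypothetical end-of-iteration bookkeeping: report progress and remove the
// temporary file this page was downloaded to (if you keep one on disk).
void finish_iteration(std::size_t pages_done, std::size_t queue_size,
                      const std::string& temp_html_path) {
    std::printf("crawled %zu pages, %zu links waiting in the queue\n",
                pages_done, queue_size);

    std::error_code ec;
    std::filesystem::remove(temp_html_path, ec); // no-op if the file does not exist
    if (ec) {
        std::printf("could not delete %s: %s\n",
                    temp_html_path.c_str(), ec.message().c_str());
    }
}
```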
Making a multithreaded one
If your computer has spare resources, you want to use them. That's why you would love a multithreaded implementation. First learn how to create and destroy threads, then decide on the main crawler loop that will launch and manage them. We followed one simple architecture: create a thread, pass it the website link, the timeout, and the max file size as arguments, and let it do the work. After finishing its task, the thread can clean itself up and send a signal to the mother thread. A sketch of this architecture follows below.
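Here is a rough sketch of that architecture with std::thread. The downloader stub stands in for the libcURL function from earlier (here assumed to take the timeout and size limit as parameters), and the "signal to the mother thread" is expressed the usual C++ way, through a mutex-protected queue and a condition variable, rather than a thread literally destroying itself.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Shared state the workers use to hand results back to the mother thread.
std::mutex results_mutex;
std::condition_variable results_cv;
std::queue<std::string> finished_pages;

// Stand-in for the real downloader (e.g. the libcURL version, extended with
// timeout and size-limit parameters). Replace with the real thing.
std::string download_page(const std::string& url, long timeout_s, long max_bytes) {
    (void)timeout_s; (void)max_bytes;
    return "<html>downloaded " + url + "</html>";
}

// One worker thread: do the work, then signal the mother thread.
void worker(std::string url, long timeout_s, long max_bytes) {
    std::string html = download_page(url, timeout_s, max_bytes);
    {
        std::lock_guard<std::mutex> lock(results_mutex);
        finished_pages.push(std::move(html));
    }
    results_cv.notify_one(); // the "I am done" signal
}

int main() {
    std::vector<std::string> batch = {"https://example.com", "https://example.org"};
    std::vector<std::thread> threads;

    for (const auto& url : batch) {
        threads.emplace_back(worker, url, 10L, 1024L * 1024L);
    }

    // The mother thread waits until every worker has reported back.
    std::size_t received = 0;
    while (received < batch.size()) {
        std::unique_lock<std::mutex> lock(results_mutex);
        results_cv.wait(lock, [] { return !finished_pages.empty(); });
        while (!finished_pages.empty()) {
            finished_pages.pop(); // here you would parse the page and queue new links
            ++received;
        }
    }

    for (auto& t : threads) t.join();
}
```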
Good luck with programming!
I hope you can finish your implementation and make an awesome spider. If you want to understand more about the topic, you are most welcome to visit: