This article, Design a Crawler, covers the first phase of a search engine: the crawler. My last article gave an overview of the search engine and how it works. Here I will describe the main functions of the crawler and how it works.

Crawler Introduction

A web crawler reads the content of a web page and follows the links on it to reach other pages. It is also known as a spider or a bot. While scanning a site, the crawler collects information such as the domain of the website, the page URL, and the links on the page. It takes the first page of the site as a seed page, which directs the crawler to all the different web pages linked to it.

Here is the algorithm that a web crawler follows:

Algorithm Basic-Crawler
Input: the URL of the seed page
Output: links stored in storage

  1. Start with the URL of the seed page.
  2. Scan the page and extract all the links on it.
  3. Fetch the next link.
  4. If the link has already been visited, skip it and fetch the next link.
  5. Otherwise, follow the link and store it in the storage as visited.
  6. Repeat from step 3 until all the links have been visited.
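The steps above can be sketched in Python as follows. This is only a minimal illustration, not the crawler that the next article will build. The names LinkExtractor, basic_crawler, fetch_page, max_pages, and FAKE_WEB are assumptions made for this example, and the pages come from a small in-memory dictionary instead of the real web so that the code runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def basic_crawler(seed_url, fetch_page, max_pages=50):
    """Crawl breadth-first from seed_url and return the visited URLs in order.

    fetch_page is a hypothetical callable: it takes a URL and returns HTML text.
    """
    visited = []                 # the "storage" from the algorithm
    seen = {seed_url}            # URLs that are already queued or visited
    queue = deque([seed_url])

    while queue and len(visited) < max_pages:
        url = queue.popleft()              # fetch the next link
        html = fetch_page(url)
        visited.append(url)

        parser = LinkExtractor()           # extract all links on the page
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:       # skip links we already know
                seen.add(absolute)
                queue.append(absolute)     # remember unvisited links
    return visited


# A tiny in-memory "web" (made up for this example).
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">Home</a>',
    "http://example.test/b": '<a href="/a">A</a>',
}

crawled = basic_crawler("http://example.test/", lambda url: FAKE_WEB.get(url, ""))
print(crawled)
```

Each page is visited once even though the pages link back to each other, because the seen set filters out links that were already queued. To crawl real pages, fetch_page would have to download the HTML, for example with the urllib module.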

The goal of this article is to get you started with the basics of Python programming, which we will need to develop a basic crawler.

Getting started with Python Programming

We will get you started with Python programming using only your web browser. You won't need any additional program to run Python code. To get started, follow these steps:

Figure 1: Google Colab
  1. Open Google Colab in your web browser and sign in with your Google username and password.
  2. Click the “NEW NOTEBOOK” link in the bottom right corner of the screen.
  3. Rename the notebook to “Crawler.ipynb”. To do this, click the name “Untitled0.ipynb” in the top left corner of the screen, as shown in Figure 2.
  4. Start typing your code in the code area in the middle of the page.
Figure 2: Crawler.ipynb
We will learn about the following topics:
  • Output
  • Input
  • Basic data structure: list
  • urllib module

Type the following code in the code block, as shown in Figure 3:
# For output we use the print function
print("Hello World!!")
# Define a variable
# varname = value
var_int = 10
var_float = 1.24
var_str = "Hello World!!"
# How do we find out the data type of a variable? We will use the type() function.
print(type(var_int))
print("Var_int=",var_int)
Figure 3:

To execute the code, press the play button. The output appears just below the code section.

In Python, comments start with the # character. Placing # at the start of a line turns the whole line into a comment. A comment is text that the Python interpreter does not execute. Check line number 2 in Figure 3.

print("Hello World!!") # prints the string placed inside the quotes ("") in the print function

To define a variable, use the following syntax. Check line numbers 5, 6, and 7 in Figure 3.

varname = value

Unlike statically typed languages such as C or Java, Python does not require an explicit data type declaration for variables. A variable takes the data type of the value stored in it.

As you can see in Figure 3, var_int, var_float, and var_str are variables of data type integer, float, and string, respectively.
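To see that the type belongs to the value and not to the variable name, here is a small sketch in which one variable is assigned three different values in turn. The names value, first_type, second_type, and third_type are made up for this example.

```python
value = 10                           # the variable now holds an integer
first_type = type(value).__name__

value = "ten"                        # the same name now holds a string
second_type = type(value).__name__

value = 1.24                         # and now a float
third_type = type(value).__name__

print(first_type, second_type, third_type)
```

The same name reports a different type after each assignment, which is why no type declaration is needed.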

If you want to check the data type, use the function type(variable name). Check the second output.

print(type(var_int)) outputs the data type as <class 'int'>, which means it is an integer data type.

We can print more than one item with a single print call by separating the items with commas. Check line number 10 in Figure 3.
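As a small illustration, print places a space between the items by default. The sep argument and the f-string below are two other ways to control the output. They are not part of Figure 3, and the variable message is a name chosen for this sketch.

```python
var_int = 10
var_float = 1.24

# Items separated by commas are printed with a space between them.
print("Var_int=", var_int)

# sep changes the separator; here it removes the space.
print("Var_int=", var_int, sep="")

# An f-string builds the whole text first, then prints it.
message = f"var_int={var_int}, var_float={var_float}"
print(message)
```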

# Input: for input we use the input function, which returns a string
var = input("Enter a number:")
print(type(var))
# We use type conversion to turn the string into an integer with int(); other conversion functions are float() and str()
var_int = int(var)
print(type(var_int))
Figure 4

Now I will explain input statements.

The input statement takes input from the keyboard and saves it in a variable. Check line number 2 in Figure 4.

The input function returns the value as a string, so we need to convert it from a string to the type we require. In this case we enter a number, so we convert it to an integer using the int(variablename) function.

Check line number 3 in Figure 4 for the return value of the input function.

Check line numbers 5 and 6 in Figure 4 for the conversion from string to integer and the type check after conversion.
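One thing to keep in mind is that int() raises a ValueError when the text is not a whole number, for example when the user types letters. The sketch below shows one possible way to guard against that. The helper to_int is a hypothetical name introduced here, and it is given fixed strings instead of calling input() so that it runs without a keyboard.

```python
def to_int(text):
    """Return text converted to an integer, or None if it is not a whole number."""
    try:
        return int(text.strip())
    except ValueError:
        return None


print(to_int("42"))    # a valid whole number
print(to_int(" 7 "))   # surrounding spaces are ignored
print(to_int("4.5"))   # not a whole number, so None
print(to_int("abc"))   # not a number at all, so None
```

In the notebook you could call it as to_int(input("Enter a number:")) and ask the user again whenever the result is None.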

We will now look at the next part, i.e., the basic data structure: list.

A list is a data structure that can hold values of different data types at the same time. Check Figure 5.

Figure 5

As demonstrated in line numbers 6 and 8, a list can be created from previously defined variables or from any values we want.

Line numbers 7 and 9 print the type of var_list and the contents of the list.

Note: since var_list is a variable, its contents change whenever a new value is assigned to it.
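Since Figure 5 is an image, here is a short sketch of the same ideas in text form. It is not the exact code from the figure: the values in the second list and the appended URL are made up for this example.

```python
var_int = 10
var_float = 1.24
var_str = "Hello World!!"

# A list built from previously defined variables of different types.
var_list = [var_int, var_float, var_str]
print(type(var_list))
print(var_list)

# Assigning a new list to the same variable replaces the old contents.
var_list = [1, 2, "three", 4.0]
print(var_list)

# append adds one item to the end; a crawler can collect links this way.
var_list.append("http://example.test/")
print(len(var_list))
print(var_list[0])
```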

In my next article, I will explain the required modules and the code of the crawler, based on the algorithm described above. I will also prepare a YouTube video to walk through the material. Previous related articles can be accessed from here:

References