The title is intriguing, and the project behind it is a fun one. Coding a custom search engine as a side project will extend your knowledge of the field. As a computer science student, I had always wondered how a search engine works: how does it learn about a newly launched web resource, how does it know whether a website is up or down, and why does a new website take time to show up in search results? There are many questions, and the answers all lie in how a search engine is built. This is the first article in a series on search engines that will cover the basics, the modules of a search engine, the development of each module with code and results, and the integration of the modules into one framework.
We all use search engines daily, with queries like:
OK Google, when is the cab going to arrive? When is my birthday?
Alexa, will it rain today?
Search engines like Google and DuckDuckGo do the same things but differ ideologically. So let’s get down to the basics of search engines and get our hands dirty coding a basic one.
A search engine is a complex piece of software that does many things, such as searching for the answer to the question a user asked and deciding which websites appear in the first 10 results. To explain how it works, we will first look at the different components of a search engine:
- crawling
- parsing
- indexing
These three processes team up to make a search engine and keep it up and running. Let’s go through these components one by one.
Crawling is the process that feeds the search engine with the data it needs. A crawler first visits a page over the internet, downloads its information, and saves it to the search engine's database. Some of the information a crawler looks for is:
- title of the website
- domain information
- other links on the website
The software that does the crawling is called a spider or crawler. Before a crawler visits a website, it first checks for a file named robots.txt on the server where the website is hosted. This file contains rules for the crawler: it tells the crawler what to crawl and what not to crawl. After the crawler has read the rules, it crawls the index page of the website and then jumps to the next website. These crawlers can’t scan an entire website in one go, so they mostly stop after scanning the first page.
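To make the robots.txt check described above concrete, here is a minimal sketch using Python's standard `urllib.robotparser` module. The rules in `SAMPLE_ROBOTS_TXT`, the `MyCrawler` user agent, and the `example.com` URLs are made up for illustration; a real crawler would download the file from the site it is about to visit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written here only for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set the crawler can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(rules: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return rules.can_fetch(user_agent, url)


rules = build_rules(SAMPLE_ROBOTS_TXT)
print(allowed(rules, "MyCrawler", "https://example.com/index.html"))
print(allowed(rules, "MyCrawler", "https://example.com/private/data.html"))
```

With these sample rules, the index page may be crawled while anything under `/private/` is off limits, so a polite crawler would skip the second URL.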
Now let's look at the next process, parsing.
Parsing is the process in which a webpage is scanned for the content inside it.
So how do we do it?
We do it by extracting features from the webpage and storing them under a handle, i.e., the web URL. Some of the features of a website are:
- website domain name
- website title
- frequently used tokens (words)
- headings on the page
- links to other websites
- links to other pages of the site
- text marked as bold
- thumbnails of images on the website
These features are stored in a database. After they are stored, indexing is done based on the indexing criteria, which are discussed in the next section.
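As a rough sketch of the feature extraction described above, the snippet below pulls a few of the listed features out of a page using only Python's standard library. The `FeatureExtractor` class, the `extract_features` function, the sample HTML, and the URLs are all hypothetical names and data chosen for this example. It skips image thumbnails and keeps the result in a plain dictionary instead of a database.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class FeatureExtractor(HTMLParser):
    """Collects title, headings, bold text, links, and word counts."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    BOLD = {"b", "strong"}
    SKIP = {"script", "style"}  # their text is code, not page content

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings, self.bold, self.links = [], [], []
        self.words = Counter()
        self._open = []  # tracked tags we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS | self.BOLD | self.SKIP | {"title"}:
            self._open.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self._open:
            self._open.remove(tag)

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in self.SKIP for t in self._open):
            return
        if "title" in self._open:
            self.title += text
        if any(t in self.HEADINGS for t in self._open):
            self.headings.append(text)
        if any(t in self.BOLD for t in self._open):
            self.bold.append(text)
        self.words.update(re.findall(r"[a-z0-9]+", text.lower()))


def extract_features(url: str, html: str) -> dict:
    """Return the page's features, keyed so the URL can act as the handle."""
    parser = FeatureExtractor()
    parser.feed(html)
    domain = urlparse(url).netloc
    absolute = [urljoin(url, href) for href in parser.links]
    return {
        "url": url,
        "domain": domain,
        "title": parser.title,
        "headings": parser.headings,
        "bold": parser.bold,
        "internal_links": [u for u in absolute if urlparse(u).netloc == domain],
        "external_links": [u for u in absolute if urlparse(u).netloc != domain],
        "top_tokens": [word for word, _ in parser.words.most_common(5)],
    }


SAMPLE_HTML = """
<html><head><title>Search Basics</title></head>
<body><h1>Crawling</h1>
<p>A <b>crawler</b> visits pages. <a href="/about">About</a>
<a href="https://other.example.org/">Other</a></p></body></html>
"""

features = extract_features("https://example.com/index.html", SAMPLE_HTML)
print(features["title"], features["internal_links"], features["external_links"])
```

Resolving every link against the page URL first makes it easy to tell links to other pages of the same site apart from links to other websites by comparing domains.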
Indexing is the process of taking the data of the crawled websites and arranging it in an order that lets us retrieve the required information later. It is described in Figure 1 below.
In Figure 1, text acquisition is done in the parsing phase, and the data is stored in a document store. Text transformation then cleans the text, and the cleaned text is passed on for index creation, which is the final step.
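To give a feel for the transformation and index-creation steps, here is a small sketch. It assumes an inverted index (a mapping from each token to the URLs that contain it), which is one common arrangement and not necessarily the one this series will end up using. The stop-word list, the `transform`, `build_index`, and `lookup` functions, and the sample document store are all made up for illustration.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def transform(text: str) -> list:
    """Text transformation: lowercase, split into tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def build_index(document_store: dict) -> dict:
    """Index creation: map every token to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in document_store.items():
        for token in transform(text):
            index[token].add(url)
    return index


def lookup(index: dict, query: str) -> set:
    """Return the URLs that contain every token of the query."""
    tokens = transform(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical document store: URL -> text acquired during parsing.
document_store = {
    "https://example.com/a": "The crawler visits a page",
    "https://example.com/b": "The parser scans the page",
}

index = build_index(document_store)
print(lookup(index, "crawler page"))
```

Because the index is keyed by token, answering a query only requires looking up each query word and intersecting the resulting sets, without rescanning the stored documents.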
I hope you have understood the basics of search engines.
In my next article, I’ll explain each module in detail. Other related articles can be accessed from the list below: